The Stanford Natural Language Inference (SNLI) Corpus

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the task of determining the inference relation between two (short, ordered) texts: entailment, contradiction, or neutral (MacCartney and Manning 2008).

The Corpus

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. We aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). [pdf] [bib]

Here are a few example pairs taken from the development portion of the corpus. Each has the judgments of five Mechanical Turk workers and a consensus judgment.

Text | Judgments | Hypothesis
A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping
An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road.
A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport.
A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella.
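
The consensus judgment is a simple majority over the five annotator labels; pairs where no label wins at least three of the five votes are released with the placeholder gold label '-'. A minimal sketch of that 3-of-5 rule (the helper name is ours, not part of the distribution):

```python
from collections import Counter

def consensus(labels):
    """Return the majority label among the five annotator judgments,
    or '-' when no label reaches 3 of 5 votes."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= 3 else "-"

print(consensus(["N", "N", "E", "N", "N"]))  # -> "N" (the second pair above)
```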

The corpus is distributed as both JSON Lines and tab-separated value files, which are packaged together (with a readme) here:

Download: SNLI 1.0 (zip, ~100MB)

SNLI is archived at the NYU Faculty Digital Archive.
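
Each line of the JSONL files is a single example with fields including gold_label, sentence1 (the premise), and sentence2 (the hypothesis). A minimal loading sketch in Python, assuming the zip has been extracted into the snli_1.0/ directory it ships with:

```python
import json

def load_snli(path):
    """Yield (premise, hypothesis, label) triples, skipping pairs whose
    annotators reached no consensus (gold_label == '-')."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            if example["gold_label"] != "-":
                yield example["sentence1"], example["sentence2"], example["gold_label"]

dev = list(load_snli("snli_1.0/snli_1.0_dev.jsonl"))
print(len(dev), dev[0])
```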

Creative Commons License

The Stanford Natural Language Inference Corpus by The Stanford NLP Group is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at http://shannon.cs.illinois.edu/DenotationGraph/.

The corpus includes content from the Flickr30k corpus (also released under an Attribution-ShareAlike license), which can be cited by way of this paper:

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2: 67--78.

About 4k sentences in the training set have captionIDs and pairIDs beginning with 'vg_'. These come from a pilot data collection effort that used data from the Visual Genome corpus, which was still under construction at the time of SNLI's release. For more information on Visual Genome, see: https://visualgenome.org/.
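
If you want to restrict yourself to the Flickr30k-derived data, these pilot examples can be dropped by their ID prefix; a sketch, reusing the file layout assumed above:

```python
import json

with open("snli_1.0/snli_1.0_train.jsonl", encoding="utf-8") as f:
    train = [json.loads(line) for line in f]

# Pilot examples drawn from Visual Genome carry a 'vg_' pairID prefix.
flickr_only = [ex for ex in train if not ex["pairID"].startswith("vg_")]
```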

The hard subset of the test set used in Gururangan et al. 2018 is available in JSONL format here.

Dataset Card

If you are considering building applications on this data, see this dataset card created by the Hugging Face Datasets team for key information.

Published results

The following table reflects our informal attempt to catalog published 3-class classification results on the SNLI test set. We define sentence vector-based models as those that perform classification solely on the basis of a pair of fixed-size sentence representations computed independently of one another. Reported parameter counts do not include word embeddings. If you would like to add a paper that reports a number at or above the current state of the art, email Sam.
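
To make the distinction concrete, the sketch below shows one way such a model can be structured in PyTorch: the two sentences are encoded completely independently, and the classifier sees only the pair of fixed-size vectors. The [u; v; |u-v|; u*v] feature combination is common in this literature (e.g., Conneau et al. '17); the layer sizes and names here are illustrative, not those of any particular published system.

```python
import torch
import torch.nn as nn

class SentenceVectorNLI(nn.Module):
    """Encode premise and hypothesis independently; classification uses
    only the two fixed-size sentence vectors (no cross-sentence attention)."""

    def __init__(self, vocab_size, emb_dim=300, hidden=300, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        # Classify on the combined features [u; v; |u - v|; u * v].
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def encode(self, token_ids):
        _, (h_n, _) = self.encoder(self.embed(token_ids))
        return h_n[-1]  # final hidden state as the sentence vector

    def forward(self, premise_ids, hypothesis_ids):
        u = self.encode(premise_ids)
        v = self.encode(hypothesis_ids)
        features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.classifier(features)  # logits over the three labels
```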

Three-way classification

Publication | Model | Parameters | Train (% acc) | Test (% acc)
Feature-based models
Bowman et al. '15 | Unlexicalized features | – | 49.4 | 50.4
Bowman et al. '15 | + Unigram and bigram features | – | 99.7 | 78.2
Sentence vector-based models
Bowman et al. '15 | 100D LSTM encoders | 220k | 84.8 | 77.6
Bowman et al. '16 | 300D LSTM encoders | 3.0m | 83.9 | 80.6
Vendrov et al. '15 | 1024D GRU encoders w/ unsupervised 'skip-thoughts' pre-training | 15m | 98.8 | 81.4
Mou et al. '15 | 300D Tree-based CNN encoders | 3.5m | 83.3 | 82.1
Bowman et al. '16 | 300D SPINN-PI encoders | 3.7m | 89.2 | 83.2
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders | 2.0m | 86.4 | 83.3
Munkhdalai & Yu '16b | 300D NTI-SLSTM-LSTM encoders | 4.0m | 82.5 | 83.4
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention | 2.8m | 84.5 | 84.2
Conneau et al. '17 | 4096D BiLSTM with max-pooling | 40m | 85.6 | 84.5
Munkhdalai & Yu '16a | 300D NSE encoders | 3.0m | 86.2 | 84.6
Qian Chen et al. '17 | 600D (300+300) Deep Gated Attn. BiLSTM encoders (code) | 12m | 90.5 | 85.5
Tao Shen et al. '17 | 300D Directional self-attention network encoders (code) | 2.4m | 91.1 | 85.6
Jihun Choi et al. '17 | 300D Gumbel TreeLSTM encoders | 2.9m | 91.2 | 85.6
Nie and Bansal '17 | 300D Residual stacked encoders | 9.7m | 89.8 | 85.7
Anonymous '18 | 1200D REGMAPR (Base+Reg) | – | – | 85.9
Yi Tay et al. '18 | 300D CAFE (no cross-sentence attention) | 3.7m | 87.3 | 85.9
Jihun Choi et al. '17 | 600D Gumbel TreeLSTM encoders | 10m | 93.1 | 86.0
Nie and Bansal '17 | 600D Residual stacked encoders | 29m | 91.0 | 86.0
Tao Shen et al. '18 | 300D Reinforced Self-Attention Network | 3.1m | 92.6 | 86.3
Im and Cho '17 | Distance-based Self-Attention Network | 4.7m | 89.6 | 86.3
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network (encoder) | 5.6m | 91.4 | 86.5
Talman et al. '18 | 600D Hierarchical BiLSTM with Max Pooling (HBMP, code) | 22m | 89.9 | 86.6
Qian Chen et al. '18 | 600D BiLSTM with generalized pooling | 65m | 94.9 | 86.6
Kiela et al. '18 | 512D Dynamic Meta-Embeddings | 9m | 91.6 | 86.7
Deunsol Yoon et al. '18 | 600D Dynamic Self-Attention Model | 2.1m | 87.3 | 86.8
Deunsol Yoon et al. '18 | 2400D Multiple-Dynamic Self-Attention Model | 7.0m | 89.0 | 87.4
Other neural network models (usually with attention between text and hypothesis words)
Rocktäschel et al. '15 | 100D LSTMs w/ word-by-word attention | 250k | 85.3 | 83.5
Pengfei Liu et al. '16a | 100D DF-LSTM | 320k | 85.2 | 84.6
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention and symbolic preproc. | 2.8m | 85.9 | 85.0
Pengfei Liu et al. '16b | 50D stacked TC-LSTMs | 190k | 86.7 | 85.1
Munkhdalai & Yu '16a | 300D MMA-NSE encoders with attention | 3.2m | 86.9 | 85.4
Wang & Jiang '15 | 300D mLSTM word-by-word attention model | 1.9m | 92.0 | 86.1
Jianpeng Cheng et al. '16 | 300D LSTMN with deep attention fusion | 1.7m | 87.3 | 85.7
Jianpeng Cheng et al. '16 | 450D LSTMN with deep attention fusion | 3.4m | 88.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model | 380k | 89.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model with intra-sentence attention | 580k | 90.5 | 86.8
Munkhdalai & Yu '16b | 300D Full tree matching NTI-SLSTM-LSTM w/ global attention | 3.2m | 88.5 | 87.3
Zhiguo Wang et al. '17 | BiMPM | 1.6m | 90.9 | 87.5
Lei Sha et al. '16 | 300D re-read LSTM | 2.0m | 90.7 | 87.5
Yichen Gong et al. '17 | 448D Densely Interactive Inference Network (DIIN, code) | 4.4m | 91.2 | 88.0
McCann et al. '17 | Biattentive Classification Network + CoVe + Char | 22m | 88.5 | 88.1
Chuanqi Tan et al. '18 | 150D Multiway Attention Network | 14m | 94.5 | 88.3
Xiaodong Liu et al. '18 | Stochastic Answer Network | 3.5m | 93.3 | 88.5
Ghaeini et al. '18 | 450D DR-BiLSTM | 7.5m | 94.1 | 88.5
Yi Tay et al. '18 | 300D CAFE | 4.7m | 89.8 | 88.5
Qian Chen et al. '17 | KIM | 4.3m | 94.1 | 88.6
Qian Chen et al. '16 | 600D ESIM + 300D Syntactic TreeLSTM (code) | 7.7m | 93.5 | 88.6
Peters et al. '18 | ESIM + ELMo | 8.0m | 91.6 | 88.7
Boyuan Pan et al. '18 | 300D DMAN | 9.2m | 95.4 | 88.8
Zhiguo Wang et al. '17 | BiMPM Ensemble | 6.4m | 93.2 | 88.8
Yichen Gong et al. '17 | 448D Densely Interactive Inference Network (DIIN, code) Ensemble | 17m | 92.3 | 88.9
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network | 6.7m | 93.1 | 88.9
Qian Chen et al. '17 | KIM Ensemble | 43m | 93.6 | 89.1
Ghaeini et al. '18 | 450D DR-BiLSTM Ensemble | 45m | 94.8 | 89.3
Peters et al. '18 | ESIM + ELMo Ensemble | 40m | 92.1 | 89.3
Yi Tay et al. '18 | 300D CAFE Ensemble | 17.5m | 92.5 | 89.3
Chuanqi Tan et al. '18 | 150D Multiway Attention Network Ensemble | 58m | 95.5 | 89.4
Boyuan Pan et al. '18 | 300D DMAN Ensemble | 79m | 96.1 | 89.6
Radford et al. '18 | Fine-Tuned LM-Pretrained Transformer | 85m | 96.6 | 89.9
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network Ensemble | 53.3m | 95.0 | 90.1
Zhuosheng Zhang et al. '19a | SJRC (BERT-Large + SRL) | 308m | 95.7 | 91.3
Xiaodong Liu et al. '19 | MT-DNN | 330m | 97.2 | 91.6
Zhuosheng Zhang et al. '19b | SemBERT | 339m | 94.4 | 91.9
Pilault et al. '20 | CA-MTL | 340m | 92.6 | 92.1
Sun et al. '20 | RoBERTa-large + self-explaining layer | 355m+ | ? | 92.3
Wang et al. '21 | EFL (Entailment as Few-shot Learner) + RoBERTa-large | 355m | ? | 93.1

Related Resources

  • A spell-checked version of the test and development sets. (Warning: Results on these sets are not directly comparable to results on the regular dev and test sets, and will not be listed here.)
  • The Multi-Genre NLI (MultiNLI or MNLI) Corpus. The corpus is in the same format as SNLI and is comparable in size, but it includes a more diverse range of text styles and topics, as well as an auxiliary test set for cross-genre transfer evaluation.
  • The FraCaS test suite for natural language inference, in XML format
  • MedNLI: A Natural Language Inference Dataset For The Clinical Domain
  • XNLI: A Cross-Lingual Natural Language Inference Evaluation Set
  • e-SNLI: Explanation annotations over SNLI.

Contact Information

For any comments or questions, please email Sam, Gabor, and Chris.