The Stanford Natural Language Inference (SNLI) Corpus

The Corpus

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Here are a few example pairs taken from the development portion of the corpus. Each has the judgments of five Mechanical Turk annotators and a consensus judgment.

Text | Judgments | Hypothesis
A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping.
An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road.
A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport.
A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella.
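The consensus judgment is the label chosen by a majority of the five annotators; when no label receives at least three votes, the pair carries the placeholder label '-' in the distributed files. A minimal Python sketch of this voting rule:

    from collections import Counter

    def consensus(labels):
        """Majority vote over annotator labels; '-' if no label gets >= 3 of 5 votes."""
        label, count = Counter(labels).most_common(1)[0]
        return label if count >= 3 else "-"

    print(consensus(["neutral", "neutral", "entailment", "neutral", "neutral"]))  # neutral
    print(consensus(["neutral", "contradiction", "entailment", "neutral", "contradiction"]))  # -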

The corpus is distributed in both JSON Lines and tab-separated value formats, which are packaged together (with a readme) here:

Download: SNLI 1.0 (zip, ~100MB)
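For a quick start, here is a minimal Python sketch for reading one of the JSON Lines files. It assumes the field names gold_label, sentence1, and sentence2 documented in the readme, and the file layout of the distributed zip; it skips pairs whose gold label is the no-consensus placeholder '-':

    import json

    def load_snli(path):
        """Yield (premise, hypothesis, label) triples from an SNLI .jsonl file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                if ex["gold_label"] == "-":
                    continue  # no annotator consensus; conventionally excluded
                yield ex["sentence1"], ex["sentence2"], ex["gold_label"]

    pairs = list(load_snli("snli_1.0/snli_1.0_dev.jsonl"))
    print(len(pairs), pairs[0])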
The Stanford Natural Language Inference Corpus by The Stanford NLP Group is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at http://shannon.cs.illinois.edu/DenotationGraph/.

The corpus includes content from the Flickr 30k corpus (also released under an Attribution-ShareAlike license), which can be cited by way of this paper:

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2: 67--78.

About 4k sentences in the training set have captionIDs and pairIDs beginning with 'vg_'. These come from a pilot data collection effort that used data from the VisualGenome corpus, which was still under construction as of the release of SNLI. For more information on VisualGenome, see: https://visualgenome.org/
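If you prefer to exclude these pilot pairs, a simple filter on the pairID field (as documented in the readme) suffices; a minimal Python sketch:

    import json

    def without_visual_genome(path):
        """Yield only the examples whose pairID does not begin with 'vg_'."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                if not ex["pairID"].startswith("vg_"):
                    yield ex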

Published results

The following table reflects our informal attempt to catalog published 3-class classification results on the SNLI test set. We define sentence encoding-based models as those that perform classification solely on the basis of a pair of sentence representations, each computed without access to the other sentence. Reported parameter counts do not include word embeddings. We welcome additional contributions.
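To make this distinction concrete, here is a schematic PyTorch sketch of a sentence encoding-based model (hypothetical dimensions; not a reimplementation of any entry below): each sentence is reduced to a fixed-size vector without seeing the other, and only the two vectors reach the classifier.

    import torch
    import torch.nn as nn

    class SentenceEncodingNLI(nn.Module):
        """Schematic sentence encoding-based NLI model."""

        def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 3),  # entailment / neutral / contradiction
            )

        def encode(self, tokens):
            # Use the final LSTM hidden state as the sentence representation.
            _, (h, _) = self.encoder(self.embed(tokens))
            return h[-1]

        def forward(self, premise, hypothesis):
            # The two sentences interact only through their fixed-size vectors.
            u = self.encode(premise)
            v = self.encode(hypothesis)
            return self.classifier(torch.cat([u, v], dim=-1))

By contrast, the attention-based models in the third group of the table let the two sentences interact word by word before any pooling, so they are not sentence encoding-based in the sense above.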

Three-way classification

Publication | Model | Parameters | Train (% acc) | Test (% acc)

Feature-based models
Bowman et al. '15 | Unlexicalized features | n/a | 49.4 | 50.4
Bowman et al. '15 | + Unigram and bigram features | n/a | 99.7 | 78.2

Sentence encoding-based models
Bowman et al. '15 | 100D LSTM encoders | 220k | 84.8 | 77.6
Bowman et al. '16 | 300D LSTM encoders | 3.0m | 83.9 | 80.6
Vendrov et al. '15 | 1024D GRU encoders w/ unsupervised 'skip-thoughts' pre-training | 15m | 98.8 | 81.4
Mou et al. '15 | 300D Tree-based CNN encoders | 3.5m | 83.3 | 82.1
Bowman et al. '16 | 300D SPINN-PI encoders | 3.7m | 89.2 | 83.2
Liu et al. '16 | 600D (300+300) BiLSTM encoders | 2.0m | 86.4 | 83.3
Munkhdalai & Yu '16b | 300D NTI-SLSTM-LSTM encoders | 4.0m | 82.5 | 83.4
Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention | 2.8m | 84.5 | 84.2
Munkhdalai & Yu '16a | 300D NSE encoders | 3.0m | 86.2 | 84.6

Other neural network models
Rocktäschel et al. '15 | 100D LSTMs w/ word-by-word attention | 250k | 85.3 | 83.5
Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention and symbolic preproc. | 2.8m | 85.9 | 85.0
Munkhdalai & Yu '16a | 300D MMA-NSE encoders with attention | 3.2m | 86.9 | 85.4
Wang & Jiang '15 | 300D mLSTM word-by-word attention model | 1.9m | 92.0 | 86.1
Cheng et al. '16 | 300D LSTMN with deep attention fusion | 1.7m | 87.3 | 85.7
Cheng et al. '16 | 450D LSTMN with deep attention fusion | 3.4m | 88.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model | 380k | 89.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model with intra-sentence attention | 580k | 90.5 | 86.8
Munkhdalai & Yu '16b | 300D Full tree matching NTI-SLSTM-LSTM w/ global attention | 3.2m | 88.5 | 87.3
Chen et al. '16 | 600D EBIM + 300D Syntactic TreeLSTM | 12m | 93.0 | 88.3

Related Papers without Standard Evaluations

Related Research at Stanford NLP

Related Resources at Stanford NLP

Contact Information

For any comments or questions, please email Sam and Gabor.