Sam Bowman 01/25/2016
Last September at EMNLP 2015, we released the Stanford Natural Language Inference (SNLI) Corpus. We're still excitedly working to build bigger and better machine learning models to use it to its full potential, and we sense that we're not alone, so we're using the launch of the lab's new website to share a bit of what we've learned about the corpus over the last few months.
SNLI is a collection of about 570,000 natural language inference (NLI) problems. Each problem is a pair of sentences, a premise and a hypothesis, labeled (by hand) with one of three labels: entailment, contradiction, or neutral. An NLI model is one that attempts to predict the correct label from the two sentences.
Here's a typical example randomly chosen from the development set:
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping.
Label: contradiction
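If it helps to see the format concretely, here is a minimal sketch of how one might read the corpus into (premise, hypothesis, label) tuples. It assumes the JSONL files from the download (e.g. snli_1.0_dev.jsonl) with sentence1, sentence2, and gold_label fields; adjust the names if your copy differs.

```python
import json

def load_snli(path):
    """Read one of the distributed .jsonl files into (premise, hypothesis, label) tuples."""
    examples = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # A small number of pairs have no consensus label; they are marked '-'.
            if ex['gold_label'] != '-':
                examples.append((ex['sentence1'], ex['sentence2'], ex['gold_label']))
    return examples

dev = load_snli('snli_1.0_dev.jsonl')
print(len(dev), dev[0])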
The sentences in SNLI are all descriptions of scenes, and photo captions played a large role in data collection. This made it easy for us to collect reliable judgments from untrained annotators and allowed us to solve the surprisingly difficult problem of coming up with a logically consistent definition of contradiction; that, in turn, is what made the huge size of the corpus possible. However, using only that genre of text means that several important linguistic phenomena, such as tense and timeline reasoning or opinions and beliefs, don't show up in SNLI. We are interested in going back to collect another inference corpus that goes beyond single scenes, so stay tuned.
We created SNLI with the goal of making the first high-quality NLI dataset large enough to serve as the sole training data for low-bias machine learning models like neural networks. There are plenty of things one can do with it, but we think it's especially valuable for three things:
If you simply want to browse the corpus, the corpus page contains several examples and a download link. If you want the key statistics about the size of the corpus and how it was annotated, the corpus paper has that information. For this post, we thought it would be helpful to do a quick quantitative breakdown of the kinds of phenomena that tend to show up in the corpus.
In particular, we hand-tagged 100 randomly sampled sentence pairs from the test set with labels for a handful of phenomena that we found interesting. These phenomena are not mutually exclusive, and the count for each can be treated as a very rough estimate of its frequency in the overall corpus.
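A sample like this is easy to draw for yourself. The sketch below reuses the load_snli helper from the earlier snippet and picks 100 test pairs at random; it will of course not reproduce our exact sample.

```python
import random

random.seed(0)  # any seed; our own sample was not drawn this way
test = load_snli('snli_1.0_test.jsonl')  # helper from the sketch above
sample = random.sample(test, 100)

# The phenomenon tags themselves were assigned by hand; here we just set up
# an empty tag list for each sampled pair.
tagged = [{'premise': p, 'hypothesis': h, 'label': l, 'phenomena': []}
          for (p, h, l) in sample]
```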
Full sentences and bare noun phrases: SNLI is a mix of full sentences (There is a duck) and bare noun phrases (A duck in a pond). Using the labels from the Stanford parser, we found that full sentences are more common, and that noun phrases mostly occur in pairs with full sentences.
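As a very rough illustration of how such a count can be made, the distributed files include the parses we used (in fields named sentence1_parse and sentence2_parse), so one can read off the label of the top constituent. The sketch below assumes the usual (ROOT (S ...)) versus (ROOT (NP ...)) bracketing and makes no attempt to handle anything more exotic.

```python
import json
import re
from collections import Counter

def top_label(parse):
    # Parses look like "(ROOT (S (NP ...) ...))"; grab the label of ROOT's
    # first child, e.g. 'S' for a full sentence or 'NP' for a bare noun phrase.
    m = re.match(r'\(ROOT \((\S+)', parse)
    return m.group(1) if m else None

with open('snli_1.0_dev.jsonl') as f:
    labels = [top_label(json.loads(line)['sentence1_parse']) for line in f]

print(Counter(labels).most_common())
```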
Insertions: One strategy that turned out to be especially popular among annotators writing neutral pairs is to build a hypothesis that mostly draws text from the premise, but adds a prepositional phrase (There is a duck to There is a duck in a pond) or an adjective or adverb (There is a duck to There is a large duck).
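A crude way to flag candidates for this pattern is to check whether the premise's tokens appear, in order, inside the hypothesis. The sketch below does only that, with naive whitespace tokenization, and will naturally miss insertions that involve any rewording.

```python
def is_subsequence(short, long):
    # True if every token of `short` appears in `long`, in order.
    it = iter(long)
    return all(tok in it for tok in short)

def looks_like_insertion(premise, hypothesis):
    p = premise.lower().rstrip('.').split()
    h = hypothesis.lower().rstrip('.').split()
    return len(h) > len(p) and is_subsequence(p, h)

print(looks_like_insertion('There is a duck.', 'There is a duck in a pond.'))  # True
print(looks_like_insertion('There is a duck.', 'There is a large duck.'))      # True
```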
Lexical relations: One of the key building blocks for logical inference systems like those studied in natural logic is the ability to reason about relationships like entailment or contradiction between individual words. In many examples of sentence-level entailment, this kind of reasoning makes up a substantial part of the problem, as in There is a duck by the pond–There is a bird near water. We measured the frequency of this phenomenon by counting the examples in which a pair of words standing in an entailment or contradiction relationship (in either direction) could be reasonably aligned between the premise and the hypothesis.
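As a sketch of what word-level reasoning of this kind can look like, a WordNet hypernym lookup (here via NLTK, which is purely our choice of tool) covers the easy cases: duck entails bird because bird appears among duck's hypernym ancestors.

```python
from nltk.corpus import wordnet as wn  # requires NLTK and its WordNet data

def entails_lexically(word, candidate):
    # Rough check: does some noun synset of `candidate` appear among the
    # hypernym ancestors of some noun synset of `word`?
    candidate_synsets = set(wn.synsets(candidate, pos=wn.NOUN))
    for syn in wn.synsets(word, pos=wn.NOUN):
        ancestors = {h for path in syn.hypernym_paths() for h in path}
        if ancestors & candidate_synsets:
            return True
    return False

print(entails_lexically('duck', 'bird'))  # True
print(entails_lexically('bird', 'duck'))  # False: the entailment only goes one way
```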
Commonsense world knowledge: Unlike the earlier FraCaS entailment data, SNLI contains many examples that can be difficult to judge without access to contingent facts about the world that go beyond lexical relationships, as in A girl makes a snow angel–A girl is playing in snow, where one needs to know that snow angels are made by playing in snow.
Multi-word expressions: Multi-word expressions with non-compositional meanings (or, loosely speaking, idioms) complicate the construction and evaluation of models like RNNs that take words as input. SICK, the earlier dataset that inspired our work, explicitly excludes any such multi-word expressions. We did not find them to be especially common.
Pronoun coreference/anaphora: Reference (or anaphora) from a pronoun in the hypothesis to an expression in the premise, as in examples like The duck was swimming–It was in the water, can create additional complexity for inference systems, especially when there are multiple possible referents. We found only a handful of such cases.
Negation: One simple way to create a hypothesis that contradicts some premise is to copy the premise and add any of several kinds of negation, as in There is a duck–There is not a duck. This approach to creating contradictions is extremely easy to detect, and was somewhat common in the SICK entailment corpus. We measured the frequency of this phenomenon by counting the sentence pairs in which the hypothesis and the premise can be at least loosely aligned and the hypothesis uses some kind of negation in a position that does not align to any negation in the premise.
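Our alignment was informal, but a loose automatic stand-in, sketched below with a small hand-picked negation word list of our own, is to flag pairs where the hypothesis contains a negation word and the premise contains none at all.

```python
NEGATIONS = {'not', "n't", 'no', 'never', 'nobody', 'nothing', 'none', 'nowhere'}

def negation_words(sentence):
    tokens = sentence.lower().replace("n't", " n't").rstrip('.').split()
    return set(tokens) & NEGATIONS

def adds_negation(premise, hypothesis):
    # Very rough approximation of the criterion above: the hypothesis uses a
    # negation word while the premise uses none at all (no real alignment).
    return bool(negation_words(hypothesis)) and not negation_words(premise)

print(adds_negation('There is a duck.', 'There is not a duck.'))  # True
print(adds_negation('There is not a duck.', 'There is a dog.'))   # False
```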
Common templates: Besides what came up above, two common techniques that annotators used to build sentence pairs were either to come up with a complete non sequitur (usually marked contradiction) or to pick out one entity from the premise and compose a hypothesis of the form There {is, are} X. Together, these two templates make up a few percent of the corpus.
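Of the two, the there-template is the easier one to spot automatically; the non sequiturs have no comparable surface cue. A sketch with a simple regular expression:

```python
import re

# Matches hypotheses of the form "There is X" / "There are X".
THERE_BE = re.compile(r'^there (is|are)\b', re.IGNORECASE)

def uses_there_template(hypothesis):
    return bool(THERE_BE.match(hypothesis.strip()))

print(uses_there_template('There are two ducks.'))   # True
print(uses_there_template('The man is sleeping.'))   # False
```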
Mistakes: The corpus wasn't edited for spelling or grammar, so there are typos.
Several papers evaluating models on SNLI have been released in recent months, and we've collected all the ones we're aware of (mostly found through Google Scholar) on the corpus page. The overall state of the art right now is 86.1% classification accuracy from Shuohang Wang and Jing Jiang at Singapore Management University, using a clever variant of a sequence-to-sequence neural network model with soft attention. Lili Mou et al. at Peking University and Baidu Beijing deserve an honorable mention for creating the most effective model that reasons over a single fixed-size vector representation for each sentence, rather than constructing word-by-word alignments as with attention; they reach 82.1% accuracy. There are two other papers on the corpus page as well, each with its own insights about NLI modeling with neural networks, so have a look there before setting off on your own with the corpus.
Google's Mat Kelcey has some simple experiments on SNLI posted as well, here and here. While these experiments don't reach the state of the art, they include Theano and TensorFlow code, and so may be a useful starting point for those building their own models.