Sam Bowman 01/25/2016
Last September at EMNLP 2015, we released the Stanford Natural Language Inference (SNLI) Corpus. We're still excitedly working to build bigger and better machine learning models to use it to its full potential, and we sense that we're not alone, so we're using the launch of the lab's new website to share a bit of what we've learned about the corpus over the last few months.
SNLI is a collection of about 570,000 natural language inference (NLI) problems. Each problem is a pair of sentences, a premise and a hypothesis, labeled (by hand) with one of three labels: entailment, contradiction, or neutral. An NLI model is one that attempts to predict the correct label from the two sentences.
Here's a typical example randomly chosen from the development set:
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping.
Label: contradiction
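If it helps to see the format concretely, here is a minimal sketch of how one might read the corpus into (premise, hypothesis, label) tuples. It assumes the JSONL files from the download (e.g. snli_1.0_dev.jsonl) with sentence1, sentence2, and gold_label fields; adjust the names if your copy differs.

```python
import json

def load_snli(path):
    """Read one of the distributed .jsonl files into (premise, hypothesis, label) tuples."""
    examples = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # A small number of pairs have no consensus label; they are marked '-'.
            if ex['gold_label'] != '-':
                examples.append((ex['sentence1'], ex['sentence2'], ex['gold_label']))
    return examples

dev = load_snli('snli_1.0_dev.jsonl')
print(len(dev), dev[0])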
The sentences in SNLI are all descriptions of scenes, and photo captions played a large role in data collection. This made it easy for us to collect reliable judgments from untrained annotators and allowed us to solve the surprisingly difficult problem of coming up with a logically consistent definition of contradiction; that, in turn, is what made the huge size of the corpus possible. However, using only that genre of text means that several important linguistic phenomena, such as tense and timeline reasoning or opinions and beliefs, don't show up in SNLI. We are interested in going back to collect another inference corpus that goes beyond single scenes, so stay tuned.
We created SNLI with the goal of making the first high-quality NLI dataset large enough to serve as the sole training data for low-bias machine learning models like neural networks. There are plenty of things one can do with it, but we think it's especially valuable for three things:
If you simply want to browse the corpus, the corpus page contains several examples and a download link. If you want the key statistics about the size of the corpus and how it was annotated, the corpus paper has that information. For this post, we thought it would be helpful to do a quick quantitative breakdown of the kinds of phenomena that tend to show up in the corpus.
In particular, we hand-tagged 100 randomly sampled sentence pairs from the test set with labels for a handful of phenomena that we found interesting. These phenomena are not mutually exclusive, and the count for each can be treated as a very rough estimate of its frequency in the overall corpus.
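A sample like this is easy to draw for yourself. The sketch below reuses the load_snli helper from the earlier snippet and picks 100 test pairs at random; it will of course not reproduce our exact sample.

```python
import random

random.seed(0)  # any seed; our own sample was not drawn this way
test = load_snli('snli_1.0_test.jsonl')  # helper from the sketch above
sample = random.sample(test, 100)

# The phenomenon tags themselves were assigned by hand; here we just set up
# an empty tag list for each sampled pair.
tagged = [{'premise': p, 'hypothesis': h, 'label': l, 'phenomena': []}
          for (p, h, l) in sample]
```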
Full sentences and bare noun phrases: SNLI is a mix of full sentences (There is a duck) and bare noun phrases (A duck in a pond). Using the labels from the Stanford parser, we found that full sentences are more common, and that noun phrases mostly occur in pairs with full sentences.
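As a very rough illustration of how such a count can be made, the distributed files include the parses we used (in fields named sentence1_parse and sentence2_parse), so one can read off the label of the top constituent. The sketch below assumes the usual (ROOT (S ...)) versus (ROOT (NP ...)) bracketing and makes no attempt to handle anything more exotic.

```python
import json
import re
from collections import Counter

def top_label(parse):
    # Parses look like "(ROOT (S (NP ...) ...))"; grab the label of ROOT's
    # first child, e.g. 'S' for a full sentence or 'NP' for a bare noun phrase.
    m = re.match(r'\(ROOT \((\S+)', parse)
    return m.group(1) if m else None

with open('snli_1.0_dev.jsonl') as f:
    labels = [top_label(json.loads(line)['sentence1_parse']) for line in f]

print(Counter(labels).most_common())
```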
Insertions: One strategy that turned out to be especially popular among annotators writing neutral pairs is to build a hypothesis that mostly draws text from the premise, but adds a prepositional phrase (There is a duck to There is a duck in a pond) or an adjective or adverb (There is a duck to There is a large duck).
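A crude way to flag candidates for this pattern is to check whether the premise's tokens appear, in order, inside the hypothesis. The sketch below does only that, with naive whitespace tokenization, and will naturally miss insertions that involve any rewording.

```python
def is_subsequence(short, long):
    # True if every token of `short` appears in `long`, in order.
    it = iter(long)
    return all(tok in it for tok in short)

def looks_like_insertion(premise, hypothesis):
    p = premise.lower().rstrip('.').split()
    h = hypothesis.lower().rstrip('.').split()
    return len(h) > len(p) and is_subsequence(p, h)

print(looks_like_insertion('There is a duck.', 'There is a duck in a pond.'))  # True
print(looks_like_insertion('There is a duck.', 'There is a large duck.'))      # True
```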
Lexical relations: One of the key building blocks for logical inference systems like those studied in natural logic is the ability to reason about relationships like entailment or contradiction between individual words. In many examples of sentence-level entailment, this kind of reasoning makes up a substantial part of the problem, as in There is a duck by the pond–There is a bird near water. We measured the frequency of this phenomenon by counting the examples in which a pair of words standing in an entailment or contradiction relationship (in either direction) could be reasonably aligned between the premise and the hypothesis.
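As a sketch of what word-level reasoning of this kind can look like, a WordNet hypernym lookup (here via NLTK, which is purely our choice of tool) covers the easy cases: duck entails bird because bird appears among duck's hypernym ancestors.

```python
from nltk.corpus import wordnet as wn  # requires NLTK and its WordNet data

def entails_lexically(word, candidate):
    # Rough check: does some noun synset of `candidate` appear among the
    # hypernym ancestors of some noun synset of `word`?
    candidate_synsets = set(wn.synsets(candidate, pos=wn.NOUN))
    for syn in wn.synsets(word, pos=wn.NOUN):
        ancestors = {h for path in syn.hypernym_paths() for h in path}
        if ancestors & candidate_synsets:
            return True
    return False

print(entails_lexically('duck', 'bird'))  # True
print(entails_lexically('bird', 'duck'))  # False: the entailment only goes one way
```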
Commonsense world knowledge: Unlike the earlier FraCaS entailment data, SNLI contains many examples that can be difficult to judge without access to contingent facts about the world that go beyond lexical relationships, as in A girl makes a snow angel–A girl is playing in snow, where one needs to know that snow angels are made by playing in snow.
Multi-word expressions: Multi-word expressions with non-compositional meanings (or, loosely speaking, idioms) complicate the construction and evaluation of models like RNNs that take words as input. SICK, the earlier dataset that inspired our work, explicitly excludes any such multi-word expressions. We did not find them to be especially common.
Pronoun coreference/anaphora: Reference (or anaphora) from a pronoun in the hypothesis to an expression in the premise, as in examples like The duck was swimming–It was in the water, can create additional complexity for inference systems, especially when there are multiple possible referents. We found only a handful of such cases.
Negation: One simple way to create a hypothesis that contradicts some premise is to copy the premise and add any of several kinds of negation, as in There is a duck–There is not a duck. This approach to creating contradictions is extremely easy to detect, and was somewhat common in the SICK entailment corpus. We measured the frequency of this phenomenon by counting the sentence pairs in which the hypothesis and the premise can be at least loosely aligned and the hypothesis uses some kind of negation in a position that does not align to any negation in the premise.
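Our alignment was informal, but a loose automatic stand-in, sketched below with a small hand-picked negation word list of our own, is to flag pairs where the hypothesis contains a negation word and the premise contains none at all.

```python
NEGATIONS = {'not', "n't", 'no', 'never', 'nobody', 'nothing', 'none', 'nowhere'}

def negation_words(sentence):
    tokens = sentence.lower().replace("n't", " n't").rstrip('.').split()
    return set(tokens) & NEGATIONS

def adds_negation(premise, hypothesis):
    # Very rough approximation of the criterion above: the hypothesis uses a
    # negation word while the premise uses none at all (no real alignment).
    return bool(negation_words(hypothesis)) and not negation_words(premise)

print(adds_negation('There is a duck.', 'There is not a duck.'))  # True
print(adds_negation('There is not a duck.', 'There is a dog.'))   # False
```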
Common templates: Besides what came up above, two common techniques that annotators used to build sentence pairs were either to come up with a complete non sequitur (usually marked contradiction) or to pick out one entity from the premise and compose a hypothesis of the form There {is, are} X. Together, these two templates make up a few percent of the corpus.
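Of the two, the there-template is the easier one to spot automatically; the non sequiturs have no comparable surface cue. A sketch with a simple regular expression:

```python
import re

# Matches hypotheses of the form "There is X" / "There are X".
THERE_BE = re.compile(r'^there (is|are)\b', re.IGNORECASE)

def uses_there_template(hypothesis):
    return bool(THERE_BE.match(hypothesis.strip()))

print(uses_there_template('There are two ducks.'))   # True
print(uses_there_template('The man is sleeping.'))   # False
```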
Mistakes: The corpus wasn't edited for spelling or grammar, so there are typos.
Several papers evaluating models on SNLI have been released in recent months, and we've collected all the ones we're aware of (mostly found through Google Scholar) on the corpus page. The overall state of the art right now is 86.1% classification accuracy from Shuohang Wang and Jing Jiang at Singapore Management University, using a clever variant of a sequence-to-sequence neural network model with soft attention. Lili Mou et al. at Peking University and Baidu Beijing deserve an honorable mention for creating the most effective model that reasons over a single fixed-size vector representation for each sentence, rather than constructing word-by-word alignments as with attention; they reach 82.1% accuracy. There are two other papers on the corpus page as well, each with its own insights about NLI modeling with neural networks, so have a look there before setting off on your own with the corpus.
Google's Mat Kelcey has some simple experiments on SNLI posted as well, here and here. While these experiments don't reach the state of the art, they include Theano and TensorFlow code, and so may be a useful starting point for those building their own models.