Explanation of Sentence Classification


Entailed vs. Implied


We subdivided the TRUE sentence pairs further into Entailed and Implied. A hypothesis is Entailed if there is no reasonable world in which the text is true and the hypothesis is false; otherwise, the hypothesis is Implied. Sentence ambiguity did not matter for this measure; we considered only the intended meanings of the sentence and hypothesis.

Often, the hypothesis would refer to part of a quote. We classified these cases as Implied, since the fact that someone said X does not logically entail X.


Syntactic, Lexical, or More


We categorized each of the sentence pairs by the most complex form of reasoning it would take to decide the hypothesis, where Syntactic is the most basic and More is the most complex. For true sentences, this meant the most complex form of reasoning necessary to give a correct justification of the hypothesis. (If more basic reasoning could reach the right answer only by a “lucky guess”, that does not count.) For false sentences, our standard was much less clear-cut, but we generally picked the most complex form of reasoning that was directly relevant to why the hypothesis does not follow.

Syntactic hypotheses can be decided purely by syntactic transformations, such as passivization, stemming, appositives, or some forms of coreference (such as matching pronouns with their antecedents). Lexical hypotheses require some knowledge about particular words, including synonym/hypernym/antonym/similarity relations, subcategorization, or nominalizations. Our system cannot yet deal with subcategorization, but we included it anyway because it can be handled through lexical resources such as Cyc. Idioms also fell into this category, since they are at least partially covered by WordNet and participate in the same sorts of relations as words do. Hypotheses categorized as More require additional knowledge, usually about the world; often, this means one of the forms of reasoning described in the next section.
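As a concrete illustration of the lexical level, here is a minimal sketch of how synonym and hypernym checks might be done with NLTK's WordNet interface. The function name and logic are illustrative only, not our system's actual code:

    from nltk.corpus import wordnet as wn

    def lexically_related(word_a, word_b):
        """Return True if word_b is a synonym or (transitive) hypernym of word_a."""
        synsets_b = set(wn.synsets(word_b))
        for syn_a in wn.synsets(word_a):
            if syn_a in synsets_b:  # synonymy: the two words share a synset
                return True
            for path in syn_a.hypernym_paths():
                if synsets_b.intersection(path):  # hypernymy along the path
                    return True
        return False

    # e.g. lexically_related("dog", "animal") -> True (hypernym relation)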



Forms of Reasoning


Next, we highlighted particular forms of reasoning that were either required or very helpful for deciding whether the hypothesis follows. Our standards for this part were fairly conservative; we would not list a form of reasoning if the decision could be made correctly without it. For instance, we did not list “quantity” for the following pair, since “10 missed opportunities” could be matched exactly:


Text: The report catalogues 10 missed opportunities within the CIA and FBI to uncover pieces of the September 11 plot.

Hypothesis: The report analyzes 10 missed opportunities within the CIA and FBI to uncover pieces of the September 11 plot. (TRUE)


However, we would have listed “quantity” if the pair were worded as follows, since matching “10” with “ten” requires some knowledge about quantities:


Text: The report catalogues 10 missed opportunities within the CIA and FBI to uncover pieces of the September 11 plot.

Hypothesis: The report analyzes ten missed opportunities within the CIA and FBI to uncover pieces of the September 11 plot. (TRUE)
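Matching “10” with “ten” amounts to normalizing number words to values. A tiny, purely illustrative sketch (a real system would need a much fuller number grammar):

    NUMBER_WORDS = {
        "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    }

    def normalize_quantity(token):
        """Map a token to an int when it denotes a small quantity, else None."""
        if token.isdigit():
            return int(token)
        return NUMBER_WORDS.get(token.lower())

    assert normalize_quantity("10") == normalize_quantity("ten") == 10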


For true sentences, as above, we listed a form of reasoning only if it was very helpful in correctly justifying the hypothesis. For false sentences, we listed any form of reasoning that seemed relevant to why the hypothesis fails.


Coreference: recognizing that two noun phrases refer to the same entity. This includes pronouns, as in “Bill parked his car”, so long as the pronoun is part of what should be matched. It also includes recognizing that distinct proper names can refer to the same entity (such as “U.S. Congress” and “Congress”).

Quantity: any form of reasoning involving quantities. This includes knowing that “10” and “ten” are the same, recognizing that $25 is an amount of money, or reasoning about numerical relations. We also counted temporal relations, if the kind of reasoning required was analogous to greater-than/less-than comparison.

Acronyms: any case that required pairing an acronym with its full title, such as “European Union” with “EU” (a sketch of this kind of check appears after this list). We did not count cases where only the acronym was given.

Nominalizations: any case where a verb had to be matched with its nominalization.

Idiom: any case where the meaning of a phrase could not be determined from the meanings of its parts. Examples include “bring in $1,000,000” and “stocks took a dive”.
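The acronym check mentioned above can be pictured as matching initials. This heuristic is hypothetical, not our system's actual code:

    def acronym_matches(acronym, full_title):
        """True when the acronym spells out the initials of the title's capitalized words."""
        initials = "".join(w[0] for w in full_title.split() if w[0].isupper())
        return acronym.upper() == initials

    assert acronym_matches("EU", "European Union")
    assert not acronym_matches("UN", "European Union")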



Examples


Threshold: 8.03 (pairs whose cost fell below this threshold were guessed TRUE; pairs above it, FALSE)
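The decision rule implied by the examples below can be sketched as follows; the names are ours, not the system's:

    THRESHOLD = 8.03

    def guess(cost):
        """The system's guess for a text/hypothesis pair with the given cost."""
        return "TRUE" if cost < THRESHOLD else "FALSE"

    # From the examples below: guess(0) == "TRUE" (wrong for "second tallest"),
    # while guess(17.01) == "FALSE" (wrong for the Rwanda pair).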


There were some phrases whose meaning is not intersective, in that they somehow weaken the phrase they modify. This includes modifiers of superlatives:


The slender tower is the second tallest building in Japan.


The slender tower is the tallest building in Japan. (FALSE) Cost: 0


This pair had a cost of 0, since every word in the hypothesis matched perfectly with a word in the text. Ideally, “second” should somehow contribute to raising the cost. This mistake happened three times.
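One possible patch, purely illustrative (the word list and penalty value are invented), would be to penalize an ordinal modifier on a superlative that the hypothesis matches without it:

    ORDINALS = {"second", "third", "fourth", "fifth"}

    def ordinal_superlative_penalty(text_tokens, hyp_tokens, penalty=10.0):
        hyp_lower = {t.lower() for t in hyp_tokens}
        for tok, nxt in zip(text_tokens, text_tokens[1:]):
            # crude superlative test: an "-est" suffix, e.g. "tallest"
            if tok.lower() in ORDINALS and nxt.lower().endswith("est"):
                if nxt.lower() in hyp_lower and tok.lower() not in hyp_lower:
                    return penalty
        return 0.0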



Here, although it is not quite negation, the modifier “less” significantly weakens the meaning of “protection”.


Suncreams designed for children could offer less protection than they claim on the bottle.


Suncreams designed for children protect at the level they advertise. (FALSE) Cost: 5.67


In fact, “less” was not even considered in the proof.




For this pair, some knowledge of verb subcategorization would be helpful, since “from” and “to” mark different semantic roles:


Profits nearly doubled to nearly $1.8 billion.


Profits nearly doubled from nearly $1.8 billion. (FALSE) Cost: 0
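A hypothetical check for this failure mode: when a matched verb is followed by different prepositions in the text and the hypothesis, the arguments likely fill different roles, so the match should be penalized. This sketch is illustrative, not our system's code:

    PREPOSITIONS = {"to", "from", "by", "for", "with", "of"}

    def preposition_mismatch(text_tokens, hyp_tokens, verb):
        """True when the same verb takes different prepositions in the two sentences."""
        def prep_after(tokens):
            for tok, nxt in zip(tokens, tokens[1:]):
                if tok.lower() == verb and nxt.lower() in PREPOSITIONS:
                    return nxt.lower()
            return None
        p_text, p_hyp = prep_after(text_tokens), prep_after(hyp_tokens)
        return None not in (p_text, p_hyp) and p_text != p_hyp

    # "doubled to" vs. "doubled from" -> True (mismatch)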



Sometimes a false pair has every word matching up but one, and the high cost of that single match gets divided by the number of terms, making our model guess True; the arithmetic is illustrated after the example. Perhaps “strike” and “crash into” also have high similarity scores:


Rich and poor nations struck a historic deal on Sunday to slash billions of dollars in farm subsidies, create more open industrial markets and revive stalled world trade talks that could boost global growth.


Rich and poor nations crashed into a historic deal on Sunday. (FALSE) Cost: 1.71
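Here is the arithmetic of the dilution, with invented per-term costs chosen only to reproduce the reported 1.71: ten terms match perfectly and a single expensive mismatch is averaged away, leaving the pair well under the 8.03 threshold:

    term_costs = [0.0] * 10 + [18.8]   # hypothetical: one bad match among eleven terms
    normalized = sum(term_costs) / len(term_costs)
    print(round(normalized, 2))        # -> 1.71, guessed TRUE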



Here, “Microsoft” was matched with the “Microsoft” in “Microsoft Israel”:


Microsoft Israel was founded in 1989 and became one of the first Microsoft branches outside the USA.


Microsoft was established in 1989. (FALSE) Cost: 1.90



In this case, “complains” was matched with “complaints”, but should not have been:


Wal-Mart has received a lot of negative publicity recently, including allegations that it used illegal workers and made employees work without pay during lunch breaks, as well as complaints that it generally underpays employees.


Wal-Mart complains about negative publicity. (FALSE) Cost: 2.29



“Hit” and “shot” were matched, probably because they are rated as similar:


Kerry hit Bush hard on his conduct on the war in Iraq.


Kerry shot Bush. (FALSE) Cost: 2.51



Not counting an ungrammatical sentence, the four pairs assigned the highest costs all turned out to be True. The logical leaps they require are large enough that hardly any words matched up closely. Ideally, the system should recognize that it does not “understand” these sentences, and therefore not assign them such a high confidence of being False; a sketch of one possible adjustment follows the examples.


Witnesses of genocide attacks on minority Tutsis in Rwanda are being singled out for execution by extremist Hutus in refugee camps in southwest Rwanda, a U.N. spokesman said in Kigali, the capital.


The Hutu and Tutsi groups fought in Rwanda. (TRUE) Cost: 17.01


The Croatian will face competition, especially from the American of Chinese origin, Michael Chang, second-seeded in the tournament. Organizers hope that the tournament will contribute to advancing tennis in their countries.


The Croatian and the Asian-American, Michael Chang (number two) will play each other, and hopefully popularise the game in their native countries. (TRUE) Cost: 18.95


More than 150 dolphins, marine turtles and beaked whales have been washed up dead on beaches in Africa.


Dead dolphins, turtles and whales have been found on African beaches. (TRUE) Cost: 19.60


Ahern, who was travelling to Tokyo for an EU-Japan summit yesterday, will consult with other EU leaders by telephone later this week in an effort to find an agreed candidate.


A summit between Europe and Japan is taking place in the Japanese capital. (TRUE) Cost: 20.26
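A sketch of the adjustment suggested above: rather than a hard TRUE/FALSE cut at 8.03, report low confidence when the cost is so high that the system has plausibly failed to “understand” the pair. The upper cutoff here is invented:

    THRESHOLD = 8.03      # the system's decision threshold
    ABSTAIN_ABOVE = 15.0  # hypothetical "don't understand" cutoff

    def guess_with_confidence(cost):
        if cost >= ABSTAIN_ABOVE:
            return ("FALSE", "low confidence")
        return ("TRUE" if cost < THRESHOLD else "FALSE", "normal confidence")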



Almost all of the false negatives with high costs were pairs classified as “More”. This suggests that the syntactic and lexical parts of our system are working fairly well. The highest cost for a True sentence that could have been recognized by lexical means is:


Alcohol now fuels 44 percent of all violent crime and 70 percent of Accident and Emergency hospital admissions at peak times are due to booze.


Alcohol has an effect on violent crimes. (TRUE)


It's hard to find other interesting examples of false negatives for our present purposes. The system correctly identified almost all the true hypotheses that could be proved without additional world knowledge.