As we've worked on the RTE system, we've tried to code clearly and document as much as possible along the way. Still, for someone new to the system, it's hard to know where to find everything. So we've put together this primer on the structure of the system, where to find things, and some technical details that might otherwise cause frustrating problems. This should tell you which classes are important; the JavaDocs should tell you the rest. At the end, we provide a list of some ideas we've had but haven't gotten around to implementing yet.
Conceptually, our system works in two stages: preprocessing and inferring. The preprocessing stage includes POS tagging, NER, parsing, transforming the parse trees into collapsed dependency trees, and a few other annotations. This is all done by the RTEEngine class, which takes as input a kbEval XML file and returns an XML "info file". The inferers then try to "prove" the hypothesis from the passage, using the information in these info files. Right now, we don't have a single script to run both stages, though one could easily be written. (Maybe make up a new system diagram?)
The RTEEngine class does all the classic NLP work. It reads in a kbEval XML file, whose format is hopefully pretty self-explanatory. These are kept in /u/nlp/rte/data/XMLInput/. The output is what we call an "info file", for lack of a better name; to get a feel for what these are, look at the ones in /u/nlp/rte/data/XMLInfoFiles/. Try running RTEEngine with one of these:
java -mx1000m edu.stanford.nlp.rte.RTEEngine -kbEvalIn myKBEvalCorpus.xml -XMLOut myOutFile.xml

java -mx1850m edu.stanford.nlp.rte.RTEEngine -useSRL -idf -augmentDependencies -polarity -kbEvalIn myKBEvalCorpus.xml -XMLOut myOutFile.xml
The official versions of the KB Eval development files are the files with names ending in _dev.kbe.xml in the directory /u/nlp/rte/data/input/kbeformat/.
The -idf, -augmentDependencies, and -polarity flags turn on some extra annotation, which is of uncertain value but which you may well want in order to get full output.
Some of the packages, especially MoveTrees, take a while to load, so be patient. If you want to write your own sentences to run RTEEngine on, try the following:
java -mx1000m edu.stanford.nlp.rte.RTEEngine -textFile mySentences.txt -XMLOut myOutFile.xml
Your sentences file should contain two lines: the first is the passage, and the second is the hypothesis. The InferenceInformation class handles the actual XML format; the RTE engine and the inferers all interface through it.
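For instance, a minimal mySentences.txt (contents invented for illustration) might look like:

Tom jumped fast.
Tom jumped.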
Currently, RTEEngine does the following (all in RTEEngine.processOneSentence()):

- POS tagging and NER; multi-word named entities are collapsed into single tokens (e.g., George Bush becomes George_Bush).
- Parsing, with the LexicalizedParser.
- Conversion of each parse tree into collapsed typed dependencies, using the EnglishGrammaticalRelations class. Tree.allTypedDependenciesCollapsed() returns a list of typed dependencies, which is not necessarily a tree, because extra dependencies are added in for control verbs, relative clauses, and such. If you want a tree without all the extra arcs, use Tree.allTypedDependenciesCollapsed(false).
- Polarity labeling: RuleBasedPolarityLabeler gives all verbs and nouns a label corresponding to whether the sentence implies that that event (or object) is true or false, or neither. It uses a set of human-defined rules, which can be found at /u/nlp/rte/data/resources/PolarityLabeler/polarity. The rules language is not extensively tested, so you might not want to play around much with the file, aside from relatively trivial changes.
- Assume-factor annotation, using /u/nlp/rte/src/parameters/ASSUMEFACTORSFILE__FULL.
- Dependency graph augmentation, done by DependencyGraphAugmentor, and only for the passage, not the hypothesis.

(Add stuff about transforming questions.)
You can convert an XML info file into HTML with the following command, assuming you have $JAVANLP_HOME/samples on your CLASSPATH:

java XSLTransform info.xml /u/apache/htdocs/projects/infer/info_to_html.xml info.html
Just to be confusing, words are stored as two different classes: MapLabel and FeatureLabel, both of which are subclasses of AbstractFeatureLabel. The differences are mostly technical, and most functions (word(), tag(), etc.) are the same for both classes. (Note: use word(), not value(), to get the word.) A sentence is just a list of MapLabels and FeatureLabels. The typed dependency structure is a List<TypedDependency>, but think of it as an unordered set. The TypedDependency class contains a relation, a governor, and a dependent. Get the (String) relation with reln(). To retrieve the index of the governor in the word table (the dependent is analogous), use

int gov = ((MapLabel)d.gov()).index();

Keep in mind that the word indices are stored from 1 to N, so to look up the actual word, use wordTable.get(wordIndex-1).
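Putting those pieces together, here is a sketch of walking the dependencies and recovering the actual words. (DependencyPrinter is a made-up helper, the package paths are our best guess, and we assume the word table is a List<MapLabel>; adjust as needed.)

import java.util.List;
import edu.stanford.nlp.ling.MapLabel;
import edu.stanford.nlp.trees.TypedDependency;

public class DependencyPrinter {
  // Prints each dependency as "reln(govWord, depWord)".
  // wordTable is the sentence's word list; deps its typed dependencies.
  public static void printDependencies(List<MapLabel> wordTable,
                                       List<TypedDependency> deps) {
    for (TypedDependency d : deps) {
      int gov = ((MapLabel) d.gov()).index();  // indices run from 1 to N
      int dep = ((MapLabel) d.dep()).index();
      // Subtract 1 because the List itself is 0-based.
      String govWord = wordTable.get(gov - 1).word();
      String depWord = wordTable.get(dep - 1).word();
      System.out.println(d.reln() + "(" + govWord + ", " + depWord + ")");
    }
  }
}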
Although the RTEEngine keeps a Tree object for the parse tree, when InferenceInformation reads an existing info file, it returns the Penn Treebank String representation. What if, for some reason, you want the Tree back? Check out Tree.valueOf(String).
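For example (the bracketed string here is made up):

import edu.stanford.nlp.trees.Tree;

public class TreeDemo {
  public static void main(String[] args) {
    // Rebuild a Tree from its Penn Treebank string representation.
    String penn = "(ROOT (S (NP (NNP Tom)) (VP (VBD jumped))))";
    Tree tree = Tree.valueOf(penn);
    System.out.println(tree.pennString());  // prints the bracketed form back out
  }
}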
Currently, we have three "text inferers", which are basically three ways of searching the space of possible matchings between dependency trees. (We're trying to break the habit of calling them "systems", since they are each a component of the larger system.) The preferred way to run them is the CorpusEvaluator class:
java -mx1000m edu.stanford.nlp.rte.CorpusEvaluator -info myInfoFile.xml -model [recursive|abduction|graph|combined]
If you want to see the results broken down by task, use the -tasks flag and supply the original PASCAL XML file. (The kbEval XML files don't contain the task labels.)
java -mx1000m edu.stanford.nlp.rte.CorpusEvaluator -info myInfoFile.xml -model [recursive|abduction|graph] -tasks myPascalFile.xml
The PASCAL XML files are kept in /u/nlp/rte/data/input/xmlformat/ and have pascal in their name. To run just one inference, use -inference myInferenceID.
For debugging purposes, you might want to compare the CorpusEvaluator outputs for two different runs to see which answers changed. For this, run
java edu.stanford.nlp.rte.CorpusEvaluatorDiff myOutputFile1 myOutputFile2
It will print all the answers that were correct in the first file but wrong in the second, and vice versa. This is useful if your accuracy suddenly sank 5% and you want to know why, or if you want to compare two inferers or two sets of parameters.
For each lexical resource we use, we have a class that extends AbstractLexicalResource. Basically, each individual resource has a computeSimilarity() function which takes two WordTags and returns a similarity score from 0 to 1, where 1 is most similar. (Yup. A third class that stores a word.) All of the pairwise similarity scores for each resource are stored in a cache, and AbstractLexicalResource knows how to interact with these caches. Ideally, we should only have to compute all the scores once for each data set.
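To add a new resource, then, you subclass AbstractLexicalResource. A hypothetical sketch, assuming computeSimilarity(WordTag, WordTag) returning a double is the only method you must override (check the actual class for the real signature and the caching hooks):

import edu.stanford.nlp.ling.WordTag;

// Hypothetical example resource: exact string match only.
public class ExactMatchResource extends AbstractLexicalResource {
  public double computeSimilarity(WordTag w1, WordTag w2) {
    // 1.0 means most similar, 0.0 least similar.
    return w1.word().equalsIgnoreCase(w2.word()) ? 1.0 : 0.0;
  }
}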
Right now, all our systems basically match one word to one word. There are some exceptions: multi-word idioms listed in WordNet are collapsed into one "word", so "kick the bucket" becomes "kick_the_bucket", which can then be matched with "die". But it would be nice if we had some way of doing multi-word matchings in general, especially since one of the other PASCAL entries (which one?) scored really well on the IR section using LSA and a bag-of-words approach.

One way to do this would be to create a separate bag-of-words scoring module, which would do things like 1) compute an LSA similarity score between weighted combinations of the word vectors, or 2) compute the mutual information between the phrases using Google counts. There are probably a number of ways to work such a module into the existing systems. One possibility is to match what we can with the existing inferers, and feed the leftovers to the bag-of-words scorer. Another possibility, for the recursive inferer, is to compute a bag-of-words score as well as a tree-matching score for each pair of subtrees. Our thoughts about this aren't fully fleshed out.
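A possible shape for such a module (purely hypothetical; neither the name nor the method exists in the codebase yet):

import java.util.List;

// Hypothetical interface for a bag-of-words scoring module.
public interface BagOfWordsScorer {
  // Returns a similarity in [0, 1] between two phrases, e.g. via LSA over
  // weighted word vectors, or mutual information from web counts.
  double score(List<String> passagePhrase, List<String> hypothesisPhrase);
}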
So far, the weighted abduction and graph matching models have tried to find the least-cost "proof". This has some awkward side effects. For instance, in matching "Tom jumped fast" with "Tom did not jump", there would be an enormous cost to match the two "jump"s, because of the negation. A "jump" might therefore try to unify with something silly, like "fast". Intuitively, the two "jump"s should be matched to each other, but that match should carry a really high cost.
The way the recursive inferer handles this is to keep two separate scores: a similarity score and a cost. Basically, if two terms are related, they should get a high similarity score, but if they turn out to be antonyms, or if one of them is negated, or if there's a hypernymy relation but in the wrong direction, they should get a high cost. The recursive scorer finds the matching with the highest similarity score, and then computes the cost of that matching. (The two scores use the same scorer; only a few weights are changed.)
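A purely illustrative sketch of the two-score idea (all names are made up; this is not the actual RecursiveTextInferer code):

import java.util.List;

public class TwoScoreExample {
  // Each candidate matching carries two numbers.
  static class Matching {
    double similarity; // high if the matched terms are related at all
    double cost;       // high for antonyms, negation mismatches, or
                       // hypernymy in the wrong direction
    Matching(double similarity, double cost) {
      this.similarity = similarity;
      this.cost = cost;
    }
  }

  // Pick the matching with the highest similarity, then report its cost,
  // rather than directly searching for the least-cost matching.
  static Matching chooseMatching(List<Matching> candidates) {
    Matching best = null;
    for (Matching m : candidates) {
      if (best == null || m.similarity > best.similarity) {
        best = m;
      }
    }
    return best;
  }
}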
This two-score approach probably raises accuracy slightly, but its main advantage is in error analysis. How do you know whether the system found the optimal proof of "Tom jumped" from "Tom didn't jump"? The question doesn't really make sense. Plus, different data sets have different standards for what is actually entailed. But it's intuitively obvious what is similar to what, and that doesn't depend on the data set. So, for tuning weights, a better strategy might be to first make sure the similarity scores are the best they can be, and given those, optimize the costs for the data set. This has already been done for the recursive inferer, but should be easy to add to the others.
There's an annoying problem with the way the RecursiveTextInferer handles reverse entailment. Currently, if the scorer is trying to score two trees that are negated, it returns the same score as if the passage and hypothesis were switched. This makes sense, because P implies Q iff not-Q implies not-P. But the cost of the overall "proof" is still the minimum of the costs of deriving the hypothesis from each of the subtrees of the passage. This has the unfortunate side effect that if the hypothesis is negated, and so is one of the leaves of the passage, the cost of the whole proof will be the cost of deriving that one leaf from the entire hypothesis, which is capped at the cost of assuming one predicate. The same problem will eventually arise in the other two systems as well. Fixing this requires thinking a little more about what the weights and similarity scores mean exactly, when entailment should happen in the reverse direction, and when the hypothesis can be derived from a subtree of the passage.
There's another detail the RecursiveTextInferer doesn't get right: sentences with negative quantifiers, such as "no children went to school". Right now, all that happens is that "children" is labeled false; this doesn't work, because "no children went to school" implies "no children went to school by car" but not vice versa. The obvious fix is to transform the sentences so that all the quantifiers are non-negative and the main verb is possibly negated: for instance, "it is not true that some children went to school."
Printing RTEDependencyGraphs

The RTEDependencyDag class has a toString() method which prints it out nicely: it starts with a node with no ancestors, usually a unique root, and prints all its descendants under it. This doesn't work for RTEDependencyGraph, however, since there can be cycles, and there aren't necessarily nodes with no ancestors. Right now, it just picks arbitrary nodes and prints them with their descendants, which looks really bad. Since there usually is still a unique root, there's probably some heuristic we could use to print it nicely in the usual case; for example, starting with a node that's not strictly dominated by any other node.
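A hypothetical sketch of that heuristic (Node and reachable() are made-up stand-ins for whatever RTEDependencyGraph actually provides):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PrintRootFinder {
  // Made-up node interface: reachable() returns every node you can get to
  // by following arcs from this one.
  interface Node {
    Set<Node> reachable();
  }

  // A node is a good starting point for printing if no other node strictly
  // dominates it, i.e. no m reaches n unless n also reaches m (the latter
  // case covers cycles).
  static List<Node> printRoots(Set<Node> nodes) {
    List<Node> roots = new ArrayList<Node>();
    for (Node n : nodes) {
      boolean dominated = false;
      for (Node m : nodes) {
        if (m == n) continue;
        if (m.reachable().contains(n) && !n.reachable().contains(m)) {
          dominated = true;
          break;
        }
      }
      if (!dominated) {
        roots.add(n);
      }
    }
    return roots;
  }
}

In the usual case, with a unique true root, this returns just that root; in a cyclic graph it returns the topmost mutually reachable nodes, which is still a reasonable place to start printing.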