As we've worked on the RTE system, we've tried to code clearly and document as much as possible along the way. Still, for someone new to the system, it's hard to know where to find everything. So we've put together this primer on the structure of the system, where to find things, and some technical details that might otherwise cause frustrating problems. This should tell you which classes are important; the JavaDocs should tell you the rest. At the end, we provide a list of some ideas we've had but haven't gotten around to implementing yet.
Conceptually, our system works in two stages: preprocessing and inferring. The preprocessing stage includes POS tagging, NER, parsing, transforming the parse trees into collapsed dependency trees, and a few other annotations. This is all done by the RTEEngine class, which takes as input a kbEval XML file and returns an XML "info file". The inferers then try to "prove" the hypothesis from the passage, using the information in these info files. Right now, we don't have a single script to run both stages, though one could easily be written. (Maybe make up a new system diagram?)
The RTEEngine class does all the classic NLP work. It reads in a kbEval XML file, whose format is hopefully pretty self-explanatory. These are kept in /u/nlp/rte/data/XMLInput/. The output is what we call an "info file", for lack of a better name; to get a feel for what these are, look at the ones in /u/nlp/rte/data/XMLInfoFiles/. Try running RTEEngine with one of these:
java -mx1000m edu.stanford.nlp.rte.RTEEngine -kbEvalIn myKBEvalCorpus.xml -XMLOut myOutFile.xml

java -mx1850m edu.stanford.nlp.rte.RTEEngine -useSRL -idf -augmentDependencies -polarity -kbEvalIn myKBEvalCorpus.xml -XMLOut myOutFile.xml
The official versions of the KB Eval development files are the files with names ending in _dev.kbe.xml in the directory /u/nlp/rte/data/input/kbeformat/.
The -idf, -augmentDependencies, and -polarity flags turn on some extra annotation, which is of uncertain value but which you may well want in order to get full output.
Some of the packages, especially MoveTrees, take a while to load, so be patient. If you want to write your own sentences to run RTEEngine on, try the following:
java -mx1000m edu.stanford.nlp.rte.RTEEngine -textFile mySentences.txt -XMLOut myOutFile.xml
Your sentences file should contain two lines: the first is the passage, and the second is the hypothesis. The InferenceInformation class handles the actual XML format; the RTE engine and the inferers all interface through it.
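For instance, a minimal mySentences.txt (contents invented for illustration) might look like:

Tom jumped fast.
Tom jumped.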
Currently, RTEEngine does the following (all in RTEEngine.processOneSentence()):

- POS tagging and NER; multi-word named entities are collapsed into single tokens (e.g., George Bush becomes George_Bush).
- Parsing, with the LexicalizedParser.
- Conversion of each parse tree into collapsed typed dependencies, using the EnglishGrammaticalRelations class. Tree.allTypedDependenciesCollapsed() returns a list of typed dependencies, which is not necessarily a tree, because extra dependencies are added in for control verbs, relative clauses, and such. If you want a tree without all the extra arcs, use Tree.allTypedDependenciesCollapsed(false).
- Polarity labeling: RuleBasedPolarityLabeler gives all verbs and nouns a label corresponding to whether the sentence implies that that event (or object) is true or false, or neither. It uses a set of human-defined rules, which can be found at /u/nlp/rte/data/resources/PolarityLabeler/polarity. The rules language is not extensively tested, so you might not want to play around much with the file, aside from relatively trivial changes.
- Assume-factor annotation, using /u/nlp/rte/src/parameters/ASSUMEFACTORSFILE__FULL.
- Dependency graph augmentation, done by DependencyGraphAugmentor, and only for the passage, not the hypothesis.

(Add stuff about transforming questions.)
You can convert an XML info file into HTML with the following command, assuming you have $JAVANLP_HOME/samples on your CLASSPATH:

java XSLTransform info.xml /u/apache/htdocs/projects/infer/info_to_html.xml info.html
Just to be confusing, words are stored as two different classes: MapLabel and FeatureLabel, both of which are subclasses of AbstractFeatureLabel. The differences are mostly technical, and most functions (word(), tag(), etc.) are the same for both classes. (Note: use word(), not value(), to get the word.) A sentence is just a list of MapLabels and FeatureLabels. The typed dependency structure is a List<TypedDependency>, but think of it as an unordered set. The TypedDependency class contains a relation, a governor, and a dependent. Get the (String) relation with reln(). To retrieve the index of the governor in the word table (the dependent is analogous), use

int gov = ((MapLabel)d.gov()).index();

Keep in mind that the word indices are stored from 1 to N, so to look up the actual word, use wordTable.get(wordIndex-1).
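Putting those pieces together, here is a sketch of walking the dependencies and recovering the actual words. (DependencyPrinter is a made-up helper, the package paths are our best guess, and we assume the word table is a List<MapLabel>; adjust as needed.)

import java.util.List;
import edu.stanford.nlp.ling.MapLabel;
import edu.stanford.nlp.trees.TypedDependency;

public class DependencyPrinter {
  // Prints each dependency as "reln(govWord, depWord)".
  // wordTable is the sentence's word list; deps its typed dependencies.
  public static void printDependencies(List<MapLabel> wordTable,
                                       List<TypedDependency> deps) {
    for (TypedDependency d : deps) {
      int gov = ((MapLabel) d.gov()).index();  // indices run from 1 to N
      int dep = ((MapLabel) d.dep()).index();
      // Subtract 1 because the List itself is 0-based.
      String govWord = wordTable.get(gov - 1).word();
      String depWord = wordTable.get(dep - 1).word();
      System.out.println(d.reln() + "(" + govWord + ", " + depWord + ")");
    }
  }
}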
Although the RTEEngine keeps a Tree object for the parse tree, when InferenceInformation reads an existing info file, it returns the Penn Treebank String representation. What if, for some reason, you want the Tree back? Check out Tree.valueOf(String).
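For example (the bracketed string here is made up):

import edu.stanford.nlp.trees.Tree;

public class TreeDemo {
  public static void main(String[] args) {
    // Rebuild a Tree from its Penn Treebank string representation.
    String penn = "(ROOT (S (NP (NNP Tom)) (VP (VBD jumped))))";
    Tree tree = Tree.valueOf(penn);
    System.out.println(tree.pennString());  // prints the bracketed form back out
  }
}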
Currently, we have three "text inferers", which are basically three ways of searching the space of possible matchings between dependency trees. (We're trying to break the habit of calling them "systems", since they are each a component of the larger system.) The preferred way to run them is the CorpusEvaluator class:
java -mx1000m edu.stanford.nlp.rte.CorpusEvaluator -info myInfoFile.xml -model [recursive|abduction|graph|combined]
If you want to see the results broken down by task, use the -tasks flag and supply the original PASCAL XML file. (The kbEval XML files don't contain the task labels.)
java -mx1000m edu.stanford.nlp.rte.CorpusEvaluator -info myInfoFile.xml -model [recursive|abduction|graph] -tasks myPascalFile.xml
The PASCAL XML files are kept in /u/nlp/rte/data/input/xmlformat/ and have pascal in their name. To run just one inference, use -inference myInferenceID.
For debugging purposes, you might want to compare the CorpusEvaluator outputs for two different runs to see which answers changed. For this, run
java edu.stanford.nlp.rte.CorpusEvaluatorDiff myOutputFile1 myOutputFile2
It will print all the answers that were correct in the first file but wrong in the second, and vice versa. This is useful if your accuracy suddenly sank 5% and you want to know why, or if you want to compare two inferers or two sets of parameters.
For each lexical resource we use, we have a class that extends AbstractLexicalResource. Basically, each individual resource has a computeSimilarity() function which takes two WordTags and returns a similarity score from 0 to 1, where 1 is most similar. (Yup. A third class that stores a word.) All of the pairwise similarity scores for each resource are stored in a cache, and AbstractLexicalResource knows how to interact with these caches. Ideally, we should only have to compute all the scores once for each data set.
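To add a new resource, then, you subclass AbstractLexicalResource. A hypothetical sketch, assuming computeSimilarity(WordTag, WordTag) returning a double is the only method you must override (check the actual class for the real signature and the caching hooks):

import edu.stanford.nlp.ling.WordTag;

// Hypothetical example resource: exact string match only.
public class ExactMatchResource extends AbstractLexicalResource {
  public double computeSimilarity(WordTag w1, WordTag w2) {
    // 1.0 means most similar, 0.0 least similar.
    return w1.word().equalsIgnoreCase(w2.word()) ? 1.0 : 0.0;
  }
}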
Right now, all our systems basically match one word to one word. There are some exceptions: multi-word idioms listed in WordNet are collapsed into one "word", so "kick the bucket" becomes "kick_the_bucket", which can then be matched with "die". But it would be nice if we had some way of doing multi-word matchings in general, especially since one of the other PASCAL entries (which one?) scored really well on the IR section using LSA and a bag-of-words approach.

One way to do this would be to create a separate bag-of-words scoring module, which would do things like 1) compute an LSA similarity score between weighted combinations of the word vectors, or 2) compute the mutual information between the phrases using Google counts. There are probably a number of ways to work such a module into the existing systems. One possibility is to match what we can with the existing inferers, and feed the leftovers to the bag-of-words scorer. Another possibility, for the recursive inferer, is to compute a bag-of-words score as well as a tree-matching score for each pair of subtrees. Our thoughts about this aren't fully fleshed out.
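A possible shape for such a module (purely hypothetical; neither the name nor the method exists in the codebase yet):

import java.util.List;

// Hypothetical interface for a bag-of-words scoring module.
public interface BagOfWordsScorer {
  // Returns a similarity in [0, 1] between two phrases, e.g. via LSA over
  // weighted word vectors, or mutual information from web counts.
  double score(List<String> passagePhrase, List<String> hypothesisPhrase);
}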
So far, the weighted abduction and graph matching models have tried to find the least-cost "proof". This has some awkward side effects. For instance, in matching "Tom jumped fast" with "Tom did not jump", there would be an enormous cost to match the two "jump"s, because of the negation. A "jump" might therefore try to unify with something silly, like "fast". Intuitively, the two "jump"s should be matched to each other, but that match should carry a really high cost.
The way the recursive inferer handles this is to keep two separate scores: a similarity score and a cost. Basically, if two terms are related, they should get a high similarity score, but if they turn out to be antonyms, or if one of them is negated, or if there's a hypernymy relation but in the wrong direction, they should get a high cost. The recursive scorer finds the matching with the highest similarity score, and then computes the cost of that matching. (The two scores use the same scorer; only a few weights are changed.)
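A purely illustrative sketch of the two-score idea (all names are made up; this is not the actual RecursiveTextInferer code):

import java.util.List;

public class TwoScoreExample {
  // Each candidate matching carries two numbers.
  static class Matching {
    double similarity; // high if the matched terms are related at all
    double cost;       // high for antonyms, negation mismatches, or
                       // hypernymy in the wrong direction
    Matching(double similarity, double cost) {
      this.similarity = similarity;
      this.cost = cost;
    }
  }

  // Pick the matching with the highest similarity, then report its cost,
  // rather than directly searching for the least-cost matching.
  static Matching chooseMatching(List<Matching> candidates) {
    Matching best = null;
    for (Matching m : candidates) {
      if (best == null || m.similarity > best.similarity) {
        best = m;
      }
    }
    return best;
  }
}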
This two-score approach probably raises accuracy slightly, but its main advantage is in error analysis. How do you know whether the system found the optimal proof of "Tom jumped" from "Tom didn't jump"? The question doesn't really make sense. Plus, different data sets have different standards for what is actually entailed. But it's intuitively obvious what is similar to what, and that doesn't depend on the data set. So, for tuning weights, a better strategy might be to first make sure the similarity scores are the best they can be, and given those, optimize the costs for the data set. This has already been done for the recursive inferer, but should be easy to add to the others.
There's an annoying problem with the way the RecursiveTextInferer handles reverse entailment. Currently, if the scorer is trying to score two trees that are negated, it returns the same score as if the passage and hypothesis were switched. This makes sense, because P implies Q iff not-Q implies not-P. But the cost of the overall "proof" is still the minimum of the costs of deriving the hypothesis from each of the subtrees of the passage. This has the unfortunate side effect that if the hypothesis is negated, and so is one of the leaves of the passage, the cost of the whole proof will be the cost of deriving that one leaf from the entire hypothesis, which is capped at the cost of assuming one predicate. The same problem will eventually arise in the other two systems as well. Fixing this requires thinking a little more about what the weights and similarity scores mean exactly, when entailment should happen in the reverse direction, and when the hypothesis can be derived from a subtree of the passage.
There's another detail the RecursiveTextInferer doesn't get right: sentences with negative quantifiers, such as "no children went to school". Right now, all that happens is that "children" is labeled false; this doesn't work, because "no children went to school" implies "no children went to school by car" but not vice versa. The obvious fix is to transform the sentences so that all the quantifiers are non-negative and the main verb is possibly negated: for instance, "it is not true that some children went to school."
Printing RTEDependencyGraphs

The RTEDependencyDag class has a toString() method which prints it out nicely: it starts with a node with no ancestors, usually a unique root, and prints all its descendants under it. This doesn't work for RTEDependencyGraph, however, since there can be cycles, and there aren't necessarily nodes with no ancestors. Right now, it just picks arbitrary nodes and prints them with their descendants, which looks really bad. Since there usually is still a unique root, there's probably some heuristic we could use to print it nicely in the usual case; for example, starting with a node that's not strictly dominated by any other node.
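A hypothetical sketch of that heuristic (Node and reachable() are made-up stand-ins for whatever RTEDependencyGraph actually provides):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PrintRootFinder {
  // Made-up node interface: reachable() returns every node you can get to
  // by following arcs from this one.
  interface Node {
    Set<Node> reachable();
  }

  // A node is a good starting point for printing if no other node strictly
  // dominates it, i.e. no m reaches n unless n also reaches m (the latter
  // case covers cycles).
  static List<Node> printRoots(Set<Node> nodes) {
    List<Node> roots = new ArrayList<Node>();
    for (Node n : nodes) {
      boolean dominated = false;
      for (Node m : nodes) {
        if (m == n) continue;
        if (m.reachable().contains(n) && !n.reachable().contains(m)) {
          dominated = true;
          break;
        }
      }
      if (!dominated) {
        roots.add(n);
      }
    }
    return roots;
  }
}

In the usual case, with a unique true root, this returns just that root; in a cyclic graph it returns the topmost mutually reachable nodes, which is still a reasonable place to start printing.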