Recursive Text Inferer

Here's a summary of many of the ideas we (Teg, Anna, Josh, and I, in various combinations) have toyed with over the past couple of weeks. Most of these are still in the formative stages, although the core algorithm and parts of the scorer are already implemented.

The recursive model

We can represent a sentence as a typed directed acyclic graph whose vertices are words of the sentence (or named entities), and whose arcs are the relations between the words. Basically, these graphs are dependency trees, with some extra edges added in for coreference and some kinds of modifiers. (The acyclicity property is important for the dynamic programming algorithm, and we can express most important grammatical relations without cycles.)

We'll define a subdag (I'd love a better word for this) as a node of the dag together with all of its descendants and the arcs connecting them. (Basically a subtree, except not a tree.) Intuitively, a subdag corresponds to some kind of phrase, so we can speak loosely of entailment between subdags. I don't have a good formal definition, but essentially, phrase A entails phrase B if everything that satisfies A must satisfy B. For instance, "blue car with four wheels" implies "car", and "before 1970" implies "before 1980".

This isn't quite so intuitive for noun phrases involving quantifiers, so let A and B be noun phrases: A implies B iff whenever a VP is true with A as the subject, it is also true with B as the subject (all other arguments staying the same). For instance, "all cars" implies "all blue cars", but "some gray cars" implies "some cars". We say that one subdag entails another whenever the phrases they correspond to stand in this relation.
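
Concretely, a subdag might be represented along these lines (a Python sketch with names I'm making up for this note; the relation labels are just illustrative, and the real data structures in RTEEngine are presumably richer):

from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    word: str                                      # word or named entity at this vertex
    children: list = field(default_factory=list)   # outgoing arcs: (relation, Node) pairs

def subdag_nodes(root):
    """Every node reachable from root, i.e. the subdag headed by root."""
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        if node not in seen:                       # identity comparison; dags may share nodes
            seen.append(node)
            stack.extend(child for _, child in node.children)
    return seen

# "blue car with four wheels":
car = Node("car", [("amod", Node("blue")),
                   ("prep_with", Node("wheels", [("num", Node("four"))]))])
print([n.word for n in subdag_nodes(car)])         # all four words of the phrase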

The key feature of the recursive model is that the cost of matching two subdags X and Y is defined in terms of the root nodes of X and Y, the relations linking each root to its children, and the kinds of entailment that hold between those children. In other words, you only need to reach at most one level down to determine the cost of matching two subdags. Here's an example. If we wanted to prove that "all dogs bark loudly" implies "all brown dogs bark loudly", our passage and hypothesis dags (which are actually trees) might look like this:

bark
  nsubj -> all
    ant -> dogs
  advmod -> loudly
bark
  nsubj -> all
    ant -> dogs
      amod -> brown
  advmod -> loudly
To explain my confusing notation: the topmost line is the root node, and everything indented one level under it is a child of it. The part before the arrow is the type of the dependency from the node's parent, and the part after the arrow is the word at that node.

In this example, we handle quantifiers slightly differently than the current dependency trees; the quantifier takes the NOM as a dependent, instead of the other way around. This works better for the recursive model, although there are other equivalent ways to represent quantifiers we could use instead. But what's important is that, from the scorer's perspective, when it's trying to determine that the first subdag entails the second, here is what it sees:

bark
  nsubj -> A
  advmod -> B
bark
  nsubj -> C
  advmod -> B

The scorer should also have at its disposal the knowledge that A entails C with cost 0. The key assumption, then, is that all of the entailment rules are defined with just this information in mind. Now for how to do it.
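
To make the locality idea concrete, here is a toy local rule in that spirit (just a sketch with a made-up cost scheme, nothing like the real Weighted Abduction costs; it greedily requires every hypothesis child to be entailed by some passage child attached under the same relation):

BIG = 1000.0    # effectively "no entailment"

def local_match_cost(p_word, p_children, h_word, h_children, child_cost):
    """Toy local rule: p_children and h_children are (relation, child_key)
    pairs one level deep; child_cost(p_key, h_key) returns the cost of the
    already-scored entailment between two children."""
    cost = 0.0 if p_word == h_word else BIG
    for h_rel, h_key in h_children:
        # every hypothesis child must be entailed by some passage child
        # attached by the same relation
        options = [child_cost(p_key, h_key)
                   for p_rel, p_key in p_children if p_rel == h_rel]
        cost += min(options) if options else BIG
    return cost

# The example above: A = "all dogs", C = "all brown dogs", B = "loudly",
# and we already know that A entails C with cost 0.
known = {("A", "C"): 0.0, ("B", "B"): 0.0}
cost = local_match_cost("bark", [("nsubj", "A"), ("advmod", "B")],
                        "bark", [("nsubj", "C"), ("advmod", "B")],
                        lambda p, h: known.get((p, h), BIG))
print(cost)   # 0.0 -- the passage subdag entails the hypothesis subdag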

The dynamic programming algorithm

The basic idea of the algorithm is to start by matching the various leaves in both directions, and work your way up, keeping a table of all the costs. More specifically,

P <- TOPOLOGICAL-SORT(passage dag)
H <- TOPOLOGICAL-SORT(hypothesis dag)
for i = 1 to len(P)
  do for j = 1 to len(H)
    do ForwardMatchScores[i][j] <- BEST-MATCH(P[i], H[j])
       BackwardMatchScores[i][j] <- BEST-MATCH(H[j], P[i])
best_match_score <- max_i(ForwardMatchScores[i][len(H)])      // The last item in H is the root of the hypothesis.
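
In Python, the table-filling loop might look like the following sketch (reusing the Node class from earlier; I use costs here, so lower is better and the final step is a min, and BEST-MATCH is passed in as a function, with one brute-force possibility sketched further down):

def postorder(root):
    """Nodes reachable from root, each child listed before its parents,
    so the root itself comes last (the order the pseudocode assumes)."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for _, child in node.children:
            visit(child)
        order.append(node)
    visit(root)
    return order

def match_dags(passage_root, hypothesis_root, best_match):
    """best_match(x, y, table) returns (cost, matched child pairs) for the
    subdag headed by x entailing the one headed by y, consulting table only
    for pairs of their children."""
    P, H = postorder(passage_root), postorder(hypothesis_root)
    forward, backward = {}, {}
    for p in P:                      # children always precede parents, so the
        for h in H:                  # child entries are filled before we need them
            forward[(id(p), id(h))] = best_match(p, h, forward)
            backward[(id(h), id(p))] = best_match(h, p, backward)
    h_root = H[-1]                   # the last node of H is the hypothesis root
    best_p = min(P, key=lambda p: forward[(id(p), id(h_root))][0])
    return forward[(id(best_p), id(h_root))][0], best_p, forward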

The actual scoring inside BEST-MATCH (COMPUTE-SCORE) is done by a separate scoring module, which wraps around the Weighted Abduction scorer. To keep track of the proof, we can save, for each table entry, which subdags were matched to arrive at the best match. Then, to print the proof, we can recursively descend the table of best matchings.
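
For instance, if each table entry stores the matched child pairs alongside its cost, as in the match_dags sketch above, printing the proof is just a recursive walk:

def print_proof(p, h, forward, indent=0):
    """Recursively descend the forward table, printing which passage subdag
    was matched to which hypothesis subdag, and at what cost."""
    cost, matched_children = forward[(id(p), id(h))]
    print("  " * indent + "%s => %s  (cost %g)" % (p.word, h.word, cost))
    for p_child, h_child in matched_children:
        print_proof(p_child, h_child, forward, indent + 1)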

Because the general tree matching problem reduces to this one, the problem is NP-hard in the general case. However, if we assume that the number of children of any given node does not grow with the length of the sentence, the algorithm runs in O(mn), where m and n are the lengths of the two sentences. This is true because BEST-MATCH operates locally, and its running time depends only on the number of children of the nodes being matched. Of course, since BEST-MATCH is exponential in the number of children it has to align, the constant factor may grow large quickly as we start adding more edges to the dag. However, if we relax the requirement that we find the absolute optimal proof, some heuristics can surely keep the local search time down without too much loss in accuracy.
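
For concreteness, here is one brute-force BEST-MATCH compatible with the match_dags sketch (again with stand-in word costs rather than the real scorer); the permutation loop is exactly the exponential local step:

from itertools import permutations

BIG = 1000.0

def word_cost(x_word, y_word):
    return 0.0 if x_word == y_word else BIG      # stand-in lexical cost

def best_match(x, y, table):
    """Cost of the subdag headed by x entailing the one headed by y,
    trying every injective assignment of y's children to x's children."""
    x_kids, y_kids = x.children, y.children
    best_cost, best_align = float("inf"), []
    slots = x_kids + [None] * len(y_kids)        # None = leave a y child unmatched
    for assignment in permutations(slots, len(y_kids)):
        cost, align = word_cost(x.word, y.word), []
        for (y_rel, y_child), slot in zip(y_kids, assignment):
            if slot is None:
                cost += BIG                      # unmatched hypothesis child
                continue
            x_rel, x_child = slot
            child_cost = table[(id(x_child), id(y_child))][0]
            cost += child_cost + (0.0 if x_rel == y_rel else BIG)
            align.append((x_child, y_child))
        if cost < best_cost:
            best_cost, best_align = cost, align
    return best_cost, best_align

Replacing the permutation loop with, say, a greedy assignment of each hypothesis child to its cheapest compatible passage child would be one such heuristic.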

Dependency Dags and Extra Arcs

So far, the above algorithm fails to find proofs in many cases where the dependency trees don't match exactly. We plan to solve this by adding extra edges to the dependency tree to deal with cases like the following:

These extra arcs would help the existing Weighted Abduction and Graph Matching systems somewhat, but they are essential to the recursive approach.

Negation

We thought of several ways to handle negation, but here's my favorite. I'm probably mangling this terminology, but we can think of verb phrases as representing an "eventuality", an event which may or may not actually exist. For example, "pigs fly" refers to a flying eventuality involving pigs, which exists in the sentence "Pigs fly" but not in the fragment "When pigs fly." Using this as our intuition, we should aim to label each of the verbs in the dags by whether their eventualities are true, false, or unknown. For example, in the sentence "I know the New York Times did not report that Bigfoot exists", the knowing action is true, the reporting action is false, and the existing action is unknown.

By extension, we could probably label the nouns in the same way, though I'm somewhat less clear on what this would mean semantically. But it would be useful for nominalizations, and if we want to conclude from "I saw an elephant in the garden" that "There is an elephant in the garden".

This labeling can be done by RTEEngine as part of the wordInfo. That way, it can be reused by all three systems. The processing can probably be done in a top-down manner, where the root node is assigned "true" unless it has a "not" attached, and the polarity of each node depends on the word and polarity of its parent.
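
A sketch of what such a top-down pass could look like (the word lists, the ccomp test, and the factive/non-factive split are placeholders for illustration, not anything RTEEngine currently has):

NEGATORS = {"not", "never"}                            # flip the polarity
NONFACTIVE = {"report", "say", "claim", "believe"}     # complement becomes unknown

def label_polarity(node, labels=None, inherited="true"):
    """Label every node 'true', 'false', or 'unknown', working top down."""
    if labels is None:
        labels = {}
    negated = any(child.word in NEGATORS for _, child in node.children)
    if inherited == "unknown":
        polarity = "unknown"
    elif negated:
        polarity = "false" if inherited == "true" else "true"
    else:
        polarity = inherited
    labels[id(node)] = polarity
    for rel, child in node.children:
        # clausal complements of non-factive verbs become unknown; everything
        # else just inherits the polarity of its parent
        start = "unknown" if (rel == "ccomp" and node.word in NONFACTIVE) else polarity
        label_polarity(child, labels, start)
    return labels

On a tree for the Bigfoot sentence where "not" hangs off "report" and the complements are attached by ccomp, this assigns "true" to the knowing, "false" to the reporting, and "unknown" to the existing, as desired.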

Another way to handle negation would be to rearrange the dags so the negative takes the predicate as an argument, rather than the other way around. If we think of the verb phrase as a set of situations, we can think of the subdag headed by "not" as the complement of that set.

World/Lexical Knowledge Module

If we're dealing with a passage and hypothesis that are 30 words long and the only lexical knowledge we need is that "kicked the bucket" implies "died", it would be expensive for a lexical knowledge module to run on the entire sentences. It would be helpful if we could somehow determine which pair of subdags would be most important to match. Then, it would be easier to keep down the size of the knowledge base fed in, and to perform the actual search.

This requires estimating two values: V, the reduction in total proof cost from reducing the cost of a particular match to 0, and P, the probability that the lexical/WK module will reduce the cost to 0. The expected gain would then be VP. P could be estimated by training a classifier on trees using surface characteristics such as tree size. V is hard to compute exactly, since if the cost of matching a particular pair is reduced to 0, the optimal proof could change. But I have a feeling the recursive structure of this model might make it possible to compute a rough estimate.
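
Written out, the selection criterion is just the following (estimate_value and estimate_probability stand in for whatever estimators we end up building; nothing here exists yet):

def pick_queries(candidate_pairs, estimate_value, estimate_probability, k=5):
    """Rank candidate (passage subdag, hypothesis subdag) pairs by expected
    gain V * P and return the k most promising ones to hand to the
    lexical / world-knowledge module."""
    def expected_gain(pair):
        p_subdag, h_subdag = pair
        v = estimate_value(p_subdag, h_subdag)           # proof-cost reduction if this match became free
        prob = estimate_probability(p_subdag, h_subdag)  # chance the module can actually make it free
        return v * prob
    return sorted(candidate_pairs, key=expected_gain, reverse=True)[:k]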