This reannotation effort was undertaken in early 2006 to produce a more consistent and usable set of gold-standard annotations for the data assembled during the AQUAINT Phase 2 Knowledge-based Inference Pilot. It was done at Stanford University (as part of the AQUINAS project). The goal was to reannotate all the KB Eval answers into a consistent answer format that was easy to use and had well-defined scoring metrics.

We adopted a single answer for each question. This is not uncontroversial, but it addresses two issues. Firstly, under the original annotation scheme most questions should have had many answers, but most data sets provided only one or two, and this incompleteness made the answers difficult to use. Secondly, when multiple answers are given, there is no indication of which is the preferred answer, understood as the one that a normal reader would be expected to arrive at if asked the question after reading the passage.

We also sought to assess and increase the reliability of the answers. All items were independently annotated by two annotators (at least one of whom, but commonly only one of whom, was a native English speaker). The results at this stage were used to calculate agreement statistics. The same two annotators then conferred and came up with a consensus "best" answer, which is what now appears in the data file.

Below are brief instructions for how the data was reannotated, corresponding to the information given to annotators about the task.
The goals are to have: a single, consistent answer for each question; an answer format that is easy to use and has well-defined scoring metrics; and answers whose reliability has been assessed.
Assigning a problem to one of the five response labels will necessarily be somewhat subjective. The paragraphs below are an attempt to describe the intended interpretation of each of the five categories. However, this should be regarded as a living document; these guidelines may be revised as we gain experience with the reannotation.
The annotation follows the proposal under which there are five possible responses for a question: yes, likely, unknown, unlikely, and no. Note that we use the labels "likely" and "unlikely" to keep the labels to one word, but the sense we are trying to capture is closer to "extremely likely" and "extremely unlikely", respectively.
The motivation for this answer annotation is to preserve some of the ideas of the KBEval standard while simultaneously simplifying things. The ideas that we preserve are differentiating contradictions or likely contradictions ("no" and "unlikely") from text that is irrelevant or gives insufficiently precise information, and differentiating necessary consequences of the text from things that are merely likely to follow (as implicatures, etc.). Linguistic and world knowledge is, however, no longer distinguished in the answer that would be used for scoring. I think that's okay, since there are many uncertain cases. Note that, with this standard, we can also evaluate things "Pascal-style" by lumping categories 1-2 vs. 3-5; we can do plausible reasoning with "no" answers by lumping 1-2 vs. 3 vs. 4-5; and someone interested only in strict entailments could lump 1 vs. 2-4 vs. 5.
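To make these groupings concrete, here is a small illustrative sketch (the code and names are mine, not part of the standard) of how a response could be scored against a gold label under each lumping:

    # Illustration of the three lumped scoring schemes described above.
    # Category numbers: 1 = yes, 2 = likely, 3 = unknown, 4 = unlikely, 5 = no.
    LUMPINGS = {
        "pascal-style (1-2 vs. 3-5)":    {"yes": 0, "likely": 0, "unknown": 1, "unlikely": 1, "no": 1},
        "plausible (1-2 vs. 3 vs. 4-5)": {"yes": 0, "likely": 0, "unknown": 1, "unlikely": 2, "no": 2},
        "strict (1 vs. 2-4 vs. 5)":      {"yes": 0, "likely": 1, "unknown": 1, "unlikely": 1, "no": 2},
    }

    def correct(gold, response, scheme):
        """A response counts as correct if it falls in the same lumped class as the gold label."""
        mapping = LUMPINGS[scheme]
        return mapping[gold] == mapping[response]

    # Under the Pascal-style grouping, "likely" matches a gold "yes", but "unknown" does not.
    assert correct("yes", "likely", "pascal-style (1-2 vs. 3-5)")
    assert not correct("yes", "unknown", "pascal-style (1-2 vs. 3-5)")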
Here's another thought about how to distinguish the "likely" category from "yes" or "unknown" (and similarly for "unlikely"):
To the extent that there is uncertainty (probability) involved in appropriate applications of the "likely" label, it seems to be uncertainty about the intended meaning of the passage, not about whether the event or proposition described is factual.
Here's an illustrative example from the Cycorp data set. (The <because> and <assumptions> elements are from the original.)
<inference id="Cycorp-014">
  <passage>The Paulsons celebrated their 25th anniversary on June 14, 2004.</passage>
  <question>Did the Paulsons get married on Flag Day?</question>
  <provenance type="edited">unknown-source</provenance>
  <answer id="1" polarity="true" force="plausible" source="world">
    <response>likely</response>
    <because>Flag Day is June 14th, but the celebration may have been observed on a different day than the actual anniversary.</because>
    <assumptions>Whether one regards this response as strictly true or plausibly true depends on whether the passage is taken as describing an actual celebration--which may not have been observed on the anniversary date--or simply as a way of saying that this anniversary did in fact occur for the couple.</assumptions>
  </answer>
</inference>
In this example, there's no uncertainty about the events -- the uncertainty is about the precise meaning intended for "celebrate". It seems likely that the author intended not merely that they had a celebration on June 14, but that June 14 was in fact their anniversary. So the hypothesis follows from the most natural reading, but it is not inescapable.
Appropriate applications of the "likely" label often involve implicatures, and it is a hallmark of implicatures that they can be explicitly cancelled. Thus, for example, we could imagine the passage sentence above being modified to end with "even though their anniversary was actually on June 13". With this cancellation, the hypothesis no longer follows.
So perhaps some useful diagnostic tests are: (1) whether the remaining uncertainty concerns the intended meaning of the passage rather than the facts of the situation it describes, and (2) whether the inference can be explicitly cancelled by adding material to the passage. If both hold, "likely" (or "unlikely") is probably the right label.
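Incidentally, for anyone working with the data files programmatically, here is a minimal Python sketch of pulling the main fields out of records in the format illustrated by the Cycorp example above. It uses only the standard xml.etree.ElementTree module and assumes a file whose root element contains a sequence of <inference> children; the actual root tag and file layout may differ, so treat this as an illustration rather than a description of the released files.

    # Sketch: read KBEval-style <inference> records with xml.etree.ElementTree.
    # Assumes the root element contains <inference> children as in the example above.
    import xml.etree.ElementTree as ET

    def read_inferences(path):
        """Yield (id, passage, question, response) tuples from an XML data file."""
        root = ET.parse(path).getroot()
        for inf in root.iter("inference"):
            answer = inf.find("answer")
            yield (
                inf.get("id"),
                inf.findtext("passage"),
                inf.findtext("question"),
                answer.findtext("response") if answer is not None else None,
            )

    # Hypothetical usage:
    # for item in read_inferences("cycorp_dev.kbe.xml"):
    #     print(item)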
Data | Annotator 1 name | Annotator 1 status | Annotator 2 name | Annotator 2 status | Differences adjudicated |
PARC Dev | Bill MacCartney | done | Dan Cer | done | done |
Stanford Dev | Marie-Catherine de Marneffe | done | Dan Cer | done | done |
MIT Dev | Marie-Catherine de Marneffe | done | Teg Grenager | done | done |
Cycorp Dev | Bill MacCartney | done | Teg Grenager | done | done |
LCC-Harabagiu Dev | Dan Ramage | done | Pi-chuan Chang | done | done |
ATM Dev (Arizona State University) | Dan Ramage | done | Anna Rafferty | done | done |
Brandeis Dev | Christopher Manning | done | Rajat Raina | done | done |
LCC-Moldovan Dev | Christopher Manning | done | Rajat Raina | done | done |
University of Texas, Dallas and ICSI | Pi-chuan Chang | done | Anna Rafferty | done | done |
Two annotators independently annotated each data set. Below we show the confusion matrix and agreement rates between these two initial annotations, for each data set and for all data sets combined. The annotators then chose a consensus label for each item on which they disagreed. Presumably, each annotator's agreement rate with the final consensus annotation would in general be higher than the numbers shown here (but we have not currently calculated these numbers).
Confusion matrix for files atm_dev.dramage.question_normalized.kbe.xml and atm_dev.rafferty.question_normalized.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 9 | 0 | 1 | 0 | 0 |
likely | 1 | 12 | 0 | 0 | 0 |
unknown | 0 | 0 | 3 | 1 | 0 |
unlikely | 0 | 0 | 0 | 7 | 0 |
no | 0 | 0 | 0 | 0 | 3 |

Confusion matrix for files brandeis_dev.manning.kbe.xml and brandeis_dev.rajatr.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 10 | 8 | 2 | 0 | 0 |
likely | 0 | 1 | 3 | 0 | 0 |
unknown | 0 | 3 | 6 | 1 | 0 |
unlikely | 0 | 0 | 0 | 3 | 0 |
no | 0 | 0 | 0 | 3 | 2 |

Confusion matrix for files cycorp_dev.wcmac.kbe.xml and cycorp_dev.teg.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 8 | 1 | 2 | 0 | 0 |
likely | 0 | 5 | 2 | 0 | 0 |
unknown | 0 | 1 | 3 | 1 | 0 |
unlikely | 0 | 0 | 3 | 1 | 0 |
no | 0 | 1 | 1 | 3 | 4 |

Confusion matrix for files lcch_dev.dramage.xml and lcch_dev.pichuan.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 13 | 0 | 0 | 0 | 0 |
likely | 2 | 2 | 0 | 0 | 0 |
unknown | 1 | 1 | 6 | 1 | 0 |
unlikely | 1 | 0 | 2 | 3 | 0 |
no | 0 | 0 | 0 | 1 | 5 |

Confusion matrix for files lccm_dev.manning.kbe.xml and lccm_dev.rajatr.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 6 | 3 | 0 | 0 | 0 |
likely | 0 | 6 | 5 | 0 | 0 |
unknown | 0 | 1 | 1 | 0 | 0 |
unlikely | 0 | 0 | 0 | 0 | 0 |
no | 0 | 0 | 1 | 2 | 5 |

Confusion matrix for files mit_dev.mcdm.kbe.xml and mit_dev.teg.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 9 | 4 | 0 | 0 | 0 |
likely | 3 | 1 | 3 | 0 | 0 |
unknown | 0 | 0 | 4 | 0 | 0 |
unlikely | 0 | 0 | 2 | 1 | 0 |
no | 0 | 0 | 0 | 1 | 2 |

Confusion matrix for files parc_dev.wcmac.kbe.xml and parc_dev.cerd.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 38 | 2 | 0 | 0 | 0 |
likely | 2 | 0 | 2 | 0 | 0 |
unknown | 0 | 5 | 14 | 0 | 0 |
unlikely | 0 | 0 | 0 | 2 | 0 |
no | 0 | 0 | 0 | 0 | 11 |

Confusion matrix for files stanford_dev.mcdm.kbe.xml and stanford_dev.cerd.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 8 | 2 | 0 | 0 | 0 |
likely | 5 | 2 | 0 | 1 | 0 |
unknown | 0 | 1 | 4 | 0 | 0 |
unlikely | 0 | 0 | 1 | 2 | 1 |
no | 0 | 0 | 0 | 0 | 3 |

Confusion matrix for files utdicsi_dev.pichuan.kbe.xml and utdicsi_dev.rafferty.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 7 | 5 | 0 | 0 | 0 |
likely | 0 | 2 | 0 | 0 | 0 |
unknown | 0 | 2 | 9 | 2 | 0 |
unlikely | 0 | 0 | 0 | 1 | 0 |
no | 0 | 0 | 1 | 1 | 7 |

Aggregated confusion matrix for all files:
 | yes | likely | unknown | unlikely | no |
1 yes | 108 | 25 | 5 | 0 | 0 |
2 likely | 13 | 31 | 15 | 1 | 0 |
3 unknown | 1 | 14 | 50 | 6 | 0 |
4 unlikely | 1 | 0 | 8 | 20 | 1 |
5 no | 0 | 1 | 3 | 11 | 42 |
Overall, for the 5-way classification (aligning the first and second annotators as shown in the above table), the annotators agreed on 251/356 questions = 70.51% of the time. This gives a kappa statistic of 0.61. As the confusion matrix shows, the greatest confusions are between "yes" and "likely" and between "likely" and "unknown" (but this is also the region of the scale where the majority of the data lies).
It is also useful to consider lumping together categories, in the way discussed above. If we lump categories 1&2 vs. 3&4&5, we agreed on 318/356 = 89.33% (kappa = 0.78). If we lump categories 1&2 vs. 3 vs. 4&5, we agreed on 301/356 = 84.55% (kappa = 0.74). If we lump categories 1 vs. 2&3&4 vs. 5, we agreed on 295/356 = 82.87% (kappa = 0.72). This result is at least suggestive of there being greater agreement on the "Pascal" annotation task (1&2 vs. 3&4&5) than on other natural classifications of the data.
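As a cross-check on these figures, here is a minimal Python sketch (the code and helper names are my own, not part of the original materials) that recomputes observed agreement and Cohen's kappa from the aggregated confusion matrix above, for the 5-way task and for each of the lumped groupings:

    # Recompute agreement statistics from the aggregated 5-way confusion matrix above
    # (label order: yes, likely, unknown, unlikely, no).
    MATRIX = [
        [108, 25,  5,  0,  0],   # yes
        [ 13, 31, 15,  1,  0],   # likely
        [  1, 14, 50,  6,  0],   # unknown
        [  1,  0,  8, 20,  1],   # unlikely
        [  0,  1,  3, 11, 42],   # no
    ]

    def agreement_and_kappa(matrix):
        """Observed agreement and Cohen's kappa for a square confusion matrix."""
        n = sum(sum(row) for row in matrix)
        row_totals = [sum(row) for row in matrix]
        col_totals = [sum(col) for col in zip(*matrix)]
        p_observed = sum(matrix[i][i] for i in range(len(matrix))) / n
        p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
        return p_observed, (p_observed - p_expected) / (1 - p_expected)

    def lump(matrix, groups):
        """Collapse the matrix by summing over each group of category indices."""
        return [[sum(matrix[a][b] for a in gi for b in gj) for gj in groups] for gi in groups]

    for name, groups in [
        ("5-way",            [[0], [1], [2], [3], [4]]),
        ("1&2 vs 3&4&5",     [[0, 1], [2, 3, 4]]),
        ("1&2 vs 3 vs 4&5",  [[0, 1], [2], [3, 4]]),
        ("1 vs 2&3&4 vs 5",  [[0], [1, 2, 3], [4]]),
    ]:
        p_o, k = agreement_and_kappa(lump(MATRIX, groups))
        print("%-18s agreement = %.2f%%  kappa = %.2f" % (name, 100 * p_o, k))

Run as written, this reproduces the figures reported above: 70.51%/0.61 for the 5-way task, and 89.33%/0.78, 84.55%/0.74, and 82.87%/0.72 for the three groupings.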
This was a pilot exercise: with some clarification of the guidelines, it appears very likely that annotator agreement on at least the 1&2 vs. 3&4&5 decision could be pushed above a kappa of 0.8, which in practice is often taken as the minimum threshold for good interannotator reliability in coding.
Last modified: Mon Feb 13 15:31:39 PST 2006