This reannotation effort was undertaken in early 2006 to produce a more consistent and usable set of gold-standard annotations for the data assembled during the AQUAINT Phase 2 Knowledge-based Inference Pilot. It was done at Stanford University (as part of the AQUINAS project). The goal was to reannotate all the KB Eval answers into a consistent answer format that was easy to use and had well-defined scoring metrics.

We adopted a single answer for each question. This is not uncontroversial, but it addresses two issues. Firstly, under the original annotation scheme most questions should have had many answers, but most data sets provided only one or two, and this incompleteness made the answers difficult to use. Secondly, when multiple answers are given, there is no indication of which is the preferred answer, understood as the one that a normal reader would be expected to arrive at if asked the question after reading the passage.

We also sought to assess and increase the reliability of the answers. All items were independently annotated by two annotators (at least one of whom, but commonly only one of whom, was a native English speaker). The results at this stage were used to calculate agreement statistics. The same two annotators then conferred and came up with a consensus "best" answer, which is what now appears in the data file.

Below are brief instructions for how the data was reannotated, corresponding to the information given to annotators about the task.
The goals are to have: a single, consistent answer for each question; an answer format that is easy to use and has well-defined scoring metrics; and answers whose reliability has been assessed.
Assigning a problem to one of the five response labels will necessarily be somewhat subjective. The paragraphs below are an attempt to describe the intended interpretation of each of the five categories. However, this should be regarded as a living document; these guidelines may be revised as we gain experience with the reannotation.
The annotation follows the proposal under which there are five possible responses for a question: yes, likely, unknown, unlikely, and no. Note that we use the labels "likely" and "unlikely" to keep the labels to one word, but the sense we are trying to capture is closer to "extremely likely" and "extremely unlikely", respectively.
The motivation for this answer annotation is to preserve some of the ideas of the KBEval standard while simultaneously simplifying things. The ideas that we preserve are differentiating contradictions or likely contradictions ("no" and "unlikely") from text that is irrelevant or gives insufficiently precise information, and differentiating necessary consequences of the text from things that are merely likely to follow (as implicatures, etc.). Linguistic and world knowledge is, however, no longer distinguished in the answer that would be used for scoring. I think that's okay, since there are many uncertain cases. Note that, with this standard, we can also evaluate things "Pascal-style" by lumping categories 1-2 vs. 3-5; we can do plausible reasoning with "no" answers by lumping 1-2 vs. 3 vs. 4-5; and someone interested only in strict entailments could lump 1 vs. 2-4 vs. 5.
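To make these groupings concrete, here is a small illustrative sketch (the code and names are mine, not part of the standard) of how a response could be scored against a gold label under each lumping:

    # Illustration of the three lumped scoring schemes described above.
    # Category numbers: 1 = yes, 2 = likely, 3 = unknown, 4 = unlikely, 5 = no.
    LUMPINGS = {
        "pascal-style (1-2 vs. 3-5)":    {"yes": 0, "likely": 0, "unknown": 1, "unlikely": 1, "no": 1},
        "plausible (1-2 vs. 3 vs. 4-5)": {"yes": 0, "likely": 0, "unknown": 1, "unlikely": 2, "no": 2},
        "strict (1 vs. 2-4 vs. 5)":      {"yes": 0, "likely": 1, "unknown": 1, "unlikely": 1, "no": 2},
    }

    def correct(gold, response, scheme):
        """A response counts as correct if it falls in the same lumped class as the gold label."""
        mapping = LUMPINGS[scheme]
        return mapping[gold] == mapping[response]

    # Under the Pascal-style grouping, "likely" matches a gold "yes", but "unknown" does not.
    assert correct("yes", "likely", "pascal-style (1-2 vs. 3-5)")
    assert not correct("yes", "unknown", "pascal-style (1-2 vs. 3-5)")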
Here's another thought about how to distinguish the "likely" category from "yes" or "unknown" (and similarly for "unlikely"):
To the extent that there is uncertainty (probability) involved in appropriate applications of the "likely" label, it seems to be uncertainty about the intended meaning of the passage, not about whether the event or proposition described is factual.
Here's an illustrative example from the Cycorp data set. (The <because> and <assumptions> elements are from the original.)
<inference id="Cycorp-014">
  <passage>The Paulsons celebrated their 25th anniversary on June 14, 2004.</passage>
  <question>Did the Paulsons get married on Flag Day?</question>
  <provenance type="edited">unknown-source</provenance>
  <answer id="1" polarity="true" force="plausible" source="world">
    <response>likely</response>
    <because>Flag Day is June 14th, but the celebration may have been observed on a different day than the actual anniversary.</because>
    <assumptions>Whether one regards this response as strictly true or plausibly true depends on whether the passage is taken as describing an actual celebration--which may not have been observed on the anniversary date--or simply as a way of saying that this anniversary did in fact occur for the couple.</assumptions>
  </answer>
</inference>
In this example, there's no uncertainty about the events -- the uncertainty is about the precise meaning intended for "celebrate". It seems likely that the author intended not merely that they had a celebration on June 14, but that June 14 was in fact their anniversary. So the hypothesis follows from the most natural reading, but it is not inescapable.
Appropriate applications of the "likely" label often involve implicatures, and it is a hallmark of implicatures that they can be explicitly cancelled. Thus, for example, we could imagine the passage sentence above being modified to end with "even though their anniversary was actually on June 13". With this cancellation, the hypothesis no longer follows.
So perhaps some useful diagnostic tests are: (1) whether the remaining uncertainty concerns the intended meaning of the passage rather than the facts of the situation it describes, and (2) whether the inference can be explicitly cancelled by adding material to the passage. If both hold, "likely" (or "unlikely") is probably the right label.
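Incidentally, for anyone working with the data files programmatically, here is a minimal Python sketch of pulling the main fields out of records in the format illustrated by the Cycorp example above. It uses only the standard xml.etree.ElementTree module and assumes a file whose root element contains a sequence of <inference> children; the actual root tag and file layout may differ, so treat this as an illustration rather than a description of the released files.

    # Sketch: read KBEval-style <inference> records with xml.etree.ElementTree.
    # Assumes the root element contains <inference> children as in the example above.
    import xml.etree.ElementTree as ET

    def read_inferences(path):
        """Yield (id, passage, question, response) tuples from an XML data file."""
        root = ET.parse(path).getroot()
        for inf in root.iter("inference"):
            answer = inf.find("answer")
            yield (
                inf.get("id"),
                inf.findtext("passage"),
                inf.findtext("question"),
                answer.findtext("response") if answer is not None else None,
            )

    # Hypothetical usage:
    # for item in read_inferences("cycorp_dev.kbe.xml"):
    #     print(item)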
Data | Annotator 1 name | Annotator 1 status | Annotator 2 name | Annotator 2 status | Differences adjudicated |
PARC Dev | Bill MacCartney | done | Dan Cer | done | done |
Stanford Dev | Marie-Catherine de Marneffe | done | Dan Cer | done | done |
MIT Dev | Marie-Catherine de Marneffe | done | Teg Grenager | done | done |
Cycorp Dev | Bill MacCartney | done | Teg Grenager | done | done |
LCC-Harabagiu Dev | Dan Ramage | done | Pi-chuan Chang | done | done |
ATM Dev (Arizona State University) | Dan Ramage | done | Anna Rafferty | done | done |
Brandeis Dev | Christopher Manning | done | Rajat Raina | done | done |
LCC-Moldovan Dev | Christopher Manning | done | Rajat Raina | done | done |
University of Texas, Dallas and ICSI | Pi-chuan Chang | done | Anna Rafferty | done | done |
Two annotators independently annotated each data set. Below we show the confusion matrix and agreement rates between these two initial annotations, for each data set and for all data sets combined. The annotators then chose a consensus label for each item on which they disagreed. Presumably, each annotator's agreement rate with the final consensus annotation would in general be higher than the numbers shown here (but we have not currently calculated these numbers).
Confusion matrix for files atm_dev.dramage.question_normalized.kbe.xml and atm_dev.rafferty.question_normalized.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 9 | 0 | 1 | 0 | 0 |
likely | 1 | 12 | 0 | 0 | 0 |
unknown | 0 | 0 | 3 | 1 | 0 |
unlikely | 0 | 0 | 0 | 7 | 0 |
no | 0 | 0 | 0 | 0 | 3 |

Confusion matrix for files brandeis_dev.manning.kbe.xml and brandeis_dev.rajatr.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 10 | 8 | 2 | 0 | 0 |
likely | 0 | 1 | 3 | 0 | 0 |
unknown | 0 | 3 | 6 | 1 | 0 |
unlikely | 0 | 0 | 0 | 3 | 0 |
no | 0 | 0 | 0 | 3 | 2 |

Confusion matrix for files cycorp_dev.wcmac.kbe.xml and cycorp_dev.teg.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 8 | 1 | 2 | 0 | 0 |
likely | 0 | 5 | 2 | 0 | 0 |
unknown | 0 | 1 | 3 | 1 | 0 |
unlikely | 0 | 0 | 3 | 1 | 0 |
no | 0 | 1 | 1 | 3 | 4 |

Confusion matrix for files lcch_dev.dramage.xml and lcch_dev.pichuan.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 13 | 0 | 0 | 0 | 0 |
likely | 2 | 2 | 0 | 0 | 0 |
unknown | 1 | 1 | 6 | 1 | 0 |
unlikely | 1 | 0 | 2 | 3 | 0 |
no | 0 | 0 | 0 | 1 | 5 |

Confusion matrix for files lccm_dev.manning.kbe.xml and lccm_dev.rajatr.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 6 | 3 | 0 | 0 | 0 |
likely | 0 | 6 | 5 | 0 | 0 |
unknown | 0 | 1 | 1 | 0 | 0 |
unlikely | 0 | 0 | 0 | 0 | 0 |
no | 0 | 0 | 1 | 2 | 5 |

Confusion matrix for files mit_dev.mcdm.kbe.xml and mit_dev.teg.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 9 | 4 | 0 | 0 | 0 |
likely | 3 | 1 | 3 | 0 | 0 |
unknown | 0 | 0 | 4 | 0 | 0 |
unlikely | 0 | 0 | 2 | 1 | 0 |
no | 0 | 0 | 0 | 1 | 2 |

Confusion matrix for files parc_dev.wcmac.kbe.xml and parc_dev.cerd.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 38 | 2 | 0 | 0 | 0 |
likely | 2 | 0 | 2 | 0 | 0 |
unknown | 0 | 5 | 14 | 0 | 0 |
unlikely | 0 | 0 | 0 | 2 | 0 |
no | 0 | 0 | 0 | 0 | 11 |

Confusion matrix for files stanford_dev.mcdm.kbe.xml and stanford_dev.cerd.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 8 | 2 | 0 | 0 | 0 |
likely | 5 | 2 | 0 | 1 | 0 |
unknown | 0 | 1 | 4 | 0 | 0 |
unlikely | 0 | 0 | 1 | 2 | 1 |
no | 0 | 0 | 0 | 0 | 3 |

Confusion matrix for files utdicsi_dev.pichuan.kbe.xml and utdicsi_dev.rafferty.kbe.xml:
 | yes | likely | unknown | unlikely | no |
yes | 7 | 5 | 0 | 0 | 0 |
likely | 0 | 2 | 0 | 0 | 0 |
unknown | 0 | 2 | 9 | 2 | 0 |
unlikely | 0 | 0 | 0 | 1 | 0 |
no | 0 | 0 | 1 | 1 | 7 |

Aggregated confusion matrix for all files:
 | yes | likely | unknown | unlikely | no |
1 yes | 108 | 25 | 5 | 0 | 0 |
2 likely | 13 | 31 | 15 | 1 | 0 |
3 unknown | 1 | 14 | 50 | 6 | 0 |
4 unlikely | 1 | 0 | 8 | 20 | 1 |
5 no | 0 | 1 | 3 | 11 | 42 |
Overall, for the 5-way classification (aligning the first and second annotators as shown in the above table), the annotators agreed on 251/356 questions = 70.51% of the time. This gives a kappa statistic of 0.61. As the confusion matrix shows, the greatest confusions are between "yes" and "likely" and between "likely" and "unknown" (but this is also the region of the scale where the majority of the data lies).
It is also useful to consider lumping together categories, in the way discussed above. If we lump categories 1&2 vs. 3&4&5, we agreed on 318/356 = 89.33% (kappa = 0.78). If we lump categories 1&2 vs. 3 vs. 4&5, we agreed on 301/356 = 84.55% (kappa = 0.74). If we lump categories 1 vs. 2&3&4 vs. 5, we agreed on 295/356 = 82.87% (kappa = 0.72). This result is at least suggestive of there being greater agreement on the "Pascal" annotation task (1&2 vs. 3&4&5) than on other natural classifications of the data.
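As a cross-check on these figures, here is a minimal Python sketch (the code and helper names are my own, not part of the original materials) that recomputes observed agreement and Cohen's kappa from the aggregated confusion matrix above, for the 5-way task and for each of the lumped groupings:

    # Recompute agreement statistics from the aggregated 5-way confusion matrix above
    # (label order: yes, likely, unknown, unlikely, no).
    MATRIX = [
        [108, 25,  5,  0,  0],   # yes
        [ 13, 31, 15,  1,  0],   # likely
        [  1, 14, 50,  6,  0],   # unknown
        [  1,  0,  8, 20,  1],   # unlikely
        [  0,  1,  3, 11, 42],   # no
    ]

    def agreement_and_kappa(matrix):
        """Observed agreement and Cohen's kappa for a square confusion matrix."""
        n = sum(sum(row) for row in matrix)
        row_totals = [sum(row) for row in matrix]
        col_totals = [sum(col) for col in zip(*matrix)]
        p_observed = sum(matrix[i][i] for i in range(len(matrix))) / n
        p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
        return p_observed, (p_observed - p_expected) / (1 - p_expected)

    def lump(matrix, groups):
        """Collapse the matrix by summing over each group of category indices."""
        return [[sum(matrix[a][b] for a in gi for b in gj) for gj in groups] for gi in groups]

    for name, groups in [
        ("5-way",            [[0], [1], [2], [3], [4]]),
        ("1&2 vs 3&4&5",     [[0, 1], [2, 3, 4]]),
        ("1&2 vs 3 vs 4&5",  [[0, 1], [2], [3, 4]]),
        ("1 vs 2&3&4 vs 5",  [[0], [1, 2, 3], [4]]),
    ]:
        p_o, k = agreement_and_kappa(lump(MATRIX, groups))
        print("%-18s agreement = %.2f%%  kappa = %.2f" % (name, 100 * p_o, k))

Run as written, this reproduces the figures reported above: 70.51%/0.61 for the 5-way task, and 89.33%/0.78, 84.55%/0.74, and 82.87%/0.72 for the three groupings.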
This was a pilot exercise: with some clarification of the guidelines, it appears very likely that annotator agreement on at least the 1&2 vs. 3&4&5 decision could be pushed above a kappa of 0.8, which in practice is often taken as the minimum threshold for good interannotator reliability in coding.
Last modified: Mon Feb 13 15:31:39 PST 2006