RTE3 Optional Pilot Task: Extending the Evaluation of Inferences from Texts

The pilot has now been run. The gold answer key for the 3-way decision task is now available on this page. Justifications have been assessed. The results of the task were presented in June at the RTE/paraphrase workshop at ACL 2007 (slides available below), and will appear in an ACL 2008 paper. The idea of 3-way evaluation was adopted for RTE4, organized as part of the Text Analysis Conference (TAC).

Original blurb

PASCAL RTE has successfully blazed a trail for evaluating the capacity of systems to automatically infer information from texts. However, it does not yet address all issues in textual entailment. At least one new area is already being addressed this year within RTE3: trialing the use of longer, passage-length texts. This optional pilot explores two other tasks closely related to textual entailment: differentiating unknown from false/contradicts, and providing justifications for answers. The pilot will piggyback on the existing RTE3 Challenge infrastructure and evaluation process by using the same test set, but with a later submission deadline for answers than the primary task.

The goal of making a three-way decision of "YES" (entails), "NO" (contradicts), and "UNKNOWN" is to drive systems to make more precise informational distinctions. A hypothesis being unknown on the basis of a text should be distinguished from a hypothesis being shown false/contradicted by a text. The goal for providing justifications for decisions is to explore how eventual users of tools incorporating entailment can be made to understand how decisions were reached by a system. Users are unlikely to use or trust a system that gives no explanation for its decisions.

The pilot task seeks participation from all interested parties. We hope that it will be of interest to many PASCAL RTE participants and that it can help inform the design of the main task for future RTE Challenges. The US National Institute of Standards and Technology (NIST) will perform the evaluation, using human assessors for the inference task.

Data/Downloads/Pilot Results

Available in advance of the pilot

Available after the pilot

Contact information for organizers

This pilot was organized by Christopher Manning <manning@cs.stanford.edu>, Dan Moldovan <moldovan@languagecomputer.com>, and Ellen Voorhees <ellen.voorhees@nist.gov>, with input from the PASCAL RTE Organizers. Please direct any questions to the pilot organizers, putting "[RTE3]" in the subject line.

Optional Pilot Task Description

Amendments

There has been a question about whether submissions to the RTE justification task must be created completely automatically. The answer is no, with the following caveats:

  1. A justification submission must contain a justification for all 800 test set pairs.
  2. If any non-automatic processing is done, the run is designated a manual run and must be declared as such in the comments fields when the run is submitted.
  3. The manual processing must be "constrained". That is, it should be obvious how the processing could be automated: the assumption is that the run is not fully automated only because of the time needed to implement the automation, not because it is unknown how the automation could be implemented.
  4. Completely automatic runs are encouraged so we can get a baseline reading of what systems can really do right now.

The RTE guidelines will be updated to reflect this decision. Note that the entailment decisions (YES, NO, UNKNOWN) are to be made completely automatically.

Instructions for Submitting Results

There are two types of submissions to the RTE-3 extended task: submissions that only tag T-H pairs as YES, NO, or UNKNOWN; and submissions that both tag T-H pairs and provide justifications for each response. In either case, the submission must cover the whole dataset. In particular, if a justification is given for any response, a justification must be given for all responses. (Note, however, that since a justification is defined as an arbitrary collection of ASCII strings, the empty justification is a valid justification.)

Each submission must be contained in a single file. The name of that file will be treated as the name of the submission, referred to as the "run tag". Run tags must contain only alphanumeric characters and be no longer than 12 characters. If you submit multiple runs, make sure they have different tags! Please include an identifier for your group as part of the run tag, both to help ensure all run tags are unique and to make it easier to associate run results with groups.
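
For example, the run-tag constraint amounts to a simple pattern match. A minimal Python check (the tag shown is a made-up example, not a required naming scheme):

  import re

  run_tag = "acmerun1"  # hypothetical tag: group identifier plus a run label

  # Run tags must be alphanumeric and at most 12 characters long.
  if not re.fullmatch(r"[A-Za-z0-9]{1,12}", run_tag):
      raise ValueError("invalid run tag: " + run_tag)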

The first line of a submission file must contain the literal "Justifications" followed by white space (not including any line break characters) followed by either "YES" or "NO". This line indicates whether the submission contains justifications (YES) or does not contain justifications (NO).

If the submission does not contain justifications, the remainder of the file must consist of lines of the form

<pair-id> <judgment>
where <pair-id> is the unique identifier of each T-H pair as it appears in the test set and <judgment> is exactly one of "YES", "NO", or "UNKNOWN". Each response must appear on its own line. There must be exactly one response for each pair in the test set (1-800). The order of the pairs in the submission file is arbitrary and will be ignored. Thus, a sample no-justifications submission might look like this:
Justifications NO
1 NO
2 UNKNOWN
3 YES
4 YES
...
800 UNKNOWN
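
As an illustration, a no-justifications submission can be generated mechanically. The Python sketch below is a hypothetical example (the function name, the run tag "acmerun1", and the all-UNKNOWN judgments are made up), not an official tool:

  VALID_JUDGMENTS = {"YES", "NO", "UNKNOWN"}

  def write_no_justifications_run(path, judgments):
      # judgments: dict mapping each pair id (1-800) to "YES", "NO", or "UNKNOWN"
      assert set(judgments) == set(range(1, 801)), "must cover all 800 pairs"
      assert all(j in VALID_JUDGMENTS for j in judgments.values()), "bad judgment"
      with open(path, "w") as out:
          out.write("Justifications NO\n")
          for pair_id in sorted(judgments):
              out.write(f"{pair_id} {judgments[pair_id]}\n")

  # Hypothetical usage: a baseline run that tags every pair UNKNOWN.
  write_no_justifications_run("acmerun1", {i: "UNKNOWN" for i in range(1, 801)})
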
A submission that contains justifications contains all the lines of a non-justification submission and in addition contains exactly one justification for each pair. A justification begins with an opening justification tag on its own line:
<just pair-id>
where the angle brackets and "just" are literals and the pair-id is again the unique identifier for the pair being justified. A justification ends with a closing tag on its own line:
</just>
In between the opening and closing tags may be any number of lines, each containing arbitrary ASCII characters, except that a justification cannot contain "</just>" on a line by itself (because that will be treated as the closing tag). The order in which the justifications are given is arbitrary and will be ignored. There must be exactly one justification given for each pair. Thus, a sample submission that contains justifications might look like this:
Justifications YES
1 NO
<just 1>
foo contradicts bar
</just>
2 UNKNOWN
<just 2>
Both
foo and bar
are possible
</just> 
...
800 UNKNOWN
<just 800>
</just> 
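
To make the format concrete, here is a rough Python sketch of a structural check in the spirit of the rules above. It is not the official checker (see check_rte.pl below), and it assumes a well-formed file, raising an error otherwise:

  import re

  VALID_JUDGMENTS = {"YES", "NO", "UNKNOWN"}

  def parse_submission(path):
      # Returns (judgments, justifications); raises on structural errors.
      judgments, justifications = {}, {}
      with open(path) as f:
          lines = f.read().split("\n")
      header = re.match(r"Justifications\s+(YES|NO)\s*$", lines[0])
      assert header, "first line must be 'Justifications YES' or 'Justifications NO'"
      has_justifications = header.group(1) == "YES"
      i = 1
      while i < len(lines):
          line = lines[i].strip()
          if not line:
              i += 1
              continue
          opening = re.match(r"<just\s+(\d+)>$", line)
          if opening:
              # Justification block: keep lines verbatim until the closing tag.
              pair_id, body = int(opening.group(1)), []
              i += 1
              while lines[i].strip() != "</just>":
                  body.append(lines[i])  # white space inside is preserved
                  i += 1
              justifications[pair_id] = "\n".join(body)
          else:
              # Judgment line: "<pair-id> <judgment>"; extra white space is ignored.
              pair_id, judgment = line.split()
              assert judgment in VALID_JUDGMENTS, "bad judgment: " + judgment
              judgments[int(pair_id)] = judgment
          i += 1
      assert set(judgments) == set(range(1, 801)), "need one judgment per pair"
      if has_justifications:
          assert set(justifications) == set(range(1, 801)), \
              "need one justification per pair"
      return judgments, justifications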

Line break characters are significant. Where other white space is expected, at least one white space character must appear; any additional white space will be ignored outside of justifications. White space inside justifications will be preserved.

Submissions will be accepted through a password-protected automatic submission system at NIST. The page

http://www.itl.nist.gov/iaui/894.02/robust.te/robust.te.html

contains a link to a script that can be used to check the syntactic correctness of your submission file and a link to the submission page. To participate in the extended task, you MUST email Ellen Voorhees <ellen.voorhees@nist.gov> in advance to register and get the password for the page. Please put "[RTE3]" in your subject line. The registration message should include the name of your organization (or yourself as an individual) as you would like it to appear in the results, and the name and email address of the contact person. The deadline for submissions is 11:59pm PDT April 30, 2007. Please register for the task well before the deadline; in particular, do not expect to receive the password to the site on the evening of April 30.

The submission page asks for the email address of the person making the submission, the organization name, the run tag (see above), the submission file, and a free-text description of the submission. The file will be checked for syntactic correctness using the perl script "check_rte.pl". This script is linked from the page given above so that you can test your file before you submit it. Testing is strongly recommended, since submissions containing syntax errors will be automatically rejected by the system. An acknowledgment will be mailed to the supplied email address and will indicate whether there were any problems with the submission. Your run has not been officially submitted until you receive a SUCCESS acknowledgment.

Please understand that the April 30 deadline is a firm deadline because the time to prepare the data for the human assessors is very short. All submissions must be made through the automated submission system by the deadline.

Schedule