RTE3 Optional Pilot Task: Extending the Evaluation of Inferences from Texts

The pilot has now been run. The gold answer key for the 3-way decision task is now available on this page. Justifications have been assessed. The results of the task were presented in June at the RTE/paraphrase workshop at ACL 2007 (slides available below), and will appear in an ACL 2008 paper. The idea of 3-way evaluation was adopted for RTE4, organized as part of the Text Analysis Conference (TAC).

Original blurb

PASCAL RTE has successfully blazed a trail for evaluating the capacity of systems to automatically infer information from texts. However, it does not yet address all issues in textual entailment. At least one new area is already being addressed this year within RTE3: trialing the use of longer, passage-length texts. This optional pilot explores two other tasks closely related to textual entailment: differentiating unknown from false/contradicts, and providing justifications for answers. The pilot will piggyback on the existing RTE3 Challenge infrastructure and evaluation process by using the same test set, but with a later submission deadline for answers than the primary task.

The goal of making a three-way decision of "YES" (entails), "NO" (contradicts), and "UNKNOWN" is to drive systems to make more precise informational distinctions. A hypothesis being unknown on the basis of a text should be distinguished from a hypothesis being shown false/contradicted by a text. The goal for providing justifications for decisions is to explore how eventual users of tools incorporating entailment can be made to understand how decisions were reached by a system. Users are unlikely to use or trust a system that gives no explanation for its decisions.

The pilot task seeks participation from all interested parties. We hope that it will be of interest to many PASCAL RTE participants and that it can help inform the design of the main task for future RTE Challenges. The US National Institute of Standards and Technology (NIST) will perform the evaluation, using human assessors for the inference task.

Data/Downloads/Pilot Results

Available in advance of the pilot

Available after the pilot

Contact information for organizers

This pilot was organized by Christopher Manning <manning@cs.stanford.edu>, Dan Moldovan <moldovan@languagecomputer.com>, and Ellen Voorhees <ellen.voorhees@nist.gov>, with input from the PASCAL RTE Organizers. Please direct any questions to the pilot organizers, putting "[RTE3]" in the subject line.

Optional Pilot Task Description

Amendments

There has been a question about whether submissions to the RTE justification task must be created completely automatically. The answer is no, with the following caveats:

  1. A justification submission must contain a justification for all 800 test set pairs.
  2. If any non-automatic processing is done, the run is designated a manual run and must be declared as such in the comments fields when the run is submitted.
  3. The manual processing must be "constrained". That is, it should be obvious how the processing could be automated: the assumption is that the run is not fully automated only because of the time needed to implement the automation, not because it is unknown how the automation could be implemented.
  4. Completely automatic runs are encouraged so we can get a baseline reading of what systems can really do right now.

The RTE guidelines will be updated to reflect this decision. Note that the entailment decisions (YES, NO, UNKNOWN) are to be made completely automatically.

Instructions for Submitting Results

There are two types of submissions to the RTE-3 extended task: submissions that only tag T-H pairs as YES, NO, or UNKNOWN; and submissions that both tag T-H pairs and provide justifications for each response. In either case, the submission must cover the whole dataset. In particular, if a justification is given for any response, a justification must be given for all responses. (Note, however, that since a justification is defined as an arbitrary collection of ASCII strings, the empty justification is a valid justification.)

Each submission must be contained in a single file. The name of that file will be treated as the name of the submission, referred to as the "run tag". Run tags must contain only alphanumeric characters and be no longer than 12 characters. If you submit multiple runs, make sure they have different tags! Please include an identifier for your group as part of the run tag, both to help ensure all run tags are unique and to make it easier to associate run results with groups.
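
For example, the run-tag constraint amounts to a simple pattern match. A minimal Python check (the tag shown is a made-up example, not a required naming scheme):

  import re

  run_tag = "acmerun1"  # hypothetical tag: group identifier plus a run label

  # Run tags must be alphanumeric and at most 12 characters long.
  if not re.fullmatch(r"[A-Za-z0-9]{1,12}", run_tag):
      raise ValueError("invalid run tag: " + run_tag)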

The first line of a submission file must contain the literal "Justifications" followed by white space (not including any line break characters) followed by either "YES" or "NO". This line indicates whether the submission contains justifications (YES) or does not contain justifications (NO).

If the submission does not contain justifications, the remainder of the file must consist of lines of the form

<pair-id> <judgment>
where <pair-id> is the unique identifier of each T-H pair as it appears in the test set and <judgment> is exactly one of "YES", "NO", or "UNKNOWN". Each response must appear on its own line. There must be exactly one response for each pair in the test set (1-800). The order of the pairs in the submission file is arbitrary and will be ignored. Thus, a sample no-justifications submission might look like this:
Justifications NO
1 NO
2 UNKNOWN
3 YES
4 YES
...
800 UNKNOWN
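
As an illustration, a no-justifications submission can be generated mechanically. The Python sketch below is a hypothetical example (the function name, the run tag "acmerun1", and the all-UNKNOWN judgments are made up), not an official tool:

  VALID_JUDGMENTS = {"YES", "NO", "UNKNOWN"}

  def write_no_justifications_run(path, judgments):
      # judgments: dict mapping each pair id (1-800) to "YES", "NO", or "UNKNOWN"
      assert set(judgments) == set(range(1, 801)), "must cover all 800 pairs"
      assert all(j in VALID_JUDGMENTS for j in judgments.values()), "bad judgment"
      with open(path, "w") as out:
          out.write("Justifications NO\n")
          for pair_id in sorted(judgments):
              out.write(f"{pair_id} {judgments[pair_id]}\n")

  # Hypothetical usage: a baseline run that tags every pair UNKNOWN.
  write_no_justifications_run("acmerun1", {i: "UNKNOWN" for i in range(1, 801)})
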
A submission that contains justifications contains all the lines of a non-justification submission and in addition contains exactly one justification for each pair. A justification begins with an opening justification tag on its own line:
<just pair-id>
where the angle brackets and "just" are literals and the pair-id is again the unique identifier for the pair being justified. A justification ends with a closing tag on its own line:
</just>
In between the opening and closing tags may be any number of lines, each containing arbitrary ASCII characters, except that a justification cannot contain "</just>" on a line by itself (because that will be treated as the closing tag). The order in which the justifications are given is arbitrary and will be ignored. There must be exactly one justification given for each pair. Thus, a sample submission that contains justifications might look like this:
Justifications YES
1 NO
<just 1>
foo contradicts bar
</just>
2 UNKNOWN
<just 2>
Both
foo and bar
are possible
</just> 
...
800 UNKNOWN
<just 800>
</just> 
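
To make the format concrete, here is a rough Python sketch of a structural check in the spirit of the rules above. It is not the official checker (see check_rte.pl below), and it assumes a well-formed file, raising an error otherwise:

  import re

  VALID_JUDGMENTS = {"YES", "NO", "UNKNOWN"}

  def parse_submission(path):
      # Returns (judgments, justifications); raises on structural errors.
      judgments, justifications = {}, {}
      with open(path) as f:
          lines = f.read().split("\n")
      header = re.match(r"Justifications\s+(YES|NO)\s*$", lines[0])
      assert header, "first line must be 'Justifications YES' or 'Justifications NO'"
      has_justifications = header.group(1) == "YES"
      i = 1
      while i < len(lines):
          line = lines[i].strip()
          if not line:
              i += 1
              continue
          opening = re.match(r"<just\s+(\d+)>$", line)
          if opening:
              # Justification block: keep lines verbatim until the closing tag.
              pair_id, body = int(opening.group(1)), []
              i += 1
              while lines[i].strip() != "</just>":
                  body.append(lines[i])  # white space inside is preserved
                  i += 1
              justifications[pair_id] = "\n".join(body)
          else:
              # Judgment line: "<pair-id> <judgment>"; extra white space is ignored.
              pair_id, judgment = line.split()
              assert judgment in VALID_JUDGMENTS, "bad judgment: " + judgment
              judgments[int(pair_id)] = judgment
          i += 1
      assert set(judgments) == set(range(1, 801)), "need one judgment per pair"
      if has_justifications:
          assert set(justifications) == set(range(1, 801)), \
              "need one justification per pair"
      return judgments, justifications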

Line break characters are significant. Where other white space is expected, at least one white space character must appear; any additional white space will be ignored outside of justifications. White space inside justifications will be preserved.

Submissions will be accepted through a password-protected automatic submission system at NIST. The page

http://www.itl.nist.gov/iaui/894.02/robust.te/robust.te.html

contains a link to a script that can be used to check the syntactic correctness of your submission file and a link to the submission page. To participate in the extended task, you MUST email Ellen Voorhees <ellen.voorhees@nist.gov> in advance to register and get the password for the page. Please put "[RTE3]" in your subject line. The registration message should include the name of your organization (or yourself as an individual) as you would like it to appear in the results, and the name and email address of the contact person. The deadline for submissions is 11:59pm PDT April 30, 2007. Please register for the task well before the deadline; in particular, do not expect to receive the password to the site on the evening of April 30.

The submission page asks for the email address of the person making the submission, the organization name, the run tag (see above), the submission file, and a free-text description of the submission. The file will be checked for syntactic correctness using the perl script "check_rte.pl". This script is linked from the page given above so that you can test your file before you submit it. Testing is strongly recommended, since submissions containing syntax errors will be automatically rejected by the system. An acknowledgment will be mailed to the supplied email address and will indicate whether there were any problems with the submission. Your run has not been officially submitted until you receive a SUCCESS acknowledgment.

Please understand that the April 30 deadline is a firm deadline because the time to prepare the data for the human assessors is very short. All submissions must be made through the automated submission system by the deadline.

Schedule