Memorandum
To: AQUAINT KB evaluation interest group
cc:
Subject: April 20 meeting at PARC
Thanks to all the groups who participated in this meeting. It was
very constructive and fruitful. Each team came to the meeting with
a proposed set of questions, as outlined in our Palm Springs discussion.
We emphasized again that what we are trying to accomplish is to come
up with a prototype evaluation to better understand how a full evaluation
should be conducted.
We started by reviewing the proposals in no particular order to generate
discussions. Here are some highlights:
- MIT used the CNS domain to generate a mix of linguistic and world knowledge
questions. This raised the interesting problem of situations
where language alone might generate an "unknown" answer, but world knowledge
might answer the question. Should we allow several possible answer
annotations?
- PARC focused on linguistic phenomena. One discussion focused
on the frequency of these phenomena in corpora. The discussion highlighted
the difference between "lab" data good for scientific investigations and
"field" data good for production. The consensus was to leave both in
this prototype.
- ICSI focused on event structure in terrorism data and metaphor in international
economics. It was a mix of "lab" and "field" data. The discussion
about the need for a metaphor label did not result in a decision to add it,
as the inferences required are more complex.
- UIUC proposed "field" data extending PASCAL.
- ISI emphasized the need to prevent bias in the final evaluation. Randomization
was one proposed method of generating the questions. The concern about bias
in the prototype will be alleviated by the fact that all systems will try
all the proposed questions.
- Stanford proposed a mix of first sentences from newswire and hypotheses
generated from Google news.
- Brandeis proposed examples from TimeBank, focusing on the S-links used
to understand events (note that discussions at Dagstuhl may result in changes
to S-links). Many linguistic phenomena were covered.
- ASU proposed questions involving deep inferences and the use of world knowledge.
Since many instances involve default assumptions, a new label for defaults
was proposed.
- LCC Ferret team proposed some analyst questions generated during the
2004 workshop. Discussion looked at true answers that may be irrelevant
with respect to the original question: would we learn anything from this
exercise? A trick might be to negate the questions and see whether the answer
would still be true.
- LCC Power Answer team focused on possible worlds and contexts as they
appear in the text. The team proposed adding context type and confidence
labels. They also had a draft of annotation guidelines.
- Cycorp had a large number of world knowledge "field" questions based
on "lab" examples.
The consensus was that we now have a very nice collection of test data covering
both linguistic and world knowledge phenomena, involving simple and complex
inferences, and spanning both "field" and "lab" data.
The rest of the day was spent making decisions about the form of the challenge
and what our next steps might be. The details are presented in a set
of transparencies written by PARC. The main points are that PASCAL's
entailment pairs will systematically be replaced by question/answer pairs,
and that mandatory and optional system outputs, as well as mandatory and
optional annotations, were defined.
Annotation guidelines will be generated by May 20, and a unified XML format
will be defined by May 25.
We left discussions of evaluation and scoring for the breakout session that
will take place at the June PI meeting.
The question of additional funding was raised. Teams were encouraged
to express their needs in writing to John Prange. At a meeting of the
executive committee on 4/22, John reiterated that he will soon provide the
needed funds.
Thank you again to all the participants for a very productive meeting, and
thanks to PARC for hosting the meeting.
JLPomarede
Last Revised: May 5, 2005