Memorandum
To: AQUAINT KB evaluation interest group
cc:
Subject: April 20 meeting at PARC
Thanks to all the groups who participated in this meeting. It was
very constructive and fruitful. Each team came to the meeting with
a proposed set of questions, as outlined in our Palm Springs discussion.
We emphasized again that what we are trying to accomplish is to come
up with a prototype evaluation to better understand how a full evaluation
should be conducted.
We started by reviewing the proposals in no particular order to generate
discussions. Here are some highlights:
- MIT used the CNS domain to generate a mix of linguistic and world knowledge
questions. This raised the interesting problem of situations
where language alone might generate an "unknown" answer, but world knowledge
might answer the question. Should we allow several possible answer
annotations?
- PARC focused on linguistic phenomena. One discussion focused
on the frequency of these phenomena in corpora. The discussion highlighted
the difference between "lab" data good for scientific investigations and
"field" data good for production. The consensus was to leave both in
this prototype.
- ICSI focused on event structure in terrorism data and metaphor in international
economics. It was a mix of "lab" and "field" data. The discussion
about the need for a metaphor label did not result in a decision to add it,
as the inferences required are more complex.
- UIUC proposed "field" data extending PASCAL.
- ISI emphasized the need to prevent bias in the final evaluation. Randomization
was one proposed method of generating the questions. The concern about bias
in the prototype will be alleviated by the fact that all systems will try
all the proposed questions.
- Stanford proposed a mix of first sentences from newswire and hypotheses
generated from Google news.
- Brandeis proposed examples from TimeBank, focusing on the S-links used
to understand events (note that discussions at Dagstuhl may result in changes
to S-links). Many linguistic phenomena were covered.
- ASU proposed questions involving deep inferences and the use of world knowledge.
Since many instances involve default assumptions, a new label for defaults
was proposed.
- LCC Ferret team proposed some analyst questions generated during the
2004 workshop. Discussion looked at true answers that may be irrelevant
with respect to the original question: would we learn anything from this
exercise? A trick might be to negate the questions and see whether the answer
would still be true.
- LCC Power Answer team focused on possible worlds and contexts as they
appear in the text. The team proposed adding context type and confidence
labels. They also had a draft of annotation guidelines.
- Cycorp had a large number of world knowledge "field" questions based
on "lab" examples.
The consensus was that we now have a very nice collection of test data covering
both linguistic and world knowledge phenomena, involving simple and complex
inferences, and spanning both "field" and "lab" data.
The rest of the day was spent making decisions about the form of the challenge
and what our next steps might be. The details are presented in a set
of transparencies written by PARC. The main points are that PASCAL's
entailment pairs will systematically be replaced by question/answer pairs,
and that mandatory and optional system outputs, as well as mandatory and
optional annotations, were defined.
Annotation guidelines will be generated by May 20, and a unified XML format
will be defined by May 25.
We left discussions of evaluation and scoring for the breakout session that
will take place at the June PI meeting.
The question of additional funding was raised. Teams were encouraged
to express their needs in writing to John Prange. At a meeting of the
executive committee on 4/22, John reiterated that he will soon provide the
needed funds.
Thank you again to all the participants for a very productive meeting, and
thanks to PARC for hosting the meeting.
JLPomarede
Last Revised: May 5, 2005