PROJECT REPORT

1. The Objective

This project attempts to extract a hyponym hierarchy from a given corpus.
Given a corpus, the program tries to find hyponym/hypernym, also called isa,
relationships between nouns in the corpus.
 

2. The Method

The algorithm employed here proceeds in two stages: syntactical extraction and
statistical expansion.

In the syntactical extraction stage, isa relationships are discovered based on
syntactic clues.

As an example, the pattern
 NP0 such as {NP1, NP2 ..., (and | or)} NPn
as exemplified by the phrase
 automobiles such as cars, trucks, and vans
gives a strong indication of isa relationships, and it does so in a fairly
domain-independent manner. So for the pattern given above, we can infer
that hyponym(NPi, NP0) for all 1 <= i <= n, and expect the result to be
reasonable in general.
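As a rough illustration, the pattern can be sketched as a regular expression over flat text. The actual system matches parse trees via tgrep; restricting NPs to single words and the particular regex below are simplifying assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of the "NP0 such as NP1, ..., (and|or) NPn" pattern on
// flat text with single-word NPs. The real system matches parse trees.
public class HearstPatternDemo {
    private static final Pattern SUCH_AS = Pattern.compile(
        "(\\w+) such as (\\w+(?:, (?!and\\b|or\\b)\\w+)*(?:,? (?:and|or) \\w+)?)");

    // Returns (hyponym, hypernym) pairs inferred from each match.
    public static List<String[]> extract(String text) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            String hypernym = m.group(1);
            // Split the coordinated list into its member nouns.
            for (String np : m.group(2).split(",? (?:and|or) |, "))
                pairs.add(new String[] { np.trim(), hypernym });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Yields hyponym(cars, automobiles), hyponym(trucks, automobiles),
        // hyponym(vans, automobiles).
        for (String[] p : extract("automobiles such as cars, trucks, and vans"))
            System.out.println("hyponym(" + p[0] + ", " + p[1] + ")");
    }
}
```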

Such syntactic patterns can be obtained manually or automatically. One can
postulate patterns from one's own linguistic knowledge, or programs can be
employed to learn them from sample isa relationships.

Given a set of syntactic patterns, the algorithm extracts isa relationships
from the corpus as indicated by those patterns and builds the hyponym relation.
The hyponym relation can be thought of as encoding an isa hierarchy, in that
the hierarchy can be recovered by taking the reflexive and transitive closure
of the hyponym relation.
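The closure computation can be sketched as follows; the map-based representation of the relation is an illustrative assumption:

```java
import java.util.*;

// Sketch: recover the isa hierarchy from direct hyponym pairs by taking the
// reflexive and transitive closure of the relation.
public class HyponymClosure {
    // direct.get(h) = set of direct hyponyms of h.
    public static Map<String, Set<String>> closure(Map<String, Set<String>> direct) {
        Map<String, Set<String>> closed = new HashMap<>();
        Set<String> all = new HashSet<>(direct.keySet());
        for (Set<String> v : direct.values()) all.addAll(v);
        for (String h : all) {
            // Depth-first reachability from h over direct hyponym edges.
            Set<String> reach = new LinkedHashSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(h);
            while (!stack.isEmpty()) {
                String cur = stack.pop();
                if (reach.add(cur))              // includes h itself (reflexive)
                    for (String kid : direct.getOrDefault(cur, Set.of()))
                        stack.push(kid);
            }
            closed.put(h, reach);
        }
        return closed;
    }
}
```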

In the statistical expansion stage, the algorithm seeks to enrich the hyponym
relation built in the first stage through statistical methods.

For each hypernym discovered in stage 1, its hyponyms (according to the hyponym
relation) are collected and considered seed words. The algorithm then adds to
the set of seed words those concepts from the corpus that are most closely
related to the seed words. Entries in the resulting (enriched) seed word set
are potential hyponyms for the given hypernym. That is, the hyponym relation
can then be enriched by adding the newly obtained isa relationships to it.

The concepts contained in a corpus are nouns selected according to some
criterion. The relatedness of nouns in a corpus is based on the
co-occurrences of nouns in the corpus.
 

3. The Implementation

The implementation works on a parsed corpus. A parser is required for unparsed
corpora.

o Syntactical Extraction
The implementation first uses tgrep to extract fragments of the corpus that
match a syntactic pattern. A sample tgrep command is
tgrep -a 'NP < (`NP $.. (PP < (JJ < such $.. (IN < as $.. `/^NP/))))'.
The fragments are output to a file for future use.

Then, a Java module reads in the fragments, converts them into an internal
representation, extracts isa relationships, and builds the hyponym relation.

o Statistical Expansion
During this stage, the implementation first uses tgrep to output the relevant
portions of the corpus to a file. The tgrep command used is
tgrep -a 'NP < (`/^NN/ $. (/CC/|/,/ $.. `/^NN/))'.

This file is read in by a Java module and a co-occurrence bigram between nouns
is generated from it. This bigram provides the basis for determining which
nouns in the corpus are to be added to the seed word set.

To decide the relatedness of a noun n to the seed words, a scoring function is
defined as
S(n) = n's co-occurrences with seed words / n's co-occurrences with any word,
where both counts are computed from the co-occurrence bigram.
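A minimal sketch of this scoring function, assuming a symmetric co-occurrence table built up one observed noun pair at a time (the class and method names are illustrative, not the implementation's):

```java
import java.util.*;

// Sketch of S(n): the fraction of n's co-occurrences that are with seed
// words, computed from a symmetric co-occurrence "bigram" table.
public class RelatednessScore {
    private final Map<String, Map<String, Integer>> cooc = new HashMap<>();

    // Record one observed co-occurrence of nouns a and b (both directions).
    public void observe(String a, String b) {
        cooc.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
        cooc.computeIfAbsent(b, k -> new HashMap<>()).merge(a, 1, Integer::sum);
    }

    // S(n) = co-occurrences with seed words / co-occurrences with any word.
    public double score(String n, Set<String> seeds) {
        int total = 0, withSeeds = 0;
        for (Map.Entry<String, Integer> e :
                 cooc.getOrDefault(n, Map.of()).entrySet()) {
            total += e.getValue();
            if (seeds.contains(e.getKey())) withSeeds += e.getValue();
        }
        return total == 0 ? 0.0 : (double) withSeeds / total;
    }
}
```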

For each hypernym according to the hyponym relation computed in the syntactical
extraction stage, its hyponyms according to the hyponym relation are collected
into a seed word set. The algorithm then proceeds as follows:
 (1) Each noun mentioned in the co-occurrence bigram is scored with the
        scoring function given above.
 (2) The highest score of all non-seed words is determined, and all
        nouns with that score are added to the seed word set. Then return
        to step (1) and repeat. This iteration continues a number of times.
 (3) Any noun that was not selected during the preceding iteration is
        discarded. The seed word set is restored to its original members.
 (4) Each remaining noun is given a score by the scoring function.
 (5) The highest score of all non-seed words is determined, and all
        nouns with that score are added to the seed word set. Then return
        to step (4) and repeat. This iteration continues the same number of
        times as the iteration in step (2).
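The two-pass iteration above can be sketched as follows; the data structures, round count, and candidate sets are illustrative assumptions, with scores following S(n) as defined earlier:

```java
import java.util.*;

// Sketch of the two-pass expansion loop, steps (1)-(5).
public class SeedExpansion {

    // S(n): fraction of n's co-occurrences that are with current seed words.
    static double score(String n, Set<String> seeds,
                        Map<String, Map<String, Integer>> cooc) {
        int total = 0, withSeeds = 0;
        for (Map.Entry<String, Integer> e :
                 cooc.getOrDefault(n, Map.of()).entrySet()) {
            total += e.getValue();
            if (seeds.contains(e.getKey())) withSeeds += e.getValue();
        }
        return total == 0 ? 0.0 : (double) withSeeds / total;
    }

    // Steps (1)-(2) / (4)-(5): each round, score every non-seed candidate,
    // then add all candidates attaining the highest score.
    static Set<String> expand(Set<String> seeds, Set<String> candidates,
                              Map<String, Map<String, Integer>> cooc, int rounds) {
        Set<String> grown = new LinkedHashSet<>(seeds);
        for (int i = 0; i < rounds; i++) {
            Map<String, Double> s = new HashMap<>();
            double best = -1;
            for (String n : candidates) {
                if (grown.contains(n)) continue;
                double v = score(n, grown, cooc);
                s.put(n, v);
                best = Math.max(best, v);
            }
            if (best <= 0) break;               // nothing related remains
            for (Map.Entry<String, Double> e : s.entrySet())
                if (e.getValue() == best) grown.add(e.getKey());
        }
        return grown;
    }

    // Full procedure: a first pass over all nouns; step (3) then restricts
    // candidates to nouns selected in that pass, and steps (4)-(5) rerun the
    // same expansion starting again from the original seeds.
    public static Set<String> run(Set<String> seeds,
                                  Map<String, Map<String, Integer>> cooc, int rounds) {
        Set<String> firstPass = expand(seeds, cooc.keySet(), cooc, rounds);
        return expand(seeds, firstPass, cooc, rounds);
    }
}
```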

When this process terminates, the resulting (enriched) seed word set contains
an expanded collection of hyponyms for the hypernym currently under
consideration.
 

4. The Evaluation

Each hyponym(h1, h2) claim generated by the implementation is passed to a
verification module, which uses WordNet as the basis of verification.

For the given pair h1 and h2, there are three potential outcomes of
verification.

 Verify  - h1 is a hyponym of h2 according to the database.
 Reject  - h1 is not a hyponym of h2 according to the database.
 Augment - h1 (or h2) does not appear in the database, so the claim is a
           potential augmentation of the database.


Each local fragment of the hyponym hierarchy generated by the implementation,
that is, the set of hyponym(h1, h2) claims with a fixed h2, is then evaluated
by computing the percentages of verified, rejected, and augmented instances.
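The per-fragment tally can be sketched as follows (the class shape is an illustrative assumption; the outcome labels are those used in the sample results):

```java
import java.util.*;

// Sketch: tally verification outcomes for one hypernym's fragment and report
// the percentage of Verify/Reject/Augment instances.
public class FragmentEvaluator {
    public static Map<String, Double> percentages(List<String> outcomes) {
        Map<String, Double> pct = new LinkedHashMap<>();
        for (String o : List.of("Verify", "Reject", "Augment")) {
            long n = outcomes.stream().filter(o::equals).count();
            pct.put(o, 100.0 * n / outcomes.size());
        }
        return pct;
    }
}
```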
 

5. Sample Results

Sample results were obtained by running the system on the Penn Treebank parsed
Brown and Wall Street Journal corpora. In general, the results vary.

There are some successful cases. For example,

Category: country
Portugal Verify
Jamaica  Verify
Peru  Verify
Barbados Verify
country  Verify
Greece  Verify
Korea  Reject
Spain  Verify
Taiwan  Reject
Singapore Verify
Italy  Verify
Philippines Verify
Sardinia Reject
Turkey  Verify
France  Verify
Albertville Augment
Pakistan Verify
Belgium  Verify

77.77777777777777% Verified;
16.666666666666668% Rejected;
5.555555555555555% Augmented.

In this case, 'country' is the hypernym and the inferred hyponyms are listed
afterward, each with the outcome of its verification. The result is basically
good, though the fact that Korea is rejected as a country indicates the
incompleteness of WordNet.

This incompleteness is more evident in the following cases:

Category: nation
nation  Verify
Germany  Reject
Scandinavia Reject
Italy  Reject
Sardinia Reject
France  Reject
Turkey  Reject

14.285714285714286% Verified;
85.71428571428571% Rejected;
0.0% Augmented.

Category: ailment
rheumatism Reject
diabetes Reject
arthritis Reject
ailment  Verify
cancer  Reject

20.0% Verified;
80.0% Rejected;
0.0% Augmented.

There are cases, however, in which the system performs inadequately. For
example:

Category: crime
violence Reject
homicide Reject
crime  Verify
God  Reject
animosity Reject
sport  Reject
sex  Reject
pregnancy Reject
recreation Reject
portion  Reject
club  Reject
event  Reject
race  Reject
religion Reject

7.142857142857143% Verified;
92.85714285714286% Rejected;
0.0% Augmented.
 

6. Improvements Made

In the initial implementation of the system, whole noun phrases were used to
represent concepts. This approach suffers from data sparseness: a whole phrase
is much less likely to appear in a corpus than a single noun. As a result, only
the head noun of a noun phrase is now used to represent a concept, which
improves the performance of the system.

Another improvement is to canonicalize word forms: plural nouns are converted
into singulars before being processed, so that 'cars' and 'car' are the same
noun in the eyes of the system. This also improves the performance of the
system.
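A naive canonicalizer along these lines might look as follows; the suffix rules handle only regular plurals and are an illustrative assumption, not the system's actual morphology code:

```java
// Sketch of a naive plural-to-singular canonicalizer, so that 'cars' and
// 'car' map to the same noun. Real morphology has many exceptions; only
// regular suffix patterns are handled here.
public class Singularize {
    public static String singular(String noun) {
        String n = noun.toLowerCase();
        if (n.endsWith("ies") && n.length() > 3)
            return n.substring(0, n.length() - 3) + "y";    // countries -> country
        if (n.endsWith("ses") || n.endsWith("xes")
                || n.endsWith("ches") || n.endsWith("shes"))
            return n.substring(0, n.length() - 2);          // boxes -> box
        if (n.endsWith("s") && !n.endsWith("ss"))
            return n.substring(0, n.length() - 1);          // cars -> car
        return n;                                           // glass -> glass
    }
}
```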
 

7. Related Work

The project is based on two papers: 'Automatic Acquisition of Hyponyms from
Large Text Corpora' by Marti Hearst and 'Noun-phrase co-occurrence statistics
for semi-automatic semantic lexicon construction' by Brian Roark and Eugene
Charniak.

Stage 1 of the system utilizes ideas in Hearst's paper and stage 2 uses those
in Roark and Charniak's.

Hearst's technique is similar to the pattern-based interpretation used in MRD
processing. For example, (Alshawi 1987), in interpreting LDOCE definitions,
uses a hierarchy of patterns which consist mainly of part-of-speech indicators
and wildcard characters. (Markowitz et al. 1986), (Jensen & Binot 1987), and
(Nakamura & Nagao 1988) also use pattern recognition to extract semantic
relations such as taxonomy from various dictionaries.

Roark and Charniak's work is based on that of Riloff and Shepherd (1997), which
uses noun co-occurrence statistics to indicate nominal category membership for
the purpose of aiding in the construction of semantic lexicons.
 

8. The Validity of Linguistic Assumptions of the Approach

This approach relies heavily on syntactic clues in the discovery of semantic
relations.

For example, the 'such as' pattern is considered to be indicative of isa
relationships. This is in many cases a valid assumption. So from 'automobiles
such as cars and trucks' one can infer, quite reasonably, that a car is an
automobile and so is a truck. However, this does not always work, as in the
case 'a difficult time such as this'. A more interesting case is 'an
embarrassment such as Mr. Shepherd'.

The problem seems to be that people often use language in flexible, creative,
or undisciplined ways and rely on context to fill in what is missing. It is
a challenge in NLP to detect such situations and cope with them successfully.
 

9. Important Files of the Submission

10. Third-Party Software Incorporated

This implementation utilizes the JWordNet software installed under
/afs/ir/class/cs224n/bin/JWordnet during its evaluation process.