As an alternative to the cost of a manual thesaurus, we could attempt to generate a thesaurus automatically by analyzing a collection of documents. There are two main approaches. One is simply to exploit word cooccurrence. We say that words co-occurring in a document or paragraph are likely to be in some sense similar or related in meaning, and simply count text statistics to find the most similar words. The other approach is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies. For example, we say that entities that are grown, cooked, eaten, and digested, are more likely to be food items. Simply using word cooccurrence is more robust (it cannot be misled by parser errors), but using grammatical relations is more accurate.
The simplest way to compute a co-occurrence thesaurus is based on
term-term similarities. We begin with a term-document matrix , where each cell
is a weighted count
for term
and document
, with weighting so
has length-normalized rows. If we then calculate
, then
is a
similarity score between terms
and
, with a larger number being better.
Figure 9.8 shows an example of a thesaurus
derived in basically this manner, but with an extra step of
dimensionality reduction via Latent Semantic Indexing, which we
discuss in Chapter 18 . While some of the thesaurus terms
are good or at
least suggestive, others are marginal or bad. The quality of the
associations is typically a problem. Term ambiguity easily
introduces irrelevant statistically correlated terms. For example, a query for
Apple computer may expand to Apple red fruit computer.
In general these thesauri suffer from both false positives
and false negatives. Moreover, since the terms in the
automatic thesaurus are highly correlated in documents anyway (and often
the collection used to derive the thesaurus is the same as the one being
indexed), this form of query expansion may not retrieve many additional
documents.
Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field. In general a domain-specific thesaurus is required: general thesauri and dictionaries give far too little coverage of the rich domain-particular vocabularies of most scientific fields. However, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms. For example, if the user searches for interest rate, expanding the query to interest rate fascinate evaluate is unlikely to be useful. Overall, query expansion is less successful than relevance feedback, though it may be as good as pseudo relevance feedback. It does, however, have the advantage of being much more understandable to the system user.
Exercises.