References and further reading

Strang (1986) provides an excellent introductory overview
of matrix decompositions including the singular value
decomposition. Theorem 18.3 is due to
Eckart and Young (1936). The connection between information
retrieval and low-rank approximations of the term-document
matrix was introduced in Deerwester et al. (1990), with a subsequent
survey of results in Berry et al. (1995). Dumais (1993)
and Dumais (1995) describe experiments on TREC benchmarks
giving evidence that at least on some benchmarks, LSI can
produce better precision and recall than standard
vector-space
retrieval. `http://www.cs.utk.edu/ ~berry/lsi++/`and

Hofmann (1999a;b) provides an initial
probabilistic extension of the basic latent semantic indexing
technique. A more satisfactory formal basis for a probabilistic latent
variable model for dimensionality reduction is the *Latent
Dirichlet Allocation* ( *LDA* ) model
(Blei et al., 2003), which is generative and assigns probabilities to
documents outside of the training set. This model is extended to a
hierarchical clustering by Rosen-Zvi et al. (2004).
Wei and Croft (2006) present the first large scale evaluation of LDA,
finding it to significantly outperform the query likelihood model of
Section 12.2 (page ), but to not perform quite as well as the relevance
model mentioned in Section 12.4 (page ) - but the latter does
additional per-query processing unlike LDA. Teh et al. (2006)
generalize further by presenting *Hierarchical Dirichlet
Processes* , a probabilistic model which allows a group (for us, a
document) to be drawn from an infinite mixture of latent topics, while
still allowing these topics to be shared across documents.

**Exercises.**

- Assume you have a set of documents each of which is in either English or in Spanish. The collection is given in Figure 18.4 .
Figure 18.5 gives a glossary relating the Spanish and English words above for your own information. This glossary is NOT available to the retrieval system:

- Construct the appropriate term-document matrix to use for a collection consisting of these documents. For simplicity, use raw term frequencies rather than normalized tf-idf weights. Make sure to clearly label the dimensions of your matrix.
- Write down the matrices and and from these derive the rank 2 approximation .
- State succinctly what the entry in the matrix represents.
- State succinctly what the entry in the matrix represents, and why it differs from that in .

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07