Term-document matrices and singular value decompositions

**Theorem.**
Let be the rank of the
matrix . Then, there is a *singular-value decomposition* ( *SVD* for short) of of the form

- The eigenvalues of are the same as the eigenvalues of ;
- For , let , with . Then the matrix is composed by setting for , and zero otherwise.

The values are referred to as the *singular values* of . It is instructive to examine the relationship of Theorem 18.2 to Theorem 18.1.1; we do this rather than derive the general proof of Theorem 18.2, which is beyond the scope of this book.

By multiplying Equation 232 by its transposed version, we have

Note now that in Equation 233, the left-hand side is a square symmetric matrix real-valued matrix, and the right-hand side represents its *symmetric diagonal decomposition* as in Theorem 18.1.1. What does the left-hand side
represent? It is a square matrix with a row and a column corresponding to each of the terms. The entry in the matrix is a measure of the overlap between the th and th terms, based on their co-occurrence in documents. The precise mathematical meaning depends on the manner in which is constructed based on term weighting. Consider the case where is the term-document *incidence matrix* of page 1.1 , illustrated in Figure 1.1 . Then the entry in
is the number of documents in which both term and term occur.

When writing down the numerical values of the SVD, it is conventional to represent as an matrix with the singular values on the diagonals, since all its entries outside this sub-matrix are zeros. Accordingly, it is conventional to omit the rightmost columns of corresponding to these omitted rows of ; likewise the rightmost columns of are omitted since they correspond in to the rows that will be multiplied by the columns of zeros in . This written form of the SVD is sometimes known as the *reduced SVD* or *truncated SVD* and we will encounter it again in Exercise 18.3 . Henceforth, our numerical examples and exercises will use this reduced form.

**Worked example.**
We now illustrate the singular-value decomposition of a matrix of rank 2; the singular values are
and .

**End worked example.**

As with the matrix decompositions defined in Section 18.1.1 , the singular value decomposition of a matrix can be computed by a variety of algorithms, many of which have been publicly available software implementations; pointers to these are given in Section 18.5 .

**Exercises.**

- Let

be the term-document incidence matrix for a collection. Compute the co-occurrence matrix . What is the interpretation of the diagonal entries of when is a term-document incidence matrix? - Verify that the SVD of the matrix in Equation 235 is

by verifying all of the properties in the statement of Theorem 18.2. - Suppose that is a binary term-document incidence matrix. What do the entries of
represent?
- Let

be a term-document matrix whose entries are term frequencies; thus term 1 occurs 2 times in document 2 and once in document 3. Compute ; observe that its entries are largest where two terms have their most frequent occurrences together in the same document.

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07