next up previous contents index
Next: A note on terminology. Up: Flat clustering Previous: Clustering in information retrieval   Contents   Index


Problem statement

We can define the goal in hard flat clustering as follows. Given (i) a set of documents $D = \{d_1,\ldots,d_N \}$, (ii) a desired number of clusters $K$, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment $\gamma: D \rightarrow \{1,\ldots,K\}$ that minimizes (or, in other cases, maximizes) the objective function. In most cases, we also demand that $\gamma $ is surjective, i.e., that none of the $K$ clusters is empty.

The objective function is often defined in terms of similarity or distance between documents. Below, we will see that the objective in $K$-means clustering is to minimize the average distance between documents and their centroids or, equivalently, to maximize the similarity between documents and their centroids. The discussion of similarity measures and distance metrics in Chapter 14 (page 14.1 ) also applies to this chapter. As in Chapter 14 , we use both similarity and distance to talk about relatedness between documents.

For documents, the type of similarity we want is usually topic similarity or high values on the same dimensions in the vector space model. For example, documents about China have high values on dimensions like Chinese, Beijing, and Mao whereas documents about the UK tend to have high values for London, Britain and Queen. We approximate topic similarity with cosine similarity or Euclidean distance in vector space (Chapter 6 ). If we intend to capture similarity of a type other than topic, for example, similarity of language, then a different representation may be appropriate. When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters of English (in which the occurs frequently and la infrequently) and French documents (in which the occurs infrequently and la frequently).



Subsections
next up previous contents index
Next: A note on terminology. Up: Flat clustering Previous: Clustering in information retrieval   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07