The definition assumes that documents are represented as length-normalized vectors in a real-valued space in the familiar way. We used centroids for Rocchio classification in Chapter 14 (page 14.2 ). They play a similar role here. The ideal cluster in -means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. Our desiderata for classes in Rocchio classification were the same. The difference is that we have no labeled training set in clustering for which we know which documents should be in the same cluster.
A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS , the squared distance of each vector from its centroid summed over all vectors:
We can apply one of the following termination conditions.
We now show that -means converges by proving that
monotonically decreases in each
iteration. We will use decrease in the meaning
decrease or does not change in this section.
First, RSS decreases in the reassignment step
since each vector is assigned to the closest centroid, so
the distance it contributes to decreases.
Second, it decreases in the recomputation step
because the new centroid is the vector
for which reaches its minimum.
Since there is only a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. Take care, however, to break ties consistently, e.g., by assigning a document to the cluster with the lowest index if there are several equidistant centroids. Otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost.
While this proves the convergence of -means, there is unfortunately no guarantee that a global minimum in the objective function will be reached. This is a particular problem if a document set contains many outliers , documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Figure 16.7 shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.
Another type of suboptimal clustering that frequently occurs is one with empty clusters (Exercise 16.7 ).
Effective heuristics for seed selection include (i) excluding outliers from the seed set; (ii) trying out multiple starting points and choosing the clustering with lowest cost; and (iii) obtaining seeds from another method such as hierarchical clustering. Since deterministic hierarchical clustering methods are more predictable than -means, a hierarchical clustering of a small random sample of size (e.g., for or ) often provides good seeds (see the description of the Buckshot algorithm, Chapter 17 , page 17.8 ).
Other initialization methods compute seeds that are not selected from the vectors to be clustered. A robust method that works well for a large variety of document distributions is to select (e.g., ) random vectors for each cluster and use their centroid as the seed for this cluster. See Section 16.6 for more sophisticated initializations.
What is the time complexity of -means? Most of the time is spent on computing vector distances. One such operation costs . The reassignment step computes distances, so its overall complexity is . In the recomputation step, each vector gets added to a centroid once, so the complexity of this step is . For a fixed number of iterations , the overall complexity is therefore . Thus, -means is linear in all relevant factors: iterations, number of clusters, number of vectors and dimensionality of the space. This means that -means is more efficient than the hierarchical algorithms in Chapter 17 . We had to fix the number of iterations , which can be tricky in practice. But in most cases, -means quickly reaches either complete convergence or a clustering that is close to convergence. In the latter case, a few documents would switch membership if further iterations were computed, but this has a small effect on the overall quality of the clustering.
There is one subtlety in the preceding argument. Even a linear algorithm can be quite slow if one of the arguments of is large, and usually is large. High dimensionality is not a problem for computing the distance between two documents. Their vectors are sparse, so that only a small fraction of the theoretically possible componentwise differences need to be computed. Centroids, however, are dense since they pool all terms that occur in any of the documents of their clusters. As a result, distance computations are time consuming in a naive implementation of -means. However, there are simple and effective heuristics for making centroid-document similarities as fast to compute as document-document similarities. Truncating centroids to the most significant terms (e.g., ) hardly decreases cluster quality while achieving a significant speedup of the reassignment step (see references in Section 16.6 ).
The same efficiency problem is addressed by K-medoids , a variant of -means that computes medoids instead of centroids as cluster centers. We define the medoid of a cluster as the document vector that is closest to the centroid. Since medoids are sparse document vectors, distance computations are fast.