In low dimensions, more aggressive optimizations are possible that make the computation of most pairwise similarities unnecessary (Exercise 17.10). However, no such algorithms are known in higher dimensions. We encountered the same problem in kNN classification (see Section 14.7).
When using GAAC on a large document set in high dimensions, we have to take care to avoid dense centroids. For dense centroids, clustering can take time $\Theta(MN^2\log N)$, where $M$ is the size of the vocabulary, whereas complete-link clustering is $\Theta(M_{\text{ave}}N^2\log N)$, where $M_{\text{ave}}$ is the average size of the vocabulary of a document. So for large vocabularies complete-link clustering can be more efficient than an unoptimized implementation of GAAC.
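As a rough illustration (these numbers are hypothetical, not from the text), suppose the collection vocabulary has $M = 400{,}000$ terms while an average document contains only $M_{\text{ave}} = 250$ distinct terms. Each similarity computation on dense centroids is then more expensive than one on sparse document vectors by a factor of

$$\frac{M}{M_{\text{ave}}} = \frac{400{,}000}{250} = 1600,$$

so the unoptimized GAAC run would be roughly three orders of magnitude slower than complete-link clustering on the same collection.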
We discussed this problem in the context of $K$-means clustering in Chapter 16 and suggested two solutions: truncating centroids (keeping only highly weighted terms) and representing clusters by means of sparse medoids instead of dense centroids. These optimizations can also be applied to GAAC and centroid clustering.
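A minimal sketch of these two optimizations in Python, assuming documents are sparse tf-idf vectors stored as term-to-weight dictionaries (all function names here are illustrative, not from the text):

```python
import math
from collections import defaultdict

def truncated_centroid(docs, m=50):
    """Average the cluster's document vectors, then keep only the
    m most highly weighted terms so the centroid stays sparse."""
    sums = defaultdict(float)
    for doc in docs:                          # doc: {term: weight}
        for term, w in doc.items():
            sums[term] += w
    centroid = {t: s / len(docs) for t, s in sums.items()}
    top = sorted(centroid.items(), key=lambda tw: tw[1], reverse=True)[:m]
    return dict(top)

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u                           # iterate the shorter one
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def medoid(docs):
    """Represent the cluster by the member document with the highest
    total similarity to the others; a document is sparse by nature."""
    return max(docs, key=lambda d: sum(cosine(d, e) for e in docs))
```

In both cases the cluster representative stays about as sparse as an ordinary document, so each similarity computation costs $\Theta(M_{\text{ave}})$ rather than $\Theta(M)$.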
Even with these optimizations, HAC algorithms are all $\Theta(N^2)$ or $\Theta(N^2\log N)$ and therefore infeasible for large sets of 1,000,000 or more documents.
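To make the infeasibility concrete: at $N = 10^6$ documents, even the cheaper $\Theta(N^2)$ bound corresponds to on the order of

$$N^2 = (10^6)^2 = 10^{12}$$

similarity computations.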
For such large sets, HAC can only be used in combination with a flat clustering algorithm like $K$-means. Recall that $K$-means requires a set of seeds as initialization (Figure 16.5). If these seeds are badly chosen, then the resulting clustering will be of poor quality. We can employ an HAC algorithm to compute seeds of high quality. If the HAC algorithm is applied to a document subset of size $\sqrt{N}$, then the overall runtime of $K$-means cum HAC seed generation is $\Theta(N)$. This is because the application of a quadratic algorithm to a sample of size $\sqrt{N}$ has an overall complexity of $\Theta(N)$. An appropriate adjustment can be made for an $\Theta(N^2\log N)$ algorithm to guarantee linearity.
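Spelled out, the arithmetic is

$$\Theta\!\big((\sqrt{N})^2\big) = \Theta(N), \qquad \Theta\!\big((\sqrt{N})^2\log\sqrt{N}\big) = \Theta(N\log N).$$

The second bound shows why the adjustment is needed for $\Theta(N^2\log N)$ algorithms: one possibility (an illustration, not prescribed by the text) is to shrink the sample to $\sqrt{N/\log N}$ documents, since $(N/\log N)\log\sqrt{N/\log N} = \Theta(N)$.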
This algorithm is referred to as the Buckshot algorithm. It combines the determinism and higher reliability of HAC with the efficiency of $K$-means.
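To make the recipe concrete, here is a sketch of the Buckshot idea in Python, seeding $K$-means with HAC run on a $\sqrt{N}$-sized sample. It leans on scikit-learn's AgglomerativeClustering and KMeans as stand-ins for the HAC and $K$-means algorithms of this chapter and Chapter 16; the function name and parameters are illustrative, not from the text:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def buckshot(X, K, seed=0):
    """Cluster the N rows of the (dense) document-vector matrix X
    into K clusters: HAC on a sqrt(N) sample yields the seeds,
    K-means over the full collection does the rest."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_sample = max(int(np.sqrt(N)), K)        # need at least K documents
    sample = X[rng.choice(N, size=n_sample, replace=False)]

    # The quadratic HAC cost is paid only on the sample:
    # (sqrt(N))^2 = N, keeping the overall runtime linear.
    hac = AgglomerativeClustering(n_clusters=K, linkage="average")
    labels = hac.fit_predict(sample)

    # The centroid of each HAC cluster becomes a K-means seed.
    seeds = np.vstack([sample[labels == k].mean(axis=0) for k in range(K)])

    # A single K-means run from these high-quality seeds.
    km = KMeans(n_clusters=K, init=seeds, n_init=1)
    return km.fit_predict(X)
```

The only randomness left is the sampling step; given a fixed sample, the HAC seeds, and hence the $K$-means starting point, are deterministic.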