References and further reading

An excellent general review of clustering
is (Jain et al., 1999).
Early references for specific HAC algorithms are (King, 1967) (single-link),
(Sneath and Sokal, 1973) (complete-link, GAAC) and
(Lance and Williams, 1967) (discussing a large variety of
hierarchical clustering algorithms). The single-link
algorithm in Figure 17.9 is similar to
*Kruskal's algorithm* for
constructing a
minimum spanning
tree. A graph-theoretical proof of the correctness of
Kruskal's algorithm (which is
analogous to the proof in Section 17.5) is provided by
Cormen et al. (1990, Theorem 23.1). See Exercise 17.10
for the connection between minimum spanning trees and
single-link clusterings.

It is often claimed that hierarchical clustering algorithms produce better clusterings than flat algorithms (Jain and Dubes, 1988, p. 140; Cutting et al., 1992; Larsen and Aone, 1999), although more recently there have been experimental results suggesting the opposite (Zhao and Karypis, 2002). Even without a consensus on average behavior, there is no doubt that the results of EM and K-means are highly variable since they often converge to a local optimum of poor quality. The HAC algorithms we have presented here are deterministic and thus more predictable.

The complexity of complete-link, group-average, and centroid clustering is sometimes given as Θ(N²) (Day and Edelsbrunner, 1984, Murtagh, 1983, Voorhees, 1985b) because a document similarity computation is an order of magnitude more expensive than a simple comparison, the main operation executed in the merging steps after the similarity matrix has been computed.

The centroid algorithm described here is due to
Voorhees (1985b). Voorhees recommends
complete-link and centroid clustering over single-link for a
retrieval application. The Buckshot algorithm was
originally published by Cutting et al. (1993).
Allan et al. (1998) apply single-link clustering to
*first story detection*.

An important HAC technique not discussed here is
*Ward's method*
(El-Hamdouchi and Willett, 1986, Ward Jr., 1963), also
called
*minimum variance clustering*. In each step, it
selects the merge with the smallest RSS
(Chapter 16, page 191). The merge criterion
in Ward's method (a function of all individual distances
from the centroid) is closely related to the merge criterion
in GAAC (a function of all individual similarities to the
centroid).

Despite its importance for making the results of clustering
useful, comparatively little work has been done on labeling
clusters. Popescul and Ungar (2000) obtain good results
with a combination of χ² and collection frequency of a
term. Glover et al. (2002b) use information gain for
labeling clusters of web pages.
Stein and zu Eissen (2004) take an ontology-based approach.
The more complex problem of
labeling nodes in a hierarchy (which requires distinguishing
more general labels for parents from more specific labels
for children) is tackled by Glover et al. (2002a) and
Treeratpituk and Callan (2006). Some clustering
algorithms attempt to find a set of labels first and then
build (often overlapping) clusters around the labels,
thereby avoiding the problem of labeling altogether
(Osinski and Weiss, 2005, Zamir and Etzioni, 1999, Käki, 2005). We know of no
comprehensive study that compares the quality of such
"label-based" clustering to the clustering algorithms
discussed in this chapter and in Chapter 16. In principle,
work on multi-document *summarization*
(McKeown and Radev, 1995) is also applicable to cluster
labeling, but multi-document summaries are usually longer
than the short text fragments needed when labeling
clusters (cf. snippets). Presenting clusters in a way that users can understand is a UI problem. We
recommend reading (Baeza-Yates and Ribeiro-Neto, 1999, ch. 10) for an
introduction to user interfaces in IR.

An example of an efficient divisive algorithm is bisecting
K-means (Steinbach et al., 2000).
*Spectral clustering* algorithms
(Kannan et al., 2000, Dhillon, 2001, Zha et al., 2001, Ng et al., 2001a),
including
*principal direction divisive partitioning* (PDDP)
(whose bisecting decisions are based on
*SVD*; see Chapter 18)
(Boley, 1998, Savaresi and Boley, 2004),
are computationally more expensive than
bisecting
K-means, but have the advantage of being deterministic.

Unlike K-means and EM, most hierarchical clustering algorithms do not have a probabilistic interpretation. Model-based hierarchical clustering (Kamvar et al., 2002, Vaithyanathan and Dom, 2000, Castro et al., 2004) is an exception.

The evaluation methodology described in Section 16.3 is also applicable to hierarchical clustering. Specialized evaluation measures for hierarchies are discussed by Fowlkes and Mallows (1983), Larsen and Aone (1999), and Sahoo et al. (2006).

The R environment (R Development Core Team, 2005) offers good support for
hierarchical clustering. The R function `hclust`
implements single-link, complete-link, group-average, and
centroid clustering, as well as Ward's method. Another option
provided is median clustering, which represents each
cluster by its medoid (cf. k-medoids in
Chapter 16). Support for
clustering vectors in high-dimensional spaces is provided by
the software package CLUTO
(`http://glaros.dtc.umn.edu/gkhome/views/cluto`).
