References and further reading

An excellent general review of clustering
is (Jain et al., 1999).
Early references for specific HAC algorithms are (King, 1967) (single-link),
(Sneath and Sokal, 1973) (complete-link, GAAC) and
(Lance and Williams, 1967) (discussing a large variety of
hierarchical clustering algorithms). The single-link
algorithm in Figure 17.9 is similar to
*Kruskal's algorithm* for
constructing a
minimum spanning
tree. A graph-theoretical proof of the correctness of
Kruskal's algorithm (which is
analogous to the proof in Section 17.5) is provided by
Cormen et al. (1990, Theorem 23.1). See Exercise 17.10
for the connection between minimum spanning trees and
single-link clusterings.

It is often claimed that hierarchical clustering algorithms produce better clusterings than flat algorithms (Jain and Dubes, 1988, p. 140; Cutting et al., 1992; Larsen and Aone, 1999), although more recently there have been experimental results suggesting the opposite (Zhao and Karypis, 2002). Even without a consensus on average behavior, there is no doubt that the results of EM and K-means are highly variable since they often converge to a local optimum of poor quality. The HAC algorithms we have presented here are deterministic and thus more predictable.

The complexity of complete-link, group-average, and centroid clustering is sometimes given as Θ(N²) (Day and Edelsbrunner, 1984, Murtagh, 1983, Voorhees, 1985b) because a document similarity computation is an order of magnitude more expensive than a simple comparison, the main operation executed in the merging steps after the similarity matrix has been computed.

The centroid algorithm described here is due to
Voorhees (1985b). Voorhees recommends
complete-link and centroid clustering over single-link for a
retrieval application. The Buckshot algorithm was
originally published by Cutting et al. (1993).
Allan et al. (1998) apply single-link clustering to
*first story detection*.

An important HAC technique not discussed here is
*Ward's method*
(El-Hamdouchi and Willett, 1986, Ward Jr., 1963), also
called
*minimum variance clustering*. In each step, it
selects the merge with the smallest RSS
(Chapter 16, page 191). The merge criterion
in Ward's method (a function of all individual distances
from the centroid) is closely related to the merge criterion
in GAAC (a function of all individual similarities to the
centroid).

Despite its importance for making the results of clustering
useful, comparatively little work has been done on labeling
clusters. Popescul and Ungar (2000) obtain good results
with a combination of χ² and collection frequency of a
term. Glover et al. (2002b) use information gain for
labeling clusters of web pages.
Stein and zu Eissen (2004) take an ontology-based approach.
The more complex problem of
labeling nodes in a hierarchy (which requires distinguishing
more general labels for parents from more specific labels
for children) is tackled by Glover et al. (2002a) and
Treeratpituk and Callan (2006). Some clustering
algorithms attempt to find a set of labels first and then
build (often overlapping) clusters around the labels,
thereby avoiding the problem of labeling altogether
(Osinski and Weiss, 2005, Zamir and Etzioni, 1999, Käki, 2005). We know of no
comprehensive study that compares the quality of such
"label-based" clustering to the clustering algorithms
discussed in this chapter and in Chapter 16. In principle,
work on multi-document *summarization*
(McKeown and Radev, 1995) is also applicable to cluster
labeling, but multi-document summaries are usually longer
than the short text fragments needed when labeling
clusters (cf. snippets). Presenting clusters in a way that users can understand is a UI problem. We
recommend reading (Baeza-Yates and Ribeiro-Neto, 1999, ch. 10) for an
introduction to user interfaces in IR.

An example of an efficient divisive algorithm is bisecting
K-means (Steinbach et al., 2000).
*Spectral clustering* algorithms
(Kannan et al., 2000, Dhillon, 2001, Zha et al., 2001, Ng et al., 2001a),
including
*principal direction divisive partitioning* (PDDP)
(whose bisecting decisions are based on
*SVD*; see Chapter 18)
(Boley, 1998, Savaresi and Boley, 2004),
are computationally more expensive than
bisecting
K-means, but have the advantage of being deterministic.

Unlike K-means and EM, most hierarchical clustering algorithms do not have a probabilistic interpretation. Model-based hierarchical clustering (Kamvar et al., 2002, Vaithyanathan and Dom, 2000, Castro et al., 2004) is an exception.

The evaluation methodology described in Section 16.3 is also applicable to hierarchical clustering. Specialized evaluation measures for hierarchies are discussed by Fowlkes and Mallows (1983), Larsen and Aone (1999), and Sahoo et al. (2006).

The R environment (R Development Core Team, 2005) offers good support for
hierarchical clustering. The R function `hclust`
implements single-link, complete-link, group-average, and
centroid clustering, as well as Ward's method. Another option
provided is median clustering, which represents each
cluster by its medoid (cf. k-medoids in
Chapter 16). Support for
clustering vectors in high-dimensional spaces is provided by
the software package CLUTO
(`http://glaros.dtc.umn.edu/gkhome/views/cluto`).
