Hierarchical clustering

Flat clustering is efficient and conceptually simple, but as
we saw in Chapter 16 it has a number of
drawbacks.
The algorithms introduced in Chapter 16
return a flat unstructured set of clusters,
require a prespecified number of clusters as input and are
nondeterministic.
*Hierarchical clustering* (or
*hierarchic clustering* ) outputs a hierarchy, a structure that is more
informative than the unstructured set of clusters returned by flat
clustering.^{}Hierarchical clustering does not require us to prespecify
the number of clusters and most hierarchical algorithms that
have been used in IR are
deterministic. These advantages of hierarchical clustering
come at the cost of lower efficiency. The most common
hierarchical clustering algorithms have a complexity that is
at least quadratic in the number of documents compared to
the linear complexity of -means and EM (cf. Section 16.4 , page 16.4 ).

This chapter first introduces *agglomerative*
hierarchical clustering (Section 17.1 ) and presents
four different agglomerative algorithms, in
Sections 17.2 -17.4 , which differ in the
similarity measures they employ: single-link, complete-link,
group-average, and centroid similarity. We then discuss the
optimality conditions of hierarchical clustering in
Section 17.5 .
Section 17.6 introduces
top-down (or *divisive*) hierarchical
clustering.
Section 17.7 looks at
labeling clusters automatically, a problem that must be solved whenever
humans interact with the output of clustering. We
discuss
implementation issues in
Section 17.8 . Section 17.9 provides
pointers to further reading, including references to
soft
hierarchical clustering, which we do
not cover in this book.

There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 16.1 ; see also Section 16.6 , page 16.6 ). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9 ).

- Hierarchical agglomerative clustering
- Single-link and complete-link clustering

- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- References and further reading
- Exercises

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07