Flat clustering is efficient and conceptually simple, but as
we saw in Chapter 16 it has a number of
drawbacks.
The algorithms introduced in Chapter 16
return a flat unstructured set of clusters,
require a prespecified number of clusters as input and are
nondeterministic.
Hierarchical clustering (or
hierarchic clustering ) outputs a hierarchy, a structure that is more
informative than the unstructured set of clusters returned by flat
clustering.Hierarchical clustering does not require us to prespecify
the number of clusters and most hierarchical algorithms that
have been used in IR are
deterministic. These advantages of hierarchical clustering
come at the cost of lower efficiency. The most common
hierarchical clustering algorithms have a complexity that is
at least quadratic in the number of documents compared to
the linear complexity of
-means and EM (cf. Section 16.4 , page 16.4 ).
This chapter first introduces agglomerative hierarchical clustering (Section 17.1 ) and presents four different agglomerative algorithms, in Sections 17.2 -17.4 , which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5 . Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8 . Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.
There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 16.1 ; see also Section 16.6 , page 16.6 ). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9 ).