Hierarchical agglomerative clustering

A dendrogram
of a
single-link clustering of 30 documents from Reuters-RCV1.
Two possible cuts of the dendrogram are shown: at 0.4 into 24
clusters and at 0.1 into 12 clusters.

Before looking at specific similarity measures used in HAC in Sections 17.2 -17.4 , we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs and present a simple algorithm for computing an HAC.

An HAC clustering is typically
visualized as a *dendrogram* as shown in
Figure 17.1 . Each merge is represented by a horizontal line.
The y-coordinate of the horizontal line is the
similarity of the two clusters that were
merged,
where documents are viewed as singleton clusters.
We call this similarity
the
*combination similarity* of the merged cluster.
For example, the combination similarity
of the cluster consisting of Lloyd's CEO questioned and
Lloyd's chief / U.S. grilling in Figure 17.1 is .
We define the combination similarity of a
singleton cluster as its document's self-similarity
(which is 1.0 for cosine similarity).

By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.

A fundamental assumption in HAC is that the merge operation
is *monotonic* . Monotonic means that if
are the combination
similarities of the successive merges of an HAC, then
holds. A non-monotonic hierarchical clustering
contains at least one *inversion* and
contradicts the fundamental assumption that we

chose the best merge available at each step. We will see an example of an inversion in Figure 17.12 .

Hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters just as in flat clustering. In those cases, the hierarchy needs to be cut at some point. A number of criteria can be used to determine the cutting point:

- Cut at a prespecified level of similarity. For
example, we cut the dendrogram at 0.4 if we want clusters with a
minimum combination similarity of 0.4.
In Figure 17.1 , cutting the
diagram at yields 24 clusters (grouping only
documents with high similarity together) and cutting it at
yields 12 clusters (one large financial news cluster and 11
smaller clusters).
- Cut the dendrogram where the gap between two successive combination similarities is largest. Such large gaps arguably indicate ``natural'' clusterings. Adding one more cluster decreases the quality of the clustering significantly, so cutting before this steep decrease occurs is desirable. This strategy is analogous to looking for the knee in the -means graph in Figure 16.8 (page 16.8 ).
- Apply Equation 195 (page 16.4.1 ):

where refers to the cut of the hierarchy that results in clusters, RSS is the residual sum of squares and is a penalty for each additional cluster. Instead of RSS, another measure of distortion can be used. - As in flat clustering, we can also prespecify the number of clusters and select the cutting point that produces clusters.

A simple, naive HAC algorithm is shown in
Figure 17.2 .
We first compute the
similarity matrix .
The algorithm then executes
steps of merging the currently most
similar clusters.
In each iteration,
the two most similar clusters are merged and the rows and columns of the
merged cluster in are updated.^{}The clustering is stored as a list of merges in .
indicates which clusters are still available to be
merged. The function
SIM computes the similarity of cluster with
the merge of clusters and . For some HAC algorithms,
SIM is simply a function of and
, for example,
the maximum of these two values for single-link.

We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2 ) and group-average and centroid clustering ( and 17.4 ). The merge criteria of these four variants of HAC are shown in Figure 17.3 .

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07