Optimality of HAC

To state the optimality conditions of hierarchical
clustering precisely, we first define the *combination
similarity* COMB-SIM of a clustering
as the
smallest combination
similarity of any of its clusters:

(210) |

We then define
to be
*optimal*
if all clusterings with
clusters, , have lower combination
similarities:

(211) |

Figure 17.12 shows that centroid clustering is not optimal. The clustering (for ) has combination similarity and (for ) has combination similarity -3.46. So the clustering produced in the first merge is not optimal since there is a clustering with fewer clusters ( ) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.

The above definition of optimality would be of limited use if it was only applicable to a clustering together with its merge history. However, we can show (Exercise 17.5 ) that for the three non-inversion algorithms can be read off from the cluster without knowing its history. These direct definitions of combination similarity are as follows.

**single-link**- The combination similarity of a cluster
is the smallest similarity of any bipartition of
the cluster, where the similarity of a bipartition is the
largest similarity between any two documents from the two parts:

(212) **complete-link**- The combination similarity of a cluster is the smallest similarity of any two points in : .
**GAAC**- The combination similarity of a cluster is the average of all pairwise similarities in (where self-similarities are not included in the average): Equation 205.

We can now prove the optimality of single-link clustering by induction over the number of clusters . We will give a proof for the case where no two pairs of documents have the same similarity, but it can easily be extended to the case with ties.

The inductive basis of the proof is that a clustering with clusters has combination similarity 1.0, which is the largest value possible. The induction hypothesis is that a single-link clustering with clusters is optimal:

for all . Assume for contradiction that the clustering we obtain by merging the two most similar clusters in is not optimal and that instead a different sequence of merges leads to the optimal clustering with clusters. We can write the assumption that is optimal and that is not as .

Case 1: The two documents linked by are in the same cluster in . They can only be in the same cluster if a merge with similarity smaller than has occurred in the merge sequence producing . This implies . Thus, . Contradiction.

Case 2: The two documents linked by are not in the same cluster in . But , so the single-link merging rule should have merged these two clusters when processing . Contradiction.

Thus, is optimal.

In contrast to single-link clustering, complete-link clustering and GAAC are not optimal as this example shows:

Both algorithms merge the two points with distance 1 ( and ) first and thus cannot find the two-cluster clustering . But is optimal on the optimality criteria of complete-link clustering and GAAC.

However, the merge criteria of complete-link clustering and GAAC approximate the desideratum of approximate sphericity better than the merge criterion of single-link clustering. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.

Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter. We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers and from inversions.

There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.

Second, in
some applications
the purpose of clustering is not to create a complete
hierarchy or exhaustive partition of the entire document set.
For instance, *first story detection* or
*novelty detection*
is the task of
detecting the first occurrence of an event in a stream of news stories. One
approach to this task is to find a tight cluster
within the documents that were sent across the wire in a
short period of time and are dissimilar from all previous
documents. For example, the documents sent over the wire in
the minutes after the World Trade Center attack on September
11, 2001 form
such a cluster. Variations of single-link clustering can do
well on this task since it is the structure of small parts
of the vector space - and not global structure - that is important in this case.

Similarly, we will describe an approach to duplicate
detection on the web in Section 19.6
(page 19.6 ) where single-link clustering is used
in the guise of the *union-find
algorithm* . Again, the decision whether a group of
documents are duplicates of each other is not influenced by
documents that are located far away and single-link
clustering is a good choice for duplicate detection.

**Exercises.**

- Show the equivalence
of the two definitions of combination similarity: the
process definition
on page 17.1
and the static definition
on page 17.5 .

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07