To state the optimality conditions of hierarchical
clustering precisely, we first define the combination
similarity COMB-SIM of a clustering
as the
smallest combination
similarity of any of its
clusters:
![]() |
(210) |
We then define
to be
optimal
if all clusterings
with
clusters,
, have lower combination
similarities:
![]() |
(211) |
Figure 17.12 shows that
centroid clustering is not optimal. The clustering
(for
) has combination
similarity
and
(for
) has combination similarity -3.46.
So the clustering
produced in the first merge
is not optimal since there is a clustering with
fewer clusters (
) that has higher combination
similarity. Centroid clustering is not optimal because
inversions can occur.
The above definition of optimality would be of limited use if it was only applicable to a clustering together with its merge history. However, we can show (Exercise 17.5 ) that for the three non-inversion algorithms can be read off from the cluster without knowing its history. These direct definitions of combination similarity are as follows.
![]() |
(212) |
We can now prove the optimality of single-link
clustering by induction over the number of clusters . We
will give a proof for the case where no two pairs of
documents have the same similarity,
but it can easily be extended to the case with ties.
The inductive basis of the proof is that a clustering with
clusters has combination similarity 1.0, which
is the largest value possible. The
induction hypothesis is that a single-link clustering
with
clusters
is optimal:
for all
.
Assume for contradiction that the clustering
we obtain by merging
the two most similar clusters in
is not optimal
and that instead a different sequence of merges
leads to the optimal clustering
with
clusters.
We can write the assumption that
is optimal
and that
is not
as
.
Case 1: The two documents linked by
are in the same cluster in
. They can only be in the same cluster if a
merge with similarity smaller than
has occurred in the merge sequence producing
.
This implies
. Thus,
. Contradiction.
Case 2: The two documents linked by
are not in the same cluster in
. But
, so
the single-link merging rule should have merged these two
clusters when processing
. Contradiction.
Thus, is optimal.
In contrast to single-link clustering, complete-link clustering and GAAC are not optimal as this example shows:
Both algorithms merge the two points with distance 1 (
and
) first and thus cannot find the two-cluster
clustering
.
But
is optimal on the
optimality criteria of complete-link clustering and GAAC.
However, the merge criteria of complete-link clustering and GAAC approximate the desideratum of approximate sphericity better than the merge criterion of single-link clustering. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.
Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter. We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers and from inversions.
There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.
Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space - and not global structure - that is important in this case.
Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 (page 19.6 ) where single-link clustering is used in the guise of the union-find algorithm . Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away and single-link clustering is a good choice for duplicate detection.
Exercises.