next up previous contents index
Next: Divisive clustering Up: Hierarchical clustering Previous: Centroid clustering   Contents   Index

Optimality of HAC

To state the optimality conditions of hierarchical clustering precisely, we first define the combination similarity COMB-SIM of a clustering $\Omega =
\{\omega_1,\ldots,\omega_K\}$ as the smallest combination similarity of any of its $K$ clusters:

\mbox{{\sc comb-sim}}(\{\omega_1,\ldots,\omega_K\}) = \min_k \mbox{{\sc comb-sim}}(\omega_k)
\end{displaymath} (210)

Recall that the combination similarity of a cluster $\omega$ that was created as the merge of $\omega_1$ and $\omega_2$ is the similarity of $\omega_1$ and $\omega_2$ (page 17.1 ).

We then define $\Omega =
\{\omega_1,\ldots,\omega_K\}$ to be optimal if all clusterings $\Omega'$ with $k$ clusters, $k \leq K$, have lower combination similarities:

\leq \vert\Omega\vert
\mbox{{\sc comb-sim}}(\Omega')
\leq \mbox{{\sc comb-sim}}(\Omega)
\end{displaymath} (211)

Figure 17.12 shows that centroid clustering is not optimal. The clustering $\{\{d_1,d_2\},\{d_3\}\}$ (for $K=2$) has combination similarity $-(4-\epsilon)$ and $\{\{d_1,d_2,d_3\}\}$ (for $K=1$) has combination similarity -3.46. So the clustering $\{\{d_1,d_2\},\{d_3\}\}$ produced in the first merge is not optimal since there is a clustering with fewer clusters ( $\{\{d_1,d_2,d_3\}\}$) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.

The above definition of optimality would be of limited use if it was only applicable to a clustering together with its merge history. However, we can show (Exercise 17.5 ) that for the three non-inversion algorithms can be read off from the cluster without knowing its history. These direct definitions of combination similarity are as follows.

The combination similarity of a cluster $\omega$ is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:
\mbox{{\sc comb-sim}}(\omega)= \min_{\{\omega' : \omega' \su...
\max_{d_j \in \omega - \omega'}
\mbox{\sc sim} (d_i,d_j)
\end{displaymath} (212)

where each $\langle\omega',\omega - \omega'\rangle$ is a bipartition of $\omega$.
The combination similarity of a cluster $\omega$ is the smallest similarity of any two points in $\omega$: $\min_{d_i \in \omega} \min_{d_j \in \omega} \mbox{\sc sim} (d_i,d_j)$.
The combination similarity of a cluster $\omega$ is the average of all pairwise similarities in $\omega$ (where self-similarities are not included in the average): Equation 205.
If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.

We can now prove the optimality of single-link clustering by induction over the number of clusters $K$. We will give a proof for the case where no two pairs of documents have the same similarity, but it can easily be extended to the case with ties.

The inductive basis of the proof is that a clustering with $K=N$ clusters has combination similarity 1.0, which is the largest value possible. The induction hypothesis is that a single-link clustering $\Omega_K$ with $K$ clusters is optimal:

$\mbox{{\sc comb-sim}} ( \Omega_{K} ) \geq \mbox{{\sc comb-sim}} (
\Omega_{K}' )$ for all $\Omega_{K}'$. Assume for contradiction that the clustering $\Omega_{K-1}$ we obtain by merging the two most similar clusters in $\Omega_K$ is not optimal and that instead a different sequence of merges $\Omega_K',\Omega_{K-1}'$ leads to the optimal clustering with $K-1$ clusters. We can write the assumption that $\Omega_{K-1}'$ is optimal and that $\Omega_{K-1}$ is not as $\mbox{{\sc comb-sim}} ( \Omega_{K-1}' ) > \mbox{{\sc comb-sim}} (\Omega_{K-1} )$.

Case 1: The two documents linked by $s=\mbox{{\sc comb-sim}} (
\Omega_{K-1}')$ are in the same cluster in $
\Omega_{K}$. They can only be in the same cluster if a merge with similarity smaller than $s$ has occurred in the merge sequence producing $\Omega_K$. This implies $s
> \mbox{{\sc comb-sim}} (
\Omega_{K})$. Thus, $\mbox{{\sc comb-sim}} (
= s
> \mbox{{\sc comb-sim}} (
... \mbox{{\sc comb-sim}} (
> \mbox{{\sc comb-sim}} (
$. Contradiction.

Case 2: The two documents linked by $s=\mbox{{\sc comb-sim}} (
\Omega_{K-1}')$ are not in the same cluster in $
\Omega_{K}$. But $s = \mbox{{\sc comb-sim}} (
\Omega_{K-1}')>\mbox{{\sc comb-sim}} (
\Omega_{K-1})$, so the single-link merging rule should have merged these two clusters when processing $
\Omega_{K}$. Contradiction.

Thus, $\Omega_{K-1}$ is optimal.

In contrast to single-link clustering, complete-link clustering and GAAC are not optimal as this example shows:


Both algorithms merge the two points with distance 1 ($d_2$ and $d_3$) first and thus cannot find the two-cluster clustering $\{ \{ d_1,d_2 \}, \{d_3,d_4\} \}$. But $\{ \{ d_1,d_2 \}, \{d_3,d_4\} \}$ is optimal on the optimality criteria of complete-link clustering and GAAC.

However, the merge criteria of complete-link clustering and GAAC approximate the desideratum of approximate sphericity better than the merge criterion of single-link clustering. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.

Table 17.1: Comparison of HAC algorithms.
method combination similarity time compl. optimal? comment
single-link max inter-similarity of any 2 docs $\Theta(N^2)$ yes chaining effect
complete-link min inter-similarity of any 2 docs $
\Theta(N^2 \log N)$ no sensitive to outliers
group-average average of all sims $
\Theta(N^2 \log N)$ no best choice for
most applications
centroid average inter-similarity $
\Theta(N^2 \log N)$ no inversions can occur

Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter. We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers and from inversions.

There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.

Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space - and not global structure - that is important in this case.

Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 (page 19.6 ) where single-link clustering is used in the guise of the union-find algorithm . Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away and single-link clustering is a good choice for duplicate detection.


next up previous contents index
Next: Divisive clustering Up: Hierarchical clustering Previous: Centroid clustering   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.