Optimality of HAC

To state the optimality conditions of hierarchical clustering precisely, we first define the combination similarity COMB-SIM of a clustering $\Omega = \{\omega_1,\ldots,\omega_K\}$ as the smallest combination similarity of any of its $K$ clusters:

$\begin{displaymath} \mbox{{\sc comb-sim}}(\{\omega_1,\ldots,\omega_K\}) = \min_k \mbox{{\sc comb-sim}}(\omega_k) \end{displaymath}$

(210)

We then define $\Omega = \{\omega_1,\ldots,\omega_K\}$ to be optimal if all clusterings $\Omega'$ with $k$ clusters, $k \leq K$, have combination similarities that are no higher:

$\begin{displaymath} \vert\Omega'\vert \leq \vert\Omega\vert \Rightarrow \mbox{{\sc comb-sim}}(\Omega') \leq \mbox{{\sc comb-sim}}(\Omega) \end{displaymath}$

(211)

Figure 17.12 shows that centroid clustering is not optimal. The clustering $\{\{d_1,d_2\},\{d_3\}\}$ (for $K=2$) has combination similarity $-(4-\epsilon)$ and $\{\{d_1,d_2,d_3\}\}$ (for $K=1$) has combination similarity $-3.46$. So the clustering $\{\{d_1,d_2\},\{d_3\}\}$ produced in the first merge is not optimal, since there is a clustering with fewer clusters ($\{\{d_1,d_2,d_3\}\}$) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.
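The inversion can be reproduced numerically. The sketch below is only an illustration: the coordinates of $d_1$, $d_2$, $d_3$ and the value of `EPS` are assumptions chosen to match the two combination similarities quoted above (they are not given in this section), and similarity is taken to be negative Euclidean distance.

```python
import math

EPS = 0.1  # hypothetical value for epsilon; the text only requires it to be small

# Hypothetical coordinates chosen to reproduce -(4 - eps) and -3.46.
d1 = (1.0 + EPS, 1.0)
d2 = (5.0, 1.0)
d3 = (3.0, 1.0 + 2.0 * math.sqrt(3.0))


def sim(p, q):
    """Similarity as negative Euclidean distance (an assumption of this sketch)."""
    return -math.dist(p, q)


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


# First merge: the most similar pair of single documents.
pair_sims = {
    ("d1", "d2"): sim(d1, d2),
    ("d1", "d3"): sim(d1, d3),
    ("d2", "d3"): sim(d2, d3),
}
first_pair = max(pair_sims, key=pair_sims.get)
first_merge_sim = pair_sims[first_pair]

# Second merge: the centroid of {d1, d2} against d3
# (valid here because the first merge joins d1 and d2).
second_merge_sim = sim(centroid([d1, d2]), d3)

print(first_pair, round(first_merge_sim, 2))
print("second merge:", round(second_merge_sim, 2))
print("inversion:", second_merge_sim > first_merge_sim)
```

The second merge has a higher similarity than the first, so the one-cluster clustering beats the two-cluster clustering on combination similarity, which is exactly the violation of the optimality condition described above.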

The above definition of optimality would be of limited use if it were only applicable to a clustering together with its merge history. However, we can show (Exercise 17.5) that, for the three non-inversion algorithms, combination similarity can be read off from the cluster itself without knowing its history. With such direct definitions, optimality is a property of a set of clusters and not of the process that produces it.
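As a sketch, and as an assumption about their exact form since the definitions are not spelled out in this section, they can be written as follows for a cluster $\omega$ with at least two documents, where $\mbox{{\sc sim}}(d_i,d_j)$ denotes the similarity of two documents. For single-link, the combination similarity is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between two documents from different parts:

$\begin{displaymath} \mbox{{\sc comb-sim}}(\omega) = \min_{\emptyset \neq \omega' \subset \omega} \; \max_{d_i \in \omega'} \; \max_{d_j \in \omega - \omega'} \mbox{{\sc sim}}(d_i,d_j) \end{displaymath}$

For complete-link, it is the smallest similarity of any two documents in the cluster:

$\begin{displaymath} \mbox{{\sc comb-sim}}(\omega) = \min_{d_i \in \omega} \; \min_{d_j \in \omega} \mbox{{\sc sim}}(d_i,d_j) \end{displaymath}$

For GAAC, it is the average of all pairwise similarities in the cluster, with self-similarities excluded:

$\begin{displaymath} \mbox{{\sc comb-sim}}(\omega) = \frac{1}{\vert\omega\vert(\vert\omega\vert-1)} \sum_{d_i \in \omega} \; \sum_{d_j \in \omega, d_j \neq d_i} \mbox{{\sc sim}}(d_i,d_j) \end{displaymath}$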

We can now prove the optimality of single-link clustering by induction over the number of clusters $K$. We will give a proof for the case where no two pairs of documents have the same similarity, but it can easily be extended to the case with ties.

The inductive basis of the proof is that a clustering with $K = N$ clusters (each document in its own cluster) has combination similarity 1.0, which is the largest value possible. The induction hypothesis is that a single-link clustering $\Omega_K$ with $K$ clusters is optimal: $\mbox{{\sc comb-sim}} ( \Omega_{K} ) \geq \mbox{{\sc comb-sim}} ( \Omega_{K}' )$ for all $\Omega_{K}'$. Assume for contradiction that the clustering $\Omega_{K-1}$ we obtain by merging the two most similar clusters in $\Omega_K$ is not optimal and that instead a different sequence of merges $\Omega_K',\Omega_{K-1}'$ leads to the optimal clustering with $K-1$ clusters. We can write the assumption that $\Omega_{K-1}'$ is optimal and that $\Omega_{K-1}$ is not as $\mbox{{\sc comb-sim}} ( \Omega_{K-1}' ) > \mbox{{\sc comb-sim}} (\Omega_{K-1} )$.

Case 1: The two documents linked by $s=\mbox{{\sc comb-sim}} ( \Omega_{K-1}')$ are in the same cluster in $\Omega_{K}$. They can only be in the same cluster if a merge with similarity smaller than $s$ has occurred in the merge sequence producing $\Omega_K$. This implies $s > \mbox{{\sc comb-sim}} ( \Omega_{K})$. Thus, $\mbox{{\sc comb-sim}} ( \Omega_{K-1}') = s > \mbox{{\sc comb-sim}} ( \Omega_{K}) \geq \mbox{{\sc comb-sim}} ( \Omega_{K}') > \mbox{{\sc comb-sim}} ( \Omega_{K-1}')$, where the second step is the induction hypothesis and the third holds because the merge that turns $\Omega_{K}'$ into $\Omega_{K-1}'$ lowers the combination similarity. Contradiction.

Case 2: The two documents linked by $s=\mbox{{\sc comb-sim}} ( \Omega_{K-1}')$ are not in the same cluster in $\Omega_{K}$ . But $s = \mbox{{\sc comb-sim}} ( \Omega_{K-1}')>\mbox{{\sc comb-sim}} ( \Omega_{K-1})$ , so the single-link merging rule should have merged these two clusters when processing $\Omega_{K}$ . Contradiction.
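The claim can also be checked by brute force on a small example. The following is a minimal sketch, not part of the proof: the five points on a line in `POINTS` are hypothetical (picked so that no two pairs have the same distance), similarity is assumed to be negative distance so that the largest possible value is 0 rather than 1.0, and the combination similarity of a cluster is computed from the bipartition characterization of single-link (the smallest, over all bipartitions, of the largest similarity across the two parts).

```python
from itertools import combinations

POINTS = [0, 1, 3, 7, 15]  # hypothetical data; all pairwise distances differ


def sim(x, y):
    """Similarity as negative distance (an assumption of this sketch)."""
    return -abs(x - y)


def set_partitions(items):
    """Yield every partition of `items` as a list of clusters (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part  # `first` forms its own cluster
        for i in range(len(part)):  # or joins an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def cluster_comb_sim(cluster):
    """Smallest over bipartitions of the largest cross-part similarity."""
    if len(cluster) == 1:
        return 0  # self-similarity, the largest possible value here
    head, rest = cluster[0], cluster[1:]
    values = []
    for r in range(len(rest)):  # keep at least one element on the other side
        for extra in combinations(rest, r):
            left = [head, *extra]
            right = [x for x in rest if x not in extra]
            values.append(max(sim(a, b) for a in left for b in right))
    return min(values)


def comb_sim(clustering):
    """Combination similarity of a clustering: the minimum over its clusters."""
    return min(cluster_comb_sim(c) for c in clustering)


def single_link_hac(points, k):
    """Naive single-link HAC: merge the two most similar clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: max(sim(x, y) for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters


def is_optimal(points, k):
    """Check the optimality condition against every clustering with at most k clusters."""
    target = comb_sim(single_link_hac(points, k))
    return all(
        comb_sim(p) <= target for p in set_partitions(points) if len(p) <= k
    )


for k in range(len(POINTS), 0, -1):
    print(k, comb_sim(single_link_hac(POINTS, k)), is_optimal(POINTS, k))
```

Enumerating all partitions is only feasible for a handful of points; the sketch is meant to make the optimality condition concrete, not to serve as a practical test.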

In contrast to single-link clustering, complete-link clustering and GAAC are not optimal as this example shows:

[Figure: four points on a line, with $d_1$ at position 1, $d_2$ at 4, $d_3$ at 5 and $d_4$ at 8.]

Both algorithms merge the two points with distance 1 ($d_2$ and $d_3$) first and thus cannot find the two-cluster clustering $\{ \{ d_1,d_2 \}, \{d_3,d_4\} \}$. But $\{ \{ d_1,d_2 \}, \{d_3,d_4\} \}$ is optimal under the optimality criteria of complete-link clustering and GAAC.
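For the complete-link case, the example can be traced with a few lines of code. This is a sketch under stated assumptions: the positions 1, 4, 5, 8 are read off the figure, similarity is taken to be negative distance, a singleton cluster is given the maximal similarity 0, and ties between equally good merges are broken arbitrarily. Only the complete-link criterion is checked here; GAAC, which requires vector representations, is left out.

```python
from itertools import combinations

# Positions on a line, as read off the figure.
POS = {"d1": 1.0, "d2": 4.0, "d3": 5.0, "d4": 8.0}


def sim(a, b):
    """Similarity as negative distance (an assumption of this sketch)."""
    return -abs(POS[a] - POS[b])


def cluster_comb_sim(cluster):
    """Complete-link: smallest similarity of any two documents in the cluster."""
    if len(cluster) == 1:
        return 0.0  # self-similarity, the largest possible value here
    return min(sim(a, b) for a, b in combinations(sorted(cluster), 2))


def comb_sim(clustering):
    """Combination similarity of a clustering: the minimum over its clusters."""
    return min(cluster_comb_sim(c) for c in clustering)


def link(a, b):
    """Complete-link similarity of two clusters: their least similar pair."""
    return min(sim(x, y) for x in a for y in b)


def complete_link_hac(docs, k):
    """Naive complete-link HAC: merge the best pair of clusters until k remain."""
    clusters = [frozenset([d]) for d in docs]
    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


greedy = complete_link_hac(sorted(POS), 2)
alternative = [frozenset({"d1", "d2"}), frozenset({"d3", "d4"})]

print("HAC result: ", [sorted(c) for c in greedy], comb_sim(greedy))
print("alternative:", [sorted(c) for c in alternative], comb_sim(alternative))
```

Because $d_2$ and $d_3$ are merged first, the two-cluster result of HAC must contain a cluster of diameter 4, whereas the alternative clustering has two clusters of diameter 3 and therefore a higher combination similarity.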

However, the merge criteria of complete-link clustering and GAAC come closer to the desideratum of approximate sphericity than the merge criterion of single-link clustering does. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.

Table 17.1: Comparison of HAC algorithms.

method	combination similarity	time complexity	optimal?	comment
single-link	max inter-similarity of any 2 docs	$\Theta(N^2)$	yes	chaining effect
complete-link	min inter-similarity of any 2 docs	$\Theta(N^2 \log N)$	no	sensitive to outliers
group-average	average of all sims	$\Theta(N^2 \log N)$	no	best choice for most applications
centroid	average inter-similarity	$\Theta(N^2 \log N)$	no	inversions can occur

Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter. We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers or from inversions.

There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.

Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space - and not global structure - that is important in this case.

Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 where single-link clustering is used in the guise of the union-find algorithm. Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away, so single-link clustering is a good choice for duplicate detection.