Single-link and complete-link clustering

In single-link clustering or single-linkage clustering , the similarity of two clusters is the similarity of their most similar members (see Figure 17.3 , (a)). This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters' overall structure are not taken into account.

In complete-link clustering or complete-linkage clustering , the similarity of two clusters is the similarity of their most dissimilar members (see Figure 17.3 , (b)). This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers. A single document far from the center can increase diameters of candidate merge clusters dramatically and completely change the final clustering.

$\includegraphics[width=13.5cm]{rprojectcomplete.eps}$ A dendrogram of a complete-link clustering.The same 30 documents were clustered with single-link clustering in Figure 17.1 .

Figure 17.4 depicts a single-link and a complete-link clustering of eight documents. The first four steps, each producing a cluster consisting of a pair of two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximum-similarity definition of cluster similarity, those two clusters are closest. Complete-link clustering joins the left two pairs (and then the right two pairs) because those are the closest pairs according to the minimum-similarity definition of cluster similarity.

Figure 17.1 is an example of a single-link clustering of a set of documents and Figure 17.5 is the complete-link clustering of the same set. When cutting the last merge in Figure 17.5 , we obtain two clusters of similar size (documents 1-16, from NYSE closing averages to Lloyd's chief / U.S. grilling, and documents 17-30, from Ohio Blue Cross to Clinton signs law). There is no cut of the dendrogram in Figure 17.1 that would give us an equally balanced clustering.

Both single-link and complete-link clustering have graph-theoretic interpretations. Define

to be the combination similarity of the two clusters merged in step

, and

the graph that links all data points with a similarity of at least

. Then the clusters after step

in single-link clustering are the connected components of

and the clusters after step

in complete-link clustering are maximal cliques of

. A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.

These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step

are maximal sets of points that are linked via at least one link (a single link) of similarity $s \geq s_k$ ; complete-link clusters at step

are maximal sets of points that are completely linked with each other via links of similarity $s \geq s_k$ .

$\begin{figure} % latex2html id marker 26560 \par \psset{unit=0.5cm} \par \begin{... ...n single-link clustering can cause undesirable elongated clusters.} \end{figure}$

Single-link and complete-link clustering reduce the assessment of cluster quality to a single similarity between a pair of documents: the two most similar documents in single-link clustering and the two most dissimilar documents in complete-link clustering. A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters. Single-link clustering can produce straggling clusters as shown in Figure 17.6 . Since the merge criterion is strictly local, a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster. This effect is called chaining .

The chaining effect is also apparent in Figure 17.1 . The last eleven merges of the single-link clustering (those above the

line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

$\begin{figure} % latex2html id marker 26574 \par \psset{unit=0.5cm} \par \begin{... ...tering, the outlier $d_1$\ splits $\{d_2,d_3,d_4,d_5\}$\ as shown.} \end{figure}$

However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents

are split because of the outlier

at the left edge (Exercise 17.2.1 ). Complete-link clustering does not find the most intuitive cluster structure in this example.