Single-link and complete-link clustering

In *single-link clustering* or
*single-linkage clustering* ,
the similarity of two
clusters is the similarity of their *most similar*
members
(see Figure 17.3 , (a))^{}.
This single-link merge criterion is *local*. We pay attention
solely to the area where the two clusters come closest
to each other. Other, more distant parts of the cluster and
the clusters' overall structure are not taken into account.

In *complete-link clustering* or
*complete-linkage
clustering* , the similarity of two clusters is the
similarity of their *most dissimilar* members (see
Figure 17.3 , (b)). This is equivalent to
choosing the cluster pair whose merge has the smallest
diameter. This complete-link merge criterion is non-local;
the entire structure of the clustering can influence merge
decisions. This results in a preference for compact clusters with small diameters
over long, straggly clusters, but also causes
sensitivity to outliers. A single document far from the center
can increase diameters of candidate merge clusters
dramatically and completely change the final clustering.

A dendrogram of a
complete-link clustering.The same 30 documents were
clustered with single-link clustering in
Figure 17.1 .

Figure 17.4 depicts a single-link and
a complete-link clustering of eight documents. The first
four steps, each producing a cluster consisting of a pair of two documents, are
identical. Then single-link clustering joins the upper two
pairs (and after that the lower two pairs) because
on the maximum-similarity definition of cluster
similarity,
those two clusters are closest. Complete-link clustering
joins the left two pairs (and then the right two pairs)
because those are the closest pairs according to the
minimum-similarity definition of cluster
similarity.^{}

Figure 17.1 is an example of a single-link clustering of a set of documents and Figure 17.5 is the complete-link clustering of the same set. When cutting the last merge in Figure 17.5 , we obtain two clusters of similar size (documents 1-16, from NYSE closing averages to Lloyd's chief / U.S. grilling, and documents 17-30, from Ohio Blue Cross to Clinton signs law). There is no cut of the dendrogram in Figure 17.1 that would give us an equally balanced clustering.

Both single-link and complete-link clustering have
graph-theoretic interpretations. Define to be the
*combination similarity* of the two clusters
merged in step , and the graph that links all
data points with a similarity of at least . Then the
clusters after step in single-link clustering are the
connected components of
and the clusters after step in complete-link
clustering are maximal cliques of
. A *connected component* is a maximal set of
connected points such that there is a path connecting each pair. A
*clique* is a set of points that are completely linked with
each other.

These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step are maximal sets of points that are linked via at least one link (a single link) of similarity ; complete-link clusters at step are maximal sets of points that are completely linked with each other via links of similarity .

Single-link and complete-link clustering reduce the
assessment of cluster quality to a single similarity between
a pair of documents: the two most similar documents in
single-link clustering and the two most dissimilar documents
in complete-link clustering. A measurement based on one pair
cannot fully reflect the distribution of documents in a
cluster. It is therefore not surprising that both algorithms
often produce undesirable clusters. Single-link clustering can
produce straggling clusters as shown in
Figure 17.6 . Since the merge criterion is strictly
local, a chain of points can be extended for long distances
without regard to the overall shape of the emerging
cluster. This effect is called *chaining* .

The chaining effect is also apparent in Figure 17.1 . The last eleven merges of the single-link clustering (those above the line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents are split because of the outlier at the left edge (Exercise 17.2.1 ). Complete-link clustering does not find the most intuitive cluster structure in this example.

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07