Single-Link, Complete-Link & Average-Link Clustering

There is now an updated and expanded version of this page in form of a book chapter.

Single-Link, Complete-Link & Average-Link Clustering

Hierarchical clustering treats each data point as a singleton cluster, and then successively merges clusters until all points have been merged into a single remaining cluster. A hierarchical clustering is often represented as a dendrogram (from Manning et al. 1999).

In complete-link (or complete linkage) hierarchical clustering, we merge in each step the two clusters whose merger has the smallest diameter (or: the two clusters with the smallest maximum pairwise distance).

In single-link (or single linkage) hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance (or: the two clusters with the smallest minimum pairwise distance).

Complete-link clustering can also be described using the concept of clique. Let dn be the diameter of the cluster created in step n of complete-link clustering. Define graph G(n) as the graph that links all data points with a distance of at most dn. Then the clusters after step n are the cliques of G(n). This motivates the term complete-link clustering.

Single-link clustering can also be described in graph theoretical terms. If dn is the distance of the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most dn, then the clusters after step n are the connected components of G(n). A single-link clustering also closely corresponds to a weighted graph's minimum spanning tree.

Average-link (or group average) clustering (defined below) is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.

Time complexity

Complete-link clustering

The worst case time complexity of complete-link clustering is at most O(n^2 log n). One O(n^2 log n) algorithm is to compute the n^2 distance metric and then sort the distances for each data point (overall time: O(n^2 log n)). After each merge iteration, the distance metric can be updated in O(n). We pick the next pair to merge by finding the smallest distance that is still eligible for merging. If we do this by traversing the n sorted lists of distances, then, by the end of clustering, we will have done n^2 traversal steps. Adding all this up gives you O(n^2 log n). See also Aichholzer et al. 1996 and Day et al. 1984. (My intuition is that complete link clustering is easier than sorting a set of n^2 numbers, so there should be a more efficient algorithm. Let me know if you know of one!)

Single-link clustering

The time complexity of single-link clustering is O(n^2). We first compute all distances in O(n^2). While doing this we also find the smallest distance for each data point and keep them in a next-best-merge array. In each of the n-1 merging steps we then find the smallest distance in the next-best-merge array. We merge the two identified clusters, and update the distance matrix in O(n). Finally, we update the next-best-merge array in O(n) in each step. We can do the latter in O(n) because if the best merge partner for k before merging i and j was either i or j, then after merging i and j the best merge partner for k is the merger of i and j.

Complete-link clustering is harder than single-link clustering because the last sentence does not hold for complete-link clustering: in complete-link clustering, if the best merge partner for k before merging i and j was either i or j, then after merging i and j the best merge partner for k can be a cluster different from the merger of i and j. The reason for this difference between single-link and complete-link is that distance defined as the distance of the two closest members is a local property that is not affected by merging; distance defined as the diameter of a cluster is a non-local property that can change during merging.

Average-link clustering

Average-link clustering merges in each iteration the pair of clusters with the highest cohesion. If our data points are represented as normalized vectors in a Euclidean space, we can define the cohesion G of a cluster C as the average dot product:
   G(C) = 1/[n(n-1)] (gamma(C)-n)
where
   n = !C!,
   gamma(C) = sum(v in C) sum(w in C) <v,w>
and
   <v,w> is the dot product of v and w.

If we maintain an array of cluster centroids p(C), where p(C) = [sum(v in C) v], for the currently active clusters, then gamma of the merger of C1 and C2, and hence its cohesion, can be computed recursively as
    gamma(C1+C2) = sum(v in C1+C2)sum(w in C1+C2)<v,w>
                               = <sum(v in C1+C2)v,sum(w in C1+C2)w>
                               = <p(C1+C2),p(C1+C2)>

Based on this recursive computation of cohesion, the time complexity of average-link clustering is O(n^2 log n). We first compute all n^2 similarities for the singleton clusters, and sort them for each cluster (time: O(n^2 log n)). In each of O(n) merge iterations, we identify the pair of clusters with the highest cohesion in O(n); merge the pair; and update cluster centroids, gammas and cohesions of the O(n) possible mergers of the just created cluster with the remaining clusters. For each cluster, we also update the sorted list of merge candidates by deleting the two just merged clusters and inserting its cohesion with the just created cluster. Each iteration thus takes O(n log n). Overall time complexity is then O(n^2 log n).

Note that computing cluster centroids, gammas and cohesions will take time linear in the number of unique terms T in the document collection in the worst case. In contrast, complete-link clustering is bounded by the maximum document length D_max (if we compute similarity using the cosine for data points represented as vectors). So complete-link clustering is O(n^2 log n D_max) whereas average-link clustering is O(n^2 log n T). This means that we usually need to employ some form of dimensionality reduction for efficient average-link clustering (Schütze and Silverstein, 1997).

Also, there are only O(n^2) pairwise similarities in complete-link clustering since the similarity between two clusters is ultimately the similarity between two vectors. In average-link clustering, every subset of vectors can have a different cohesion, so we cannot precompute all possible cluster-cluster similarities.

Implementation

This python program implements three complete-link clustering algorithms: the naive cubic algorithm, Murtagh's algorithm, and the O(n^2 log n) algorithm described above. It generates data that indicate the following:

All three algorithms generate the same clustering (and therefore are correct).
Run: python clink.py compare random
Murtagh's algorithm runs in close to quadratic time for random data.
Run: python clink.py timing random
Data from one run in worksheet "randomquad"
Murtagh's algorithm runs in cubic time for hard data.
Run: python clink.py timing hard
Data from one run in worksheets "hardquad" and "hardcube"

References

Oswin Aichholzer and Franz Aurenhammer. 1996. Classifying Hyperplanes in Hypercubes. SIAM Journal on Discrete Mathematics. Volume 9, Number 2, pp. 225-232.
Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. 1992. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
William H. Day and Herbert Edelsbrunner. 1984. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification. Volume 1, pp. 1-24.
D. Defays. 1977. An Efficient Algorithm for a Complete Link Method. The Computer Journal. Volume 20, Number 4, pp. 364-366.
A. K. Jain, M. N. Murty, P. J. Flynn. 1999. Data Clustering: A Review. ACM Computing Surveys (CSUR), Volume 31, Issue 3, ACM Press, New York, pp. 264-323. A bound of O(n^2 log n) is given for the time complexity of single-link clustering (Table 1, page 293, web-accessible pdf), but the above reasoning suggests that O(n^2) is a tighter bound.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Fionn Murtagh. 1983. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal. Volume 26, Number 4, pp. 354-359.
S.D. Lang, L.-J. Mao, and W.-L. Hsu. 1999. Probabilistic Analysis of the RNN-CLINK Clustering Algorithm. In: Belur V. Dasarathy. Data Mining and Knowledge Discovery: Theory, Tools, and Technology. SPIE Proceedings Vol. 3695.
Hinrich Schütze and Craig Silverstein. 1997. Projections for Efficient Document Clustering. SIGIR, pp. 74-81.

Acknowledgments. I would like to thank Mark Craven, Ido Dagan, Chris Manning, Ray Mooney, Fionn Murtagh, Jan Pedersen, Edie Rasmussen, Yiming Yang, and Jian Zhang for discussions and contributions to the above.