Hierarchical clustering treats each data point as a singleton cluster and then successively merges clusters until all points have been merged into a single remaining cluster. A hierarchical clustering is often represented as a dendrogram (Manning and Schütze 1999).

In complete-link (or complete linkage) hierarchical clustering, we merge
in each step the two clusters whose merger has the
smallest diameter (or: the two clusters with the
smallest **maximum** pairwise distance).

In single-link (or single linkage) hierarchical clustering, we merge
in each step the two clusters whose two closest
members have the smallest distance
(or: the two clusters with the
smallest **minimum** pairwise distance).
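The two criteria can be sketched with a naive agglomerative procedure (an illustrative implementation for one-dimensional points, not the clink.py script used below):

```python
from itertools import combinations

def linkage_distance(c1, c2, dist, kind):
    """Distance between clusters c1 and c2 under the given linkage."""
    pairs = [dist[min(a, b), max(a, b)] for a in c1 for b in c2]
    # complete-link: largest pairwise distance; single-link: smallest
    return max(pairs) if kind == "complete" else min(pairs)

def agglomerate(points, kind="single"):
    """Naive hierarchical clustering of 1-D points; returns merge sequence."""
    dist = {(i, j): abs(points[i] - points[j])
            for i, j in combinations(range(len(points)), 2)}
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # merge the pair of clusters with the smallest linkage distance
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda p: linkage_distance(p[0], p[1], dist, kind))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((set(c1), set(c2)))
    return merges
```

On the points [0.0, 0.5, 1.5, 2.9], the two criteria already diverge at the second merge: single-link attaches point 2 to {0, 1}, whereas complete-link merges {2} with {3}.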

Complete-link clustering can also be described using the concept of a clique. Let d_n be the diameter of the cluster created in step n of complete-link clustering, and define the graph G(n) as the graph that links all data points with a distance of at most d_n. Then the clusters after step n are the cliques of G(n). This motivates the term complete-link clustering.

Single-link clustering can also be described in graph-theoretic terms. If d_n is the distance of the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most d_n, then the clusters after step n are the connected components of G(n). A single-link clustering also closely corresponds to a minimum spanning tree of the weighted graph of pairwise distances.
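This equivalence can be checked with a small sketch (illustrative code, assuming one-dimensional points): after each single-link merge at distance d_n, the clusters coincide with the connected components of the threshold graph G(n).

```python
from itertools import combinations

def single_link_steps(points):
    """Naive single-link clustering of 1-D points.
    Returns a list of (d_n, clusters) after each merge step."""
    n = len(points)
    dist = {(i, j): abs(points[i] - points[j])
            for i, j in combinations(range(n), 2)}
    sl = lambda c1, c2: min(dist[min(a, b), max(a, b)]
                            for a in c1 for b in c2)
    clusters = [frozenset([i]) for i in range(n)]
    steps = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: sl(p[0], p[1]))
        d_n = sl(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        steps.append((d_n, set(map(frozenset, clusters))))
    return steps

def threshold_components(points, d_n):
    """Connected components of G(n): link points at distance <= d_n."""
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= d_n:
            parent[find(i)] = find(j)
    comps = {}
    for i in range(len(points)):
        comps.setdefault(find(i), set()).add(i)
    return {frozenset(c) for c in comps.values()}

pts = [0.0, 0.4, 1.0, 5.0, 5.3]
for d_n, clusters in single_link_steps(pts):
    assert clusters == threshold_components(pts, d_n)
```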

Average-link (or group average) clustering (defined below) is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.

Complete-link clustering is harder than single-link clustering because the following property of single-link clustering does not carry over: in single-link clustering, if the best merge partner for k before merging i and j was either i or j, then after merging i and j the best merge partner for k is the merger of i and j. In complete-link clustering, by contrast, the best merge partner for k after the merge can be a cluster different from the merger of i and j. The reason for this difference is that distance defined as the distance of the two closest members is a local property that is not affected by merging, whereas distance defined as the diameter of a cluster is a non-local property that can change during merging.
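A concrete one-dimensional example (constructed for illustration) shows the best merge partner switching under complete-link:

```python
# Four 1-D points; distance is absolute difference.
j, i, k, m = 0.0, 0.5, 1.5, 2.9

d = lambda a, b: abs(a - b)

# Before any merge, the best partner for k is i:
assert d(k, i) < min(d(k, j), d(k, m))   # 1.0 < min(1.5, 1.4)

# The globally closest pair is (i, j), so they are merged first:
assert d(i, j) < min(d(i, k), d(k, m))   # 0.5 is the smallest distance

# Single-link distance from k to {i, j} is the closer of the two,
# so the best partner for k is still the merger {i, j}:
assert min(d(k, i), d(k, j)) < d(k, m)   # 1.0 < 1.4

# Complete-link distance from k to {i, j} is the farther of the two,
# so the best partner for k has switched to m:
assert max(d(k, i), d(k, j)) > d(k, m)   # 1.5 > 1.4
```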

We define the cohesion G(C) of a cluster C as

G(C) = (gamma(C) - n) / [n(n-1)]

where

n = |C|,

gamma(C) = sum_{v in C} sum_{w in C} <v,w>,

and <v,w> is the dot product of v and w. Subtracting n removes the n self-similarities <v,v>, each of which equals 1 for length-normalized vectors, so G(C) is the average similarity over the n(n-1) ordered pairs of distinct members.
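A small numeric check of the formula (illustrative code): for length-normalized vectors, G(C) equals the average similarity over ordered pairs of distinct members.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def cohesion(cluster):
    """G(C) = (gamma(C) - n) / (n (n - 1)) for length-normalized vectors."""
    n = len(cluster)
    gamma = sum(dot(v, w) for v in cluster for w in cluster)
    return (gamma - n) / (n * (n - 1))

C = [normalize(v) for v in ([1.0, 2.0], [2.0, 1.0], [1.0, 1.0])]

# Direct average over the n(n-1) ordered pairs of distinct members:
n = len(C)
direct = sum(dot(v, w) for a, v in enumerate(C)
             for b, w in enumerate(C) if a != b) / (n * (n - 1))
assert abs(cohesion(C) - direct) < 1e-12
```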

If we maintain an array of cluster vector sums p(C) = sum_{v in C} v (unnormalized centroids) for the currently active clusters, then gamma of the merger of C1 and C2, and hence its cohesion, can be computed recursively as

gamma(C1+C2) = sum_{v in C1+C2} sum_{w in C1+C2} <v,w>

= <sum_{v in C1+C2} v, sum_{w in C1+C2} w>

= <p(C1+C2), p(C1+C2)>

= <p(C1)+p(C2), p(C1)+p(C2)>
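The key identity, that gamma of a merger equals the dot product of p(C1)+p(C2) with itself, can be verified numerically (an illustrative sketch):

```python
def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def vec_sum(cluster):
    """p(C): the componentwise sum of the vectors in C."""
    return [sum(xs) for xs in zip(*cluster)]

C1 = [[1.0, 0.0], [0.8, 0.6]]
C2 = [[0.0, 1.0]]

# gamma computed directly over all ordered pairs in the merged cluster:
merged = C1 + C2
gamma_direct = sum(dot(v, w) for v in merged for w in merged)

# gamma computed from the maintained sums p(C1) and p(C2):
p = [a + b for a, b in zip(vec_sum(C1), vec_sum(C2))]
gamma_recursive = dot(p, p)

assert abs(gamma_direct - gamma_recursive) < 1e-12
```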

Based on this recursive computation of cohesion, the time complexity of average-link clustering is O(n^2 log n). We first compute all n^2 similarities for the singleton clusters and sort them for each cluster (time: O(n^2 log n)). In each of O(n) merge iterations, we identify the pair of clusters with the highest cohesion in O(n); merge the pair; and update the cluster centroids, gammas, and cohesions of the O(n) possible mergers of the just-created cluster with the remaining clusters. For each remaining cluster, we also update its sorted list of merge candidates by deleting the two just-merged clusters and inserting its cohesion with the just-created cluster. Each iteration thus takes O(n log n), for an overall time complexity of O(n^2 log n).

Note that computing cluster centroids, gammas and cohesions will take time linear in the number of unique terms T in the document collection in the worst case. In contrast, complete-link clustering is bounded by the maximum document length D_max (if we compute similarity using the cosine for data points represented as vectors). So complete-link clustering is O(n^2 log n D_max) whereas average-link clustering is O(n^2 log n T). This means that we usually need to employ some form of dimensionality reduction for efficient average-link clustering (Schütze and Silverstein, 1997).

Also, there are only O(n^2) pairwise similarities in complete-link clustering since the similarity between two clusters is ultimately the similarity between two vectors. In average-link clustering, every subset of vectors can have a different cohesion, so we cannot precompute all possible cluster-cluster similarities.

- All three algorithms generate the same clustering (and therefore are correct). Run: python clink.py compare random
- Murtagh's algorithm runs in close to quadratic time for random data. Run: python clink.py timing random (data from one run in worksheet "randomquad")
- Murtagh's algorithm runs in cubic time for hard data. Run: python clink.py timing hard (data from one run in worksheets "hardquad" and "hardcube")

- Oswin Aichholzer and Franz Aurenhammer. 1996. Classifying Hyperplanes in Hypercubes. SIAM Journal on Discrete Mathematics. Volume 9, Number 2, pp. 225-232.
- Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. 1992. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
- William H. Day and Herbert Edelsbrunner. 1984. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification. Volume 1, pp. 1-24.
- D. Defays. 1977. An Efficient Algorithm for a Complete Link Method. The Computer Journal. Volume 20, Number 4, pp. 364-366.
- A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data Clustering: A Review. ACM Computing Surveys (CSUR), Volume 31, Issue 3, ACM Press, New York, pp. 264-323. A bound of O(n^2 log n) is given for the time complexity of single-link clustering (Table 1, page 293, web-accessible pdf), but the above reasoning suggests that O(n^2) is a tighter bound.
- S.D. Lang, L.-J. Mao, and W.-L. Hsu. 1999. Probabilistic Analysis of the RNN-CLINK Clustering Algorithm. In: Belur V. Dasarathy (ed.), Data Mining and Knowledge Discovery: Theory, Tools, and Technology. SPIE Proceedings Vol. 3695.
- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
- Fionn Murtagh. 1983. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal. Volume 26, Number 4, pp. 354-359.
- Hinrich Schütze and Craig Silverstein. 1997. Projections for Efficient Document Clustering. SIGIR, pp. 74-81.

**Acknowledgments.** I would like to thank Mark Craven, Ido Dagan, Chris Manning, Ray Mooney, Fionn Murtagh, Jan Pedersen, Edie Rasmussen, Yiming Yang, and Jian Zhang for discussions and contributions to the above.