There is now an updated and expanded version of this page in the form of a book chapter.
Hierarchical clustering treats each data point as a singleton cluster, and then successively merges clusters until all points have been merged into a single remaining cluster. A hierarchical clustering is often represented as a dendrogram (see Manning et al. 1999).
In complete-link (or complete linkage) hierarchical clustering, each step merges the two clusters whose merger has the smallest diameter (or, equivalently, the two clusters with the smallest maximum pairwise distance).
In single-link (or single linkage) hierarchical clustering, each step merges the two clusters whose two closest members have the smallest distance (or, equivalently, the two clusters with the smallest minimum pairwise distance).
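To make the two definitions concrete, here is a minimal sketch of the naive agglomerative procedure in Python; the function name `agglomerative`, the `linkage` parameter, and the brute-force O(n^3) loop are my own illustrative choices, not a prescribed implementation.

```python
from itertools import combinations

def agglomerative(points, dist, linkage="single"):
    """Naive O(n^3) agglomerative clustering sketch.

    points  -- list of data points
    dist    -- distance function on pairs of points
    linkage -- "single" (min pairwise distance) or
               "complete" (max pairwise distance)
    Returns the merge sequence as (members, members, distance).
    """
    agg = min if linkage == "single" else max

    def d(a, b):
        # Cluster-cluster distance under the chosen linkage.
        return agg(dist(points[i], points[j]) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        merges.append((set(a), set(b), d(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Points on a line; complete link merges {0,1} first, then adds {2}.
print(agglomerative([0.0, 1.0, 4.0, 10.0], lambda x, y: abs(x - y), "complete"))
```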
Complete-link clustering can also be described using the concept of a clique. Let d_n be the diameter of the cluster created in step n of complete-link clustering. Define the graph G(n) as the graph that links all pairs of data points whose distance is at most d_n. Then the clusters after step n are the cliques of G(n). This motivates the term complete-link clustering.
Single-link clustering can also be described in graph-theoretic terms. If d_n is the distance of the two clusters merged in step n, and G(n) is the graph that links all pairs of data points whose distance is at most d_n, then the clusters after step n are the connected components of G(n). A single-link clustering also corresponds closely to the minimum spanning tree of the weighted graph on the data points.
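This correspondence suggests computing a single-link clustering directly from the minimum spanning tree: running Kruskal's algorithm and recording each union reproduces the single-link merge sequence. Below is a sketch under that reading, taking a dense distance matrix as input; the helper names are my own.

```python
def single_link_via_mst(dist_matrix):
    """Read a single-link clustering off Kruskal's MST construction.

    dist_matrix -- symmetric n x n matrix of pairwise distances
    Returns the MST edges (i, j, d) in increasing order of d; the k-th
    edge corresponds to the k-th single-link merge.
    """
    n = len(dist_matrix)
    parent = list(range(n))

    def find(x):
        # Union-find representative lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist_matrix[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # edge connects two components: a merge
            parent[ri] = rj
            merges.append((i, j, d))
    return merges

pts = (0.0, 1.0, 4.0, 10.0)
M = [[abs(x - y) for y in pts] for x in pts]
print(single_link_via_mst(M))    # [(0, 1, 1.0), (1, 2, 3.0), (2, 3, 6.0)]
```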
Average-link (or group average) clustering (defined below) is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
Complete-link clustering is harder than single-link clustering because of a property that single-link clustering has and complete-link clustering lacks: in single-link clustering, if the best merge partner for cluster k before merging i and j was either i or j, then the best merge partner for k after the merge is the merger of i and j. In complete-link clustering, by contrast, after merging i and j the best merge partner for k can be a cluster different from the merger of i and j. The reason for this difference is that distance defined as the distance of the two closest members is a local property that is not affected by merging, whereas distance defined as the diameter of a cluster is a non-local property that can change during merging.
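A small numeric example (my own, with points on the real line) illustrates the difference:

```python
# Points on a line: k=0.0, i=1.0, m=2.5, j=4.0.
k, i, m, j = 0.0, 1.0, 2.5, 4.0
d = lambda x, y: abs(x - y)

# Before {i} and {j} merge, i is k's best merge partner (distance 1.0).
print(d(k, i), d(k, m), d(k, j))                 # 1.0 2.5 4.0

# After {i, j} merge: under complete link the merger's distance to k is
# max(1.0, 4.0) = 4.0, so m (at 2.5) overtakes it as k's best partner.
# Under single link it stays min(1.0, 4.0) = 1.0 and remains the best.
print(max(d(k, i), d(k, j)) > d(k, m))           # True
print(min(d(k, i), d(k, j)) < d(k, m))           # True
```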
If we maintain an array of cluster centroids $p(C)$, where $p(C) = \sum_{v \in C} v$, for the currently active clusters, then $\gamma$ of the merger of $C_1$ and $C_2$, and hence its cohesion, can be computed recursively as

$$\gamma(C_1 \cup C_2) = \sum_{v \in C_1 \cup C_2} \, \sum_{w \in C_1 \cup C_2} \langle v, w \rangle = \Big\langle \sum_{v \in C_1 \cup C_2} v, \; \sum_{w \in C_1 \cup C_2} w \Big\rangle = \langle p(C_1 \cup C_2), p(C_1 \cup C_2) \rangle,$$

where $p(C_1 \cup C_2) = p(C_1) + p(C_2)$ is obtained directly from the stored centroids of $C_1$ and $C_2$.
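A minimal numpy sketch of this bookkeeping, with variable names of my own: the vector sums p(C) are stored per cluster, and the gamma of a merger reduces to a single dot product.

```python
import numpy as np

# Toy data: five points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

# p[C] holds the vector sum of each active cluster.
p = {frozenset([i]): X[i] for i in range(len(X))}

def gamma(c1, c2):
    # gamma(C1 u C2) = <p(C1) + p(C2), p(C1) + p(C2)>
    s = p[c1] + p[c2]
    return float(s @ s)

def merge(c1, c2):
    # Recursive centroid update: p(C1 u C2) = p(C1) + p(C2).
    merged = c1 | c2
    p[merged] = p.pop(c1) + p.pop(c2)
    return merged

c = merge(frozenset([3]), frozenset([4]))
print(gamma(c, frozenset([0])))  # 221.0: sum of all <v, w> over {0, 3, 4}
```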
Based on this recursive computation of cohesion, the time complexity of average-link clustering is O(n^2 log n). We first compute all n^2 similarities for the singleton clusters and sort them for each cluster (time: O(n^2 log n)). In each of the O(n) merge iterations, we identify the pair of clusters with the highest cohesion in O(n); merge the pair; and update the cluster centroids, gammas, and cohesions of the O(n) possible mergers of the just-created cluster with the remaining clusters. For each remaining cluster, we also update its sorted list of merge candidates by deleting the two just-merged clusters and inserting its cohesion with the just-created cluster. Each iteration thus takes O(n log n), for an overall time complexity of O(n^2 log n).
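The following sketch follows this outline but simplifies the bookkeeping: it maintains the centroid sums and computes cohesions recursively as above, yet replaces the per-cluster sorted lists with a single lazily-invalidated max-heap (the heap holds O(n^2) entries, so the overall flavor is still O(n^2 log n)). Since the exact definition of cohesion is not reproduced above, the normalization gamma(C)/|C|^2 and all names are assumptions of mine.

```python
import heapq
from itertools import combinations
import numpy as np

def average_link(X):
    """Average-link clustering with centroid sums and a lazy max-heap.

    Cohesion of a merger is taken as gamma(C) / |C|^2, the average
    similarity over ordered pairs (an assumed normalization).
    Returns the merge sequence as (members, members) pairs.
    """
    X = np.asarray(X, dtype=float)
    # id -> (vector sum p(C), member set, |C|)
    active = {i: (X[i].copy(), frozenset([i]), 1) for i in range(len(X))}

    def cohesion(a, b):
        s = active[a][0] + active[b][0]          # p(C1) + p(C2)
        m = active[a][2] + active[b][2]
        return float(s @ s) / (m * m)            # gamma / |C|^2

    heap = [(-cohesion(a, b), a, b) for a, b in combinations(active, 2)]
    heapq.heapify(heap)
    next_id, merges = len(X), []
    while len(active) > 1:
        _, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue                             # stale entry: skip lazily
        pa, sa, ma = active.pop(a)
        pb, sb, mb = active.pop(b)
        merges.append((set(sa), set(sb)))
        active[next_id] = (pa + pb, sa | sb, ma + mb)  # recursive update
        for c in active:                         # new candidate mergers
            if c != next_id:
                heapq.heappush(heap, (-cohesion(next_id, c), next_id, c))
        next_id += 1
    return merges

# Unit-normalized points, so inner products are cosine similarities.
pts = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.2, 0.98]])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
for a, b in average_link(pts):
    print(sorted(a), "+", sorted(b))   # {0}+{1}, {2}+{3}, {0,1}+{2,3}
```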
Note that computing cluster centroids, gammas, and cohesions takes time linear in the number of unique terms T in the document collection in the worst case, since the vector sum of a cluster of many documents can have up to T nonzero entries. In contrast, each similarity computation in complete-link clustering is bounded by the maximum document length D_max (if we compute similarity using the cosine for data points represented as vectors). So complete-link clustering is O(n^2 log n D_max) whereas average-link clustering is O(n^2 log n T). This means that we usually need to employ some form of dimensionality reduction for efficient average-link clustering (Schütze and Silverstein, 1997).
Also, there are only O(n^2) distinct pairwise similarities in complete-link clustering, since the similarity of two clusters is always the similarity of two of their member vectors. In average-link clustering, every subset of vectors can have a different cohesion, so we cannot precompute all possible cluster-cluster similarities.
Acknowledgments. I would like to thank Mark Craven, Ido Dagan, Chris Manning, Ray Mooney, Fionn Murtagh, Jan Pedersen, Edie Rasmussen, Yiming Yang, and Jian Zhang for discussions and contributions to the above.