The complexity of the naive HAC algorithm in Figure 17.2 is because we exhaustively scan the matrix for the largest similarity in each of iterations.
For the four HAC methods discussed in this chapter a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8 . Its time complexity is . The rows of the similarity matrix are sorted in decreasing order of similarity in the priority queues . then returns the cluster in that currently has the highest similarity with , where we use to denote the cluster as in Chapter 16 . After creating the merged cluster of and , is used as its representative. The function SIM computes the similarity function for potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3 ), and centroid similarity for centroid clustering (Section 17.4 ). We give an example of how a row of is processed (Figure 17.8 , bottom panel). The loop in lines 1-7 is and the loop in lines 9-21 is for an implementation of priority queues that supports deletion and insertion in . The overall complexity of the algorithm is therefore . In the definition of the function SIM, and are the vector sums of and , respectively, and and are the number of documents in and , respectively.
The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering ( and 17.4 ) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.
For single-link, we can introduce a next-best-merge array (NBM) as a further optimization as shown in Figure 17.9 . NBM keeps track of what the best merge is for each cluster. Each of the two top level for-loops in Figure 17.9 are , thus the overall complexity of single-link clustering is .
Can we also speed up the other three HAC algorithms with an NBM array? We cannot because only single-link clustering is best-merge persistent . Suppose that the best merge cluster for is in single-link clustering. Then after merging with a third cluster , the merge of and will be 's best merge cluster (Exercise 17.10 ). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that can be updated in in each iteration - by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining clusters.
Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging 's best merge candidate with cluster , an unrelated cluster becomes the best merge candidate for . This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.
In practice, the efficiency penalty of the algorithm is small compared with the single-link algorithm since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting. All four HAC algorithms in this chapter are with respect to similarity computations. So the difference in complexity is rarely a concern in practice when choosing one of the algorithms.