The complexity of the naive HAC algorithm in Figure 17.2 is because we exhaustively scan the matrix for the largest similarity in each of iterations.

For the four HAC methods discussed in this chapter a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8 . Its time complexity is . The rows of the similarity matrix are sorted in decreasing order of similarity in the priority queues . then returns the cluster in that currently has the highest similarity with , where we use to denote the cluster as in Chapter 16 . After creating the merged cluster of and , is used as its representative. The function SIM computes the similarity function for potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3 ), and centroid similarity for centroid clustering (Section 17.4 ). We give an example of how a row of is processed (Figure 17.8 , bottom panel). The loop in lines 1-7 is and the loop in lines 9-21 is for an implementation of priority queues that supports deletion and insertion in . The overall complexity of the algorithm is therefore . In the definition of the function SIM, and are the vector sums of and , respectively, and and are the number of documents in and , respectively.

The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering ( and 17.4 ) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.

For single-link, we can introduce a next-best-merge array (NBM) as a further optimization as shown in Figure 17.9 . NBM keeps track of what the best merge is for each cluster. Each of the two top level for-loops in Figure 17.9 are , thus the overall complexity of single-link clustering is .

Can we also
speed up the other three HAC algorithms
with an NBM array?
We cannot because only single-link clustering is
*best-merge persistent* . Suppose that
the best merge cluster for is in
single-link clustering. Then after merging
with a third cluster
, the merge of and will
be 's best merge cluster
(Exercise 17.10 ). In other words,
the best-merge candidate for the merged cluster is
one of the two best-merge candidates of its components in
single-link clustering. This
means that
can be updated in in each iteration - by taking a
simple max of two values
on line 14 in
Figure 17.9 for each of the remaining clusters.

Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging 's best merge candidate with cluster , an unrelated cluster becomes the best merge candidate for . This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.

In practice, the efficiency penalty of the algorithm is small compared with the single-link algorithm since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting. All four HAC algorithms in this chapter are with respect to similarity computations. So the difference in complexity is rarely a concern in practice when choosing one of the algorithms.

**Exercises.**

- Show that complete-link clustering creates the two-cluster
clustering depicted in
Figure 17.7 .

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07