Group-average agglomerative clustering

(203) |

The motivation for GAAC is that our goal in selecting two clusters and as the next merge in HAC is that the resulting merge cluster should be coherent. To judge the coherence of , we need to look at all document-document similarities within , including those that occur within and those that occur within .

We can compute
the measure SIM-GA
efficiently because the
sum of individual vector similarities is equal to the
similarities of their sums:

With gatrick, we have:

The term on the right is the sum of self-similarities of value . With this trick we can compute cluster similarity in constant time (assuming we have available the two vector sums and instead of in . This is important because we need to be able to compute the function SIM on lines 18 and 20 in EFFICIENTHAC (Figure 17.8 ) in constant time for efficient implementations of GAAC. Note that for two singleton clusters, Equation 205 is equivalent to the dot product.

Equation 204 relies on the distributivity of the dot product with respect to vector addition. Since this is crucial for the efficient computation of a GAAC clustering, the method cannot be easily applied to representations of documents that are not real-valued vectors. Also, Equation 204 only holds for the dot product. While many algorithms introduced in this book have near-equivalent descriptions in terms of dot product, cosine similarity and Euclidean distance (cf. simdisfigs), Equation 204 can only be expressed using the dot product. This is a fundamental difference between single-link/complete-link clustering and GAAC. The first two only require a square matrix of similarities as input and do not care how these similarities were computed.

To summarize, GAAC requires (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the measure of similarity between vectors and sums of vectors.

The merge algorithms for GAAC and complete-link clustering are the same except that we use Equation 205 as similarity function in Figure 17.8 . Therefore, the overall time complexity of GAAC is the same as for complete-link clustering: . Like complete-link clustering, GAAC is not best-merge persistent (Exercise 17.10 ). This means that there is no algorithm for GAAC that would be analogous to the algorithm for single-link in Figure 17.9 .

We can also define group-average similarity
as including self-similarities:

where the centroid is defined as in Equation 139 (page 139 ). This definition is equivalent to the intuitive definition of cluster quality as average similarity of documents to the cluster's centroid .

Self-similarities are always equal to 1.0, the maximum possible value for length-normalized vectors. The proportion of self-similarities in Equation 206 is for a cluster of size . This gives an unfair advantage to small clusters since they will have proportionally more self-similarities. For two documents , with a similarity , we have . In contrast, . This similarity of two documents is the same as in single-link, complete-link and centroid clustering. We prefer the definition in Equation 205, which excludes self-similarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want a consistent similarity value for document pairs in all four HAC algorithms.

**Exercises.**

- Apply group-average clustering to the points in
and 17.7 . Map them onto the surface of the unit sphere in
a three-dimensional space to get length-normalized vectors.
Is the group-average clustering different from the single-link and
complete-link clusterings?

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07