next up previous contents index
Next: Centroid clustering Up: Hierarchical clustering Previous: Time complexity of HAC   Contents   Index

Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3 , (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering . GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:
\mbox{{\sc sim-ga}}(\omega_i,\omega_j) =
...eq d_\mthatwask}
\vec{d}_\mthatwask \cdot \vec{d}_\nthatwasell
\end{displaymath} (203)

where $\vec{d}$ is the length-normalized vector of document $d$, $\cdot$ denotes the dot product, and $N_i$ and $N_j$ are the number of documents in $\omega_i$ and $\omega_j$, respectively.

The motivation for GAAC is that our goal in selecting two clusters $\omega_i$ and $\omega_j$ as the next merge in HAC is that the resulting merge cluster $\omega_k = \omega_i
\cup \omega_j$ should be coherent. To judge the coherence of $\omega_k$, we need to look at all document-document similarities within $\omega_k$, including those that occur within $\omega_i$ and those that occur within $\omega_j$.

We can compute the measure SIM-GA efficiently because the sum of individual vector similarities is equal to the similarities of their sums:

$\displaystyle \sum_{d_\mthatwask \in \omega_i} \sum_{d_\nthatwasell \in \omega_...
...{d}_\mthatwask) \cdot
(\sum_{d_\nthatwasell \in \omega_j} \vec{d}_\nthatwasell)$     (204)

With gatrick, we have:
$\displaystyle \mbox{{\sc sim-ga}}(\omega_i,\omega_j)$ $\textstyle =$ $\displaystyle \frac{1}{(N_i+N_j)(N_i+N_j-1)}[ (\sum_{d_\mthatwask \in
\omega_i \cup \omega_j} \vec{d}_\mthatwask)^2
- (N_i+N_j)]$ (205)

The term $(N_i+N_j)$ on the right is the sum of $N_i+N_j$ self-similarities of value $1.0$. With this trick we can compute cluster similarity in constant time (assuming we have available the two vector sums $\sum_{d_\mthatwask \in
\omega_i} \vec{d}_\mthatwask$ and $\sum_{d_\mthatwask \in
\omega_j} \vec{d}_\mthatwask)$ instead of in $\Theta(N_iN_j)$. This is important because we need to be able to compute the function SIM on lines 18 and 20 in EFFICIENTHAC (Figure 17.8 ) in constant time for efficient implementations of GAAC. Note that for two singleton clusters, Equation 205 is equivalent to the dot product.

Equation 204 relies on the distributivity of the dot product with respect to vector addition. Since this is crucial for the efficient computation of a GAAC clustering, the method cannot be easily applied to representations of documents that are not real-valued vectors. Also, Equation 204 only holds for the dot product. While many algorithms introduced in this book have near-equivalent descriptions in terms of dot product, cosine similarity and Euclidean distance (cf. simdisfigs), Equation 204 can only be expressed using the dot product. This is a fundamental difference between single-link/complete-link clustering and GAAC. The first two only require a square matrix of similarities as input and do not care how these similarities were computed.

To summarize, GAAC requires (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the measure of similarity between vectors and sums of vectors.

The merge algorithms for GAAC and complete-link clustering are the same except that we use Equation 205 as similarity function in Figure 17.8 . Therefore, the overall time complexity of GAAC is the same as for complete-link clustering: $
\Theta(N^2 \log N)$. Like complete-link clustering, GAAC is not best-merge persistent (Exercise 17.10 ). This means that there is no $\Theta(N^2)$ algorithm for GAAC that would be analogous to the $\Theta(N^2)$ algorithm for single-link in Figure 17.9 .

We can also define group-average similarity as including self-similarities:

$\displaystyle \mbox{{\sc sim-ga}}^{\prime}(\omega_i,\omega_j) = \frac{1}{(N_i \...
..._j} \!\!\![\vec{d}_\mthatwask \cdot \vec{\mu} ( \omega_i \! \cup \! \omega_j )]$     (206)

where the centroid $\vec{\mu}(\omega)$ is defined as in Equation 139 (page 139 ). This definition is equivalent to the intuitive definition of cluster quality as average similarity of documents $\vec{d}_\mthatwask$ to the cluster's centroid $\vec{\mu}$.

Self-similarities are always equal to 1.0, the maximum possible value for length-normalized vectors. The proportion of self-similarities in Equation 206 is $i/i^2 = 1/i$ for a cluster of size $i$. This gives an unfair advantage to small clusters since they will have proportionally more self-similarities. For two documents $d_1$, $d_2$ with a similarity $s$, we have $\mbox{{\sc sim-ga}}^{\prime}(d_1,d_2) =
(1+s)/2$. In contrast, $\mbox{{\sc sim-ga}}(d_1,d_2)= s \leq (1+s)/2$. This similarity $\mbox{{\sc sim-ga}}(d_1,d_2)$ of two documents is the same as in single-link, complete-link and centroid clustering. We prefer the definition in Equation 205, which excludes self-similarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want a consistent similarity value $s$ for document pairs in all four HAC algorithms.


next up previous contents index
Next: Centroid clustering Up: Hierarchical clustering Previous: Time complexity of HAC   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.