$$\text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)(N_i+N_j-1)} \sum_{d_m \in \omega_i \cup \omega_j} \; \sum_{\substack{d_n \in \omega_i \cup \omega_j \\ d_n \neq d_m}} \vec{d}_m \cdot \vec{d}_n \qquad (203)$$

where $\vec{d}$ is the length-normalized vector of document $d$, $\cdot$ denotes the dot product, and $N_i$ and $N_j$ are the numbers of documents in $\omega_i$ and $\omega_j$, respectively.
The motivation for GAAC is that our goal in selecting two clusters $\omega_i$ and $\omega_j$ as the next merge in HAC is that the resulting merged cluster $\omega_k = \omega_i \cup \omega_j$ should be coherent. To judge the coherence of $\omega_k$, we need to look at all document-document similarities within $\omega_k$, including those that occur within $\omega_i$ and those that occur within $\omega_j$.
We can compute the measure SIM-GA efficiently because the sum of individual vector similarities is equal to the similarity of their sums:

$$\sum_{d_m \in \omega_i} \sum_{d_n \in \omega_j} (\vec{d}_m \cdot \vec{d}_n) = \Big(\sum_{d_m \in \omega_i} \vec{d}_m\Big) \cdot \Big(\sum_{d_n \in \omega_j} \vec{d}_n\Big) \qquad (204)$$

With Equation 204, we have:

$$\text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)(N_i+N_j-1)} \Big[\Big(\sum_{d_m \in \omega_i \cup \omega_j} \vec{d}_m\Big)^2 - (N_i+N_j)\Big] \qquad (205)$$

The term $(N_i+N_j)$ is the sum of the $N_i+N_j$ self-similarities, each of value 1.0. With this trick we can compute cluster similarity in constant time (assuming the two vector sums are available) instead of in $\Theta(N_i N_j)$.

Equation 204 relies on the distributivity of the dot product with respect to vector addition. Since this is crucial for the efficient computation of a GAAC clustering, the method cannot easily be applied to representations of documents that are not real-valued vectors. Also, Equation 204 only holds for the dot product. While many algorithms introduced in this book have near-equivalent descriptions in terms of dot product, cosine similarity, and Euclidean distance (cf. Section 14.1), Equation 204 can only be expressed using the dot product. This is a fundamental difference between single-link/complete-link clustering and GAAC: the first two only require a square matrix of similarities as input and do not care how these similarities were computed.
To summarize, GAAC requires (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the measure of similarity between vectors and sums of vectors.
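To see the trick of Equation 205 in action, here is a minimal sketch in Python with NumPy (the function names are our own illustrations, not from the text). It computes SIM-GA both by the brute-force double sum of Equation 203 and by the constant-time formula of Equation 205, and checks that the two agree for length-normalized vectors.

```python
import numpy as np

def sim_ga_bruteforce(docs):
    """SIM-GA via Equation 203: average dot product over all ordered
    pairs of distinct documents in the merged cluster."""
    n = len(docs)
    total = sum(docs[i] @ docs[j]
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def sim_ga_fast(vector_sum, n):
    """SIM-GA via Equation 205: needs only the sum of the document
    vectors and the cluster size, so it runs in constant time once
    the vector sums are maintained incrementally."""
    return (vector_sum @ vector_sum - n) / (n * (n - 1))

# Toy example: three length-normalized document vectors.
rng = np.random.default_rng(0)
docs = [v / np.linalg.norm(v) for v in rng.random((3, 5))]

assert np.isclose(sim_ga_bruteforce(docs),
                  sim_ga_fast(sum(docs), len(docs)))
```

The subtraction of $n$ inside `sim_ga_fast` removes exactly the $n$ self-similarities of value 1.0, as described above.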
The merge algorithms for GAAC and complete-link clustering are the same, except that we use Equation 205 as the similarity function in Figure 17.8. Therefore, the overall time complexity of GAAC is the same as for complete-link clustering: $\Theta(N^2 \log N)$.
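For concreteness, here is a naive sketch of GAAC agglomeration that represents each cluster by its vector sum and size, and scores merge candidates with Equation 205. It recomputes all pair scores in every iteration, so it is cubic, not the $\Theta(N^2 \log N)$ priority-queue implementation of Figure 17.8; it is only meant to show that a merge reduces to adding vector sums. The function name and structure are our own, not from the text.

```python
import numpy as np

def gaac_naive(docs, num_clusters):
    """Naive GAAC: each cluster is (vector_sum, size, members).
    Repeatedly merges the pair with the highest SIM-GA (Equation 205).
    O(N^3) overall, for illustration only."""
    clusters = [(d.copy(), 1, [i]) for i, d in enumerate(docs)]
    while len(clusters) > num_clusters:
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = clusters[i][0] + clusters[j][0]  # merged vector sum
                n = clusters[i][1] + clusters[j][1]  # merged size
                sim = (s @ s - n) / (n * (n - 1))    # Equation 205
                if sim > best:
                    best, best_pair = sim, (i, j)
        i, j = best_pair
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1],
                  clusters[i][2] + clusters[j][2])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for _, _, members in clusters]

# Usage: cluster six length-normalized vectors into two groups.
rng = np.random.default_rng(1)
docs = [v / np.linalg.norm(v) for v in rng.random((6, 4))]
print(gaac_naive(docs, 2))
```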
Like complete-link clustering, GAAC is not best-merge persistent (Exercise 17.10). This means that there is no $\Theta(N^2)$ algorithm for GAAC that would be analogous to the $\Theta(N^2)$ algorithm for single-link in Figure 17.9.
We can also define group-average similarity as including self-similarities:

$$\text{SIM-GA}'(\omega_i, \omega_j) = \frac{1}{(N_i+N_j)^2} \Big(\sum_{d_m \in \omega_i \cup \omega_j} \vec{d}_m\Big)^2 = \frac{1}{N_i+N_j} \sum_{d_m \in \omega_i \cup \omega_j} \big[\vec{d}_m \cdot \vec{\mu}(\omega_i \cup \omega_j)\big] \qquad (206)$$

where the centroid $\vec{\mu}(\omega)$ is the average of the vectors in $\omega$. Self-similarities are always equal to 1.0, the maximum possible value for length-normalized vectors. The proportion of self-similarities in Equation 206 is $i/i^2 = 1/i$ for a cluster of size $i$. This gives an unfair advantage to small clusters, since they will have proportionally more self-similarities.
For two documents $d_1, d_2$ with similarity $s$, we have $\text{SIM-GA}'(d_1, d_2) = (1+s)/2$. In contrast, $\text{SIM-GA}(d_1, d_2) = s \leq (1+s)/2$. This similarity $s$ of two documents is the same as in single-link, complete-link, and centroid clustering. We prefer the definition in Equation 205, which excludes self-similarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want a consistent similarity value $s$ for document pairs in all four HAC algorithms.
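As a concrete instance (our own numbers, not from the text): for $s = 0.5$, we get $\text{SIM-GA}'(d_1, d_2) = (1+0.5)/2 = 0.75$, while $\text{SIM-GA}(d_1, d_2) = 0.5$. The two-document cluster appears substantially more coherent under SIM-GA$'$ simply because half of the terms in its average are self-similarities of value 1.0.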
Exercises.