A single-link clustering can also be computed from the
minimum spanning tree of a graph. The minimum spanning
tree connects the vertices of a graph
at the smallest possible cost, where cost is defined as the sum
over all edges of the graph. In our case the cost of an edge
is the distance between two documents.
Show that if
are the
costs of the edges of a minimum spanning tree, then these
edges correspond to the merges in constructing a
single-link clustering.
Show that
single-link clustering is best-merge persistent and that
GAAC and centroid clustering are not best-merge persistent.
Consider running 2-means clustering on a
collection with documents from two different languages.
What result would you expect?
Would you expect the same
result when running an HAC algorithm?
Download
Reuters-21578. Keep only documents that are in the classes
crude,
interest, and
grain. Discard documents that are members of
more than one of these three classes. Compute a (i)
single-link, (ii) complete-link, (iii) GAAC, (iv) centroid
clustering of the documents. (v) Cut each dendrogram at the
second branch from the top to obtain clusters.
Compute the Rand index for each of the 4 clusterings. Which
clustering method performs best?
Suppose a run of HAC finds
the clustering with to have
the highest value on some prechosen goodness measure of clustering. Have we found the
highest-value clustering among all clusterings with ?
Consider the task of producing a single-link clustering of points on a line:
Show that we only need to compute a total of about
similarities.
What is the overall complexity of
single-link clustering for a set of points on a line?
Prove that
single-link, complete-link, and
group-average clustering are monotonic in the sense defined on page 17.1 .
For points,
there are different flat clusterings into
clusters (Section 16.2 ,
page 16.2.1 ). What is the number of
different hierarchical clusterings (or
dendrograms) of documents?
Are there more flat clusterings or more hierarchical
clusterings for given and ?