Berkhin (2006b) gives a general up-to-date survey of
clustering methods with special attention to scalability.
The classic reference for clustering in pattern recognition,
covering both K-means and EM, is
(Duda et al., 2000). Rasmussen (1992) introduces
clustering from an information retrieval perspective.
Anderberg (1973) provides a general introduction to
clustering for applications. In addition to Euclidean
distance and cosine similarity,
Kullback-Leibler divergence is often used in
clustering as a
measure of how (dis)similar documents and clusters are
(Xu and Croft, 1999; Muresan and Harper, 2004; Kurland and Lee, 2004).
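To make the measure concrete, the following is a minimal sketch of KL divergence as a document-to-cluster dissimilarity (the toy counts and the crude add-epsilon smoothing are our own illustration, not the smoothing schemes used in the cited papers):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in bits between two term distributions.

    p and q are (unnormalized) term-count vectors over the same
    vocabulary; eps is a crude smoothing step to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log2(p / q)))

# Dissimilarity of a document model to a cluster model (toy counts):
doc = [3, 0, 1, 2]       # term counts in one document
cluster = [10, 4, 2, 8]  # aggregated term counts in a cluster
print(kl_divergence(doc, cluster))
```

Note that KL divergence is asymmetric and not a metric, which is why it is described above as a measure of (dis)similarity rather than as a distance.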
The cluster hypothesis is due to Jardine and van Rijsbergen (1971), who state it as follows: "Associations between documents convey information about the relevance of documents to requests." Croft (1978), Can and Ozkarahan (1990), Voorhees (1985a), Salton (1975), Cacheda et al. (2003), Salton (1971a), Singitham et al. (2004), Can et al. (2004) and Altingövde et al. (2008) investigate the efficiency and effectiveness of cluster-based retrieval. While some of these studies show improvements in effectiveness, efficiency, or both, there is no consensus that cluster-based retrieval works well consistently across scenarios. Cluster-based language modeling was pioneered by Liu and Croft (2004).
There is good evidence that clustering of search results
improves user experience and search result quality
(Hearst and Pedersen, 1996; Zamir and Etzioni, 1999; Käki, 2005; Toda and Kataoka, 2005; Tombros et al., 2002), although not as much as search result
structuring based on carefully edited category hierarchies
(Hearst, 2006). The Scatter-Gather interface
for browsing collections was presented by
Cutting et al. (1992). A theoretical framework for
analyzing the properties of Scatter-Gather and other
information seeking user interfaces is presented by
Pirolli (2007). Schütze and Silverstein (1997) evaluate LSI (Chapter 18) and
truncated representations of centroids for efficient K-means
clustering.
The Columbia NewsBlaster system (McKeown et al., 2002), a forerunner to the now much more famous and refined Google News (http://news.google.com), used hierarchical clustering (Chapter 17) to give two levels of news-topic granularity. See Hatzivassiloglou et al. (2000) for details, and Chen and Lin (2000) and Radev et al. (2001) for related systems. Other applications of clustering in information retrieval are duplicate detection (Yang and Callan, 2006; shingling), novelty detection (see the references in the further reading of Chapter 17) and metadata discovery on the semantic web (Alonso et al., 2006).
The discussion of external evaluation measures is partially
based on Strehl (2002).
Dom (2002) proposes a measure that is
better motivated theoretically than NMI: the number
of bits needed to transmit class memberships assuming
cluster memberships are known. The Rand index is due to
Rand (1971). Hubert and Arabie (1985) propose an
adjusted Rand index that ranges between
-1 and 1 and is 0 if there
is only chance agreement between clusters and classes
(similar to the kappa statistic
in Chapter 8). Basu et al. (2004) argue that the
three evaluation measures NMI, Rand index and F measure give
very similar results. Stein et al. (2003) propose
expected edge density as
an internal measure and give evidence that it is a good
predictor of the quality of a clustering.
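As a concrete illustration of the pair-counting idea behind the Rand index discussed above, here is a minimal sketch (the function and the toy labels are our own):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which gold classes and clustering agree:
    either both put the pair in the same group or both separate it."""
    agree = total = 0
    for (c1, k1), (c2, k2) in combinations(zip(labels_true, labels_pred), 2):
        total += 1
        if (c1 == c2) == (k1 == k2):
            agree += 1
    return agree / total

# Toy example: three gold classes; the clustering merges two of them.
# 11 of the 15 pairs agree, so the Rand index is about 0.733.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 1]))
```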
Kleinberg (2002) and Meila (2005) present axiomatic frameworks for comparing clusterings.
Authors that are often credited with the invention of the
K-means algorithm include Lloyd (1982) (first
distributed in 1957), Ball (1965),
MacQueen (1967), and Hartigan and Wong (1979).
Arthur and Vassilvitskii (2006) investigate the worst-case
complexity of
K-means.
Bradley and Fayyad (1998), Pelleg and Moore (1999) and
Davidson and Satyanarayana (2003) investigate the convergence
properties
of
K-means empirically and how they depend on
initial seed selection.
Dhillon and Modha (2001) compare
K-means clusters with SVD-based clusters (Chapter 18). The
K-medoid algorithm was presented by Kaufman and Rousseeuw (1990).
The EM algorithm was originally introduced by Dempster et al. (1977).
An in-depth treatment of EM is (McLachlan and Krishnan, 1996). See
Section 18.5 for publications on latent analysis, which can
also be viewed as soft clustering.
AIC is due to Akaike (1974) (see also
Burnham and Anderson (2002)). An alternative to AIC is BIC, which
can be motivated as a Bayesian model selection procedure
(Schwarz, 1978). Fraley and Raftery (1998) show how to
choose an optimal number of clusters based on BIC. An
application of BIC to K-means is (Pelleg and Moore, 2000).
Hamerly and Elkan (2003) propose an alternative to BIC that
performs better in their experiments. Another influential
Bayesian approach for determining the number of clusters
(simultaneously with cluster assignment) is described by
Cheeseman and Stutz (1996). Two methods for determining
cardinality without external criteria are presented by
Tibshirani et al. (2001).
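As an illustration of AIC-style selection of cardinality, here is a minimal sketch using scikit-learn (an assumed dependency; it uses the simple RSS + 2MK form of AIC for K-means, with M the dimensionality and the residual sum of squares read off the fitted model):

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_by_aic(X, k_max=10):
    """Pick K minimizing AIC(K) = RSS(K) + 2*M*K, a simple AIC variant
    for K-means in which M*K counts the free centroid parameters."""
    M = X.shape[1]
    best_k, best_aic = 1, np.inf
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        aic = km.inertia_ + 2 * M * k  # inertia_ is the residual sum of squares
        if aic < best_aic:
            best_k, best_aic = k, aic
    return best_k
```

With low-dimensional toy data the 2MK penalty is negligible relative to the drop in RSS, so a criterion of this form is mainly informative when M is large, as it is for document vectors over a full vocabulary.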
We only have space here for classical completely unsupervised clustering. An important current topic of research is how to use prior knowledge to guide clustering (e.g., Ji and Xu (2006)) and how to incorporate interactive feedback during clustering (e.g., Huang and Mitchell (2006)). Fayyad et al. (1998) propose an initialization for EM clustering. For algorithms that can cluster very large data sets in one scan through the data see Bradley et al. (1998).
The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch, 1988), contexts of words (e.g., Schütze and Pedersen, 1995) or words and documents simultaneously (e.g., Tishby and Slonim, 2000; Zha et al., 2001; Dhillon, 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering.