Berkhin (2006b) gives a general up-to-date survey of clustering methods with special attention to scalability. The classic reference for clustering in pattern recognition, covering both -means and EM, is (Duda et al., 2000). Rasmussen (1992) introduces clustering from an information retrieval perspective. Anderberg (1973) provides a general introduction to clustering for applications. In addition to Euclidean distance and cosine similarity , Kullback-Leibler divergence is often used in clustering as a measure of how (dis)similar documents and clusters are (Xu and Croft, 1999, Muresan and Harper, 2004, Kurland and Lee, 2004).
The cluster hypothesis is due to Jardine and van Rijsbergen (1971) who state it as follows: Associations between documents convey information about the relevance of documents to requests. Croft (1978), Can and Ozkarahan (1990), Voorhees (1985a), Salton (1975), Cacheda et al. (2003), Salton (1971a), Singitham et al. (2004), Can et al. (2004) and Altingövde et al. (2008) investigate the efficiency and effectiveness of cluster-based retrieval. While some of these studies show improvements in effectiveness, efficiency or both, there is no consensus that cluster-based retrieval works well consistently across scenarios. Cluster-based language modeling was pioneered by Liu and Croft (2004).
There is good evidence that clustering of search results improves user experience and search result quality (Hearst and Pedersen, 1996, Zamir and Etzioni, 1999, Käki, 2005, Toda and Kataoka, 2005, Tombros et al., 2002), although not as much as search result structuring based on carefully edited category hierarchies (Hearst, 2006). The Scatter-Gather interface for browsing collections was presented by Cutting et al. (1992). A theoretical framework for analyzing the properties of Scatter/Gather and other information seeking user interfaces is presented by Pirolli (2007). Schütze and Silverstein (1997) evaluate LSI (Chapter 18 ) and truncated representations of centroids for efficient -means clustering.
The Columbia NewsBlaster system (McKeown et al., 2002), a forerunner to the now much more famous and refined Google News (http://news.google.com), used hierarchical clustering (Chapter 17 ) to give two levels of news topic granularity. See Hatzivassiloglou et al. (2000) for details, and Chen and Lin (2000) and Radev et al. (2001) for related systems. Other applications of clustering in information retrieval are duplicate detection (Yang and Callan (2006), shingling), novelty detection (see references in hclstfurther) and metadata discovery on the semantic web (Alonso et al., 2006).
The discussion of external evaluation measures is partially based on Strehl (2002). Dom (2002) proposes a measure that is better motivated theoretically than NMI. is the number of bits needed to transmit class memberships assuming cluster memberships are known. The Rand index is due to Rand (1971). Hubert and Arabie (1985) propose an adjusted that ranges between and 1 and is 0 if there is only chance agreement between clusters and classes (similar to in Chapter 8 , page 8.2 ). Basu et al. (2004) argue that the three evaluation measures NMI, Rand index and F measure give very similar results. Stein et al. (2003) propose expected edge density as an internal measure and give evidence that it is a good predictor of the quality of a clustering.
Kleinberg (2002) and Meila (2005) present axiomatic frameworks for comparing clusterings.
Authors that are often credited with the invention of the -means algorithm include Lloyd (1982) (first distributed in 1957), Ball (1965), MacQueen (1967), and Hartigan and Wong (1979). Arthur and Vassilvitskii (2006) investigate the worst-case complexity of -means. Bradley and Fayyad (1998), Pelleg and Moore (1999) and Davidson and Satyanarayana (2003) investigate the convergence properties of -means empirically and how it depends on initial seed selection. Dhillon and Modha (2001) compare -means clusters with SVD -based clusters (Chapter 18 ). The K-medoid algorithm was presented by Kaufman and Rousseeuw (1990). The EM algorithm was originally introduced by Dempster et al. (1977). An in-depth treatment of EM is (McLachlan and Krishnan, 1996). See Section 18.5 (page ) for publications on latent analysis, which can also be viewed as soft clustering.
AIC is due to Akaike (1974) (see also Burnham and Anderson (2002)). An alternative to AIC is BIC, which can be motivated as a Bayesian model selection procedure (Schwarz, 1978). Fraley and Raftery (1998) show how to choose an optimal number of clusters based on BIC. An application of BIC to -means is (Pelleg and Moore, 2000). Hamerly and Elkan (2003) propose an alternative to BIC that performs better in their experiments. Another influential Bayesian approach for determining the number of clusters (simultaneously with cluster assignment) is described by Cheeseman and Stutz (1996). Two methods for determining cardinality without external criteria are presented by Tibshirani et al. (2001).
We only have space here for classical completely unsupervised clustering. An important current topic of research is how to use prior knowledge to guide clustering (e.g., Ji and Xu (2006)) and how to incorporate interactive feedback during clustering (e.g., Huang and Mitchell (2006)). Fayyad et al. (1998) propose an initialization for EM clustering. For algorithms that can cluster very large data sets in one scan through the data see Bradley et al. (1998).
The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch, 1988), contexts of words (e.g., Schütze and Pedersen, 1995) or words and documents simultaneously (e.g., Tishby and Slonim, 2000, Zha et al., 2001, Dhillon, 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering .