Chapter 7 develops the computational aspects of
vector space scoring. Luhn (1957; 1958) describes some of
the earliest reported applications of term weighting. His
work dwells on the importance of medium frequency terms
(terms that are neither too commonplace nor too rare) and
may be thought of as anticipating tf-idf and related
weighting schemes. Spärck Jones (1972) builds on this intuition
through detailed experiments showing the use of inverse
document frequency in term weighting. A series of extensions
and theoretical justifications of idf are due to
Salton and Buckley (1987),
Robertson and Jones (1976), Croft and Harper (1979) and
Papineni (2001). Robertson maintains a web page
(`http://www.soi.city.ac.uk/~ser/idf.html`)
containing the history of idf, including soft copies of
early papers that predated electronic versions of journal
articles. Singhal et al. (1996a) develop pivoted document
length normalization. Probabilistic language models
(Chapter 11) develop weighting techniques that are
more nuanced than tf-idf; the reader will find this
development in Section 11.4.3.

We observed that by assigning a weight to each term in a document, the document may be viewed as a vector of term weights, one for each term in the collection. The SMART information retrieval system developed at Cornell by Salton and colleagues (Salton, 1971b) was perhaps the first to view a document as a vector of weights. The basic computation of cosine scores as described in Section 6.3.3 is due to Zobel and Moffat (2006). The two query evaluation strategies, term-at-a-time and document-at-a-time, are discussed by Turtle and Flood (1995).
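As a reminder of what viewing a document as a vector of term weights means in practice, the following is a minimal sketch, not the implementation of SMART or of any cited paper. It assumes raw term frequency multiplied by a base-10 idf as the weight, and stores each vector sparsely as a dictionary; the function names `tfidf_vectors` and `cosine` and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse vector {term: tf * idf}.

    Assumption: weight = raw term frequency * log10(N / df), one simple
    member of the tf-idf family; other schemes would change this line only.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector with no weight is similar to nothing
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical), already tokenized.
docs = [
    "car insurance auto insurance".split(),
    "best car auto".split(),
    "insurance best".split(),
]
vecs = tfidf_vectors(docs)
similarity_0_1 = cosine(vecs[0], vecs[1])
```

Because the cosine divides by both vector lengths, a long document that repeats its terms does not gain an advantage merely from its length; only the direction of the weight vector matters.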

The SMART notation for tf-idf term weighting schemes in Figure 6.15 is presented in (Singhal et al., 1996b; 1995; Salton and Buckley, 1988). Not all versions of the notation are consistent; we most closely follow (Singhal et al., 1996b). A more detailed and exhaustive notation was developed by Moffat and Zobel (1998), who consider a larger palette of schemes for term and document frequency weighting. Beyond the notation, Moffat and Zobel (1998) sought to set up a space of feasible weighting functions in which hill-climbing could start from weighting schemes that performed well and then make local improvements to identify the best combinations. However, they report that such hill-climbing methods failed to lead to any conclusions on the best weighting schemes.

2009-04-07