
## Maximum tf normalization

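One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document $d$, let $\mathrm{tf}_{\max}(d) = \max_{\tau \in d} \mathrm{tf}_{\tau,d}$, where $\tau$ ranges over all terms in $d$. Then, we compute a normalized term frequency for each term $t$ in document $d$ by

$$\mathrm{ntf}_{t,d} = a + (1-a)\,\frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{\max}(d)}, \tag{30}$$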

where $a$ is a value between $0$ and $1$ and is generally set to $0.4$, although some early work used the value $0.5$. The term $a$ in (30) is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in $d$. We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in $\mathrm{ntf}_{t,d}$ from modest changes in $\mathrm{tf}_{t,d}$ (say from 1 to 2).

The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document $d$ and create a new document $d'$ by simply appending a copy of $d$ to itself. While $d'$ should be no more relevant to any query than $d$ is, the use of (23) would assign it twice as high a score as $d$. Replacing $\text{tf-idf}_{t,d}$ in (23) by $\text{ntf-idf}_{t,d}$ eliminates the anomaly in this example: appending the copy doubles both $\mathrm{tf}_{t,d}$ and $\mathrm{tf}_{\max}(d)$, so their ratio, and hence $\mathrm{ntf}_{t,d}$, is unchanged.

Maximum tf normalization does suffer from the following issues:
1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
2. A document may contain an outlier term that occurs unusually often and is not representative of the content of that document.
3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
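
As a rough illustration of (30), the following Python sketch computes normalized term frequencies for a tokenized document and checks two of the points made above: the doubled-document example, and the sensitivity to the stop word list described in the first issue. The function name `max_tf_normalize`, the toy document, and the one-word stop list are assumptions made up for this example; whitespace tokenization is likewise a simplification.

```python
from collections import Counter

def max_tf_normalize(tokens, a=0.4):
    """Return {term: ntf} with ntf = a + (1 - a) * tf / tf_max, following (30).

    `tokens` is a list of terms for one document; `a` is the smoothing
    parameter. The function name and signature are illustrative only.
    """
    tf = Counter(tokens)
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {term: a + (1 - a) * count / tf_max for term, count in tf.items()}

# Toy document, invented for illustration.
d = "the cat sat on the mat the end".split()
d_doubled = d + d  # d' is d with a copy of itself appended

ntf_d = max_tf_normalize(d)
ntf_doubled = max_tf_normalize(d_doubled)

# Raw term frequencies double in d', but the normalized values do not move.
same_weights = all(abs(ntf_d[t] - ntf_doubled[t]) < 1e-12 for t in ntf_d)
print("weights unchanged after doubling:", same_weights)

# Issue 1: removing a stop word changes tf_max and therefore every weight.
stop_words = {"the"}  # hypothetical stop list
ntf_no_stop = max_tf_normalize([t for t in d if t not in stop_words])
print("ntf(cat) with 'the' kept:   ", ntf_d["cat"])
print("ntf(cat) with 'the' removed:", ntf_no_stop["cat"])
```

In this toy document, "the" occurs three times and every other term once, so with $a = 0.4$ the weight of "cat" is about $0.6$ while "the" is present and rises to $1.0$ once "the" is treated as a stop word, even though the document's content terms have not changed.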
