Towards this end, we assign to each term in a document a weight for that term, that depends on the number of occurrences of the term in the document. We would like to compute a score between a query term and a document , based on the weight of in . The simplest approach is to assign the weight to be equal to the number of occurrences of term in document . This weighting scheme is referred to as term frequency and is denoted , with the subscripts denoting the term and the document in order.
For a document , the set of weights determined by the weights above (or indeed any weighting function that maps the number of occurrences of in to a positive real value) may be viewed as a quantitative digest of that document. In this view of a document, known in the literature as the bag of words model , the exact ordering of the terms in a document is ignored but the number of occurrences of each term is material (in contrast to Boolean retrieval). We only retain information on the number of occurrences of each term. Thus, the document ``Mary is quicker than John'' is, in this view, identical to the document ``John is quicker than Mary''. Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content. We will develop this intuition further in Section 6.3 .
Before doing so we first study the question: are all words in a document equally important? Clearly not; in Section 2.2.2 (page ) we looked at the idea of stop words - words that we decide not to index at all, and therefore do not contribute in any way to retrieval and scoring.