Computing vector scores

Figure 6.14 gives the basic algorithm for computing vector space scores. The array Length holds the lengths (normalization factors) for each of the documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the documents with the highest scores.

The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term in turn. In Step 5 we calculate the weight in the query vector for term . Steps 6-8 update the score of each document by adding in the contribution from term . This process of adding in contributions one query term at a time is sometimes known as *term-at-a-time* scoring or accumulation, and the elements of the array are therefore known as *accumulators* . For this purpose, it would appear necessary to store, with each postings entry, the weight
of term in document (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4 ). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using *inverse document frequency* , we need not precompute ; it suffices to store at the head of the postings for . Second, we store the term frequency
for each postings entry. Finally, Step 12 extracts the top scores - this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than comparisons to construct, following which each of the top scores can be extracted from the heap at a cost of comparisons.

Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6 .
In such a concurrent postings traversal we compute the scores of one document at a time, so that it is sometimes called *document-at-a-time* scoring. We will say more about this in Section 7.1.5 .

**Exercises.**

- If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.
- Recall the tf-idf weights computed in Exercise 6.2.2. Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.
- Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.3.3 is 1 (to within rounding error). Why is this the case?
- With term weights as computed in Exercise 6.3.3, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:
- The weight of a term is 1 if present in the query, 0 otherwise.
- Euclidean normalized idf.

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07