Figure 6.14 gives the basic algorithm for computing vector space scores. The array Length holds the lengths (normalization factors) for each of the documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the
documents with the highest scores.
The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term in turn. In Step 5 we calculate the weight in the query vector for term
. Steps 6-8 update the score of each document by adding in the contribution from term
. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the
elements of the array
are therefore known as accumulators . For this purpose, it would appear necessary to store, with each postings entry, the weight
of term
in document
(we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4 ). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency , we need not precompute
; it suffices to store
at the head of the postings for
. Second, we store the term frequency
for each postings entry. Finally, Step 12 extracts the top
scores - this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than
comparisons to construct, following which each of the
top scores can be extracted from the heap at a cost of
comparisons.
Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6 . In such a concurrent postings traversal we compute the scores of one document at a time, so that it is sometimes called document-at-a-time scoring. We will say more about this in Section 7.1.5 .
Exercises.