

Computing vector scores

In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer $K$. We seek the $K$ documents of the collection with the highest vector space scores on the given query. Typically, we seek these $K$ top documents ordered by decreasing score; for instance, many search engines use $K=10$ to retrieve and rank-order the first page of the ten best results. Here we give the basic algorithm for this computation; we develop a fuller treatment of efficient techniques and approximations in Chapter 7.

Figure 6.14: The basic algorithm for computing vector space scores.
CosineScore(q)
1   float Scores[N] = 0
2   Initialize Length[N]
3   for each query term t
4   do calculate w_{t,q} and fetch postings list for t
5      for each pair (d, tf_{t,d}) in postings list
6      do Scores[d] += wf_{t,d} × w_{t,q}
7   Read the array Length[N]
8   for each d
9   do Scores[d] = Scores[d] / Length[d]
10  return Top K components of Scores[]

Figure 6.14 gives the basic algorithm for computing vector space scores. The array Length holds the lengths (normalization factors) for each of the $N$ documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the $K$ documents with the highest scores.

The outermost loop beginning at Step 3 repeats the updating of Scores, iterating over each query term $t$ in turn. In Step 4 we calculate the weight in the query vector for term $t$. Steps 5-6 update the score of each document by adding in the contribution from term $t$. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the $N$ elements of the array $Scores$ are therefore known as accumulators. For this purpose, it would appear necessary to store, with each postings entry, the weight $\mbox{wf}_{t,d}$ of term $t$ in document $d$ (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute $\mbox{idf}_t$; it suffices to store $N/\mbox{df}_t$ at the head of the postings for $t$. Second, we store only the (integer) term frequency $\mbox{tf}_{t,d}$ for each postings entry, from which $\mbox{wf}_{t,d}$ can be computed at query time. Finally, Step 10 extracts the top $K$ scores; this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than $2N$ comparisons to construct, following which each of the $K$ top scores can be extracted from the heap at a cost of $O(\log N)$ comparisons.
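
To make the term-at-a-time computation concrete, here is a minimal Python sketch of CosineScore. The data layout (a postings map from each term to its document frequency and a docID-sorted list of (docID, tf) pairs, together with a doc_length array of normalization factors) and the particular tf and idf weighting functions are illustrative assumptions, not the book's prescribed representation; heapq.nlargest stands in for the heap-based top-$K$ extraction of Step 10.

    import heapq
    import math
    from collections import defaultdict

    def cosine_score(query_terms, postings, doc_length, N, K):
        """Term-at-a-time scoring in the spirit of Figure 6.14.

        postings maps term t to (df_t, [(docID, tf), ...]);
        doc_length[docID] holds the precomputed normalization factor.
        """
        scores = defaultdict(float)            # the accumulators Scores[]
        for t in query_terms:                  # Step 3: iterate over query terms
            if t not in postings:
                continue
            df_t, plist = postings[t]
            w_tq = math.log10(N / df_t)        # Step 4: idf weight of t in the query
            for d, tf_td in plist:             # Steps 5-6: add t's contribution
                wf_td = 1 + math.log10(tf_td)  # weight recomputed from the stored integer tf
                scores[d] += wf_td * w_tq
        for d in scores:                       # Steps 8-9: length-normalize each score
            scores[d] /= doc_length[d]
        # Step 10: heap-based selection of the K highest-scoring documents
        return heapq.nlargest(K, scores.items(), key=lambda item: item[1])

Note that in this sketch only documents containing at least one query term ever acquire an accumulator, a common refinement over allocating all $N$ entries up front.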

Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6. In such a concurrent postings traversal we compute the score of one document at a time, so this approach is sometimes called document-at-a-time scoring. We will say more about this in Section 7.1.5.
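
For contrast, the following sketch scores document-at-a-time, merging the query terms' docID-sorted postings so that each document's score is completed before the next document is considered. The layout of term_postings (one list of (docID, contribution) pairs per query term, each contribution being a precomputed wf-idf weight) is again a hypothetical choice made for illustration.

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def daat_score(term_postings, doc_length, K):
        """Document-at-a-time scoring over docID-sorted postings lists."""
        # Traverse all the query terms' postings lists concurrently, by docID.
        merged = heapq.merge(*term_postings, key=itemgetter(0))
        heap = []  # min-heap of the K best (score, docID) pairs seen so far
        for d, group in groupby(merged, key=itemgetter(0)):
            # All of d's postings entries have been consumed, so its score is final.
            score = sum(c for _, c in group) / doc_length[d]
            if len(heap) < K:
                heapq.heappush(heap, (score, d))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, d))  # evict the current K-th best
        return sorted(heap, reverse=True)

Because each score is final the moment its document is passed, a document-at-a-time scorer needs to maintain only a $K$-element heap of current candidates rather than $N$ accumulators, which is one reason for its appeal.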


