next up previous contents index
Next: Index elimination Up: Efficient scoring and ranking Previous: Efficient scoring and ranking   Contents   Index


Inexact top K document retrieval

Thus far, we have focused on retrieving precisely the $K$ highest-scoring documents for a query. We now consider schemes by which we produce $K$ documents that are likely to be among the $K$ highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the $K$ documents we output, without materially altering the user's perceived relevance of the top $K$ results. Consequently, in most applications it suffices to retrieve $K$ documents whose scores are very close to those of the $K$ best. In the sections that follow we detail schemes that retrieve $K$ such documents while potentially avoiding computing scores for most of the $N$ documents in the collection.

Such inexact top-$K$ retrieval is not necessarily, from the user's perspective, a bad thing. The top $K$ documents by the cosine measure are in any case not necessarily the $K$ best for the query: cosine similarity is only a proxy for the user's perceived relevance. In Sections 7.1.2 -7.1.6 below, we give heuristics using which we are likely to retrieve $K$ documents with cosine scores close to those of the top $K$ documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents. Having a large number of documents in contention also increases the selection cost in the final stage of culling the top $K$ documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme:

  1. Find a set $A$ of documents that are contenders, where $K < \vert A\vert\ll N$. $A$ does not necessarily contain the $K$ top-scoring documents for the query, but is likely to have many documents with scores near those of the top $K$.
  2. Return the $K$ top-scoring documents in $A$.
From the descriptions of these ideas it will be clear that many of them require parameters to be tuned to the collection and application at hand; pointers to experience in setting these parameters may be found at the end of this chapter. It should also be noted that most of these heuristics are well-suited to free text queries, but not for Boolean or phrase queries.


next up previous contents index
Next: Index elimination Up: Efficient scoring and ranking Previous: Efficient scoring and ranking   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07