Inexact top K document retrieval

Next: Index elimination Up: Efficient scoring and ranking Previous: Efficient scoring and ranking Contents Index

Inexact top K document retrieval

Thus far, we have focused on retrieving precisely the highest-scoring documents for a query. We now consider schemes by which we produce documents that are likely to be among the highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the documents we output, without materially altering the user's perceived relevance of the top results. Consequently, in most applications it suffices to retrieve documents whose scores are very close to those of the best. In the sections that follow we detail schemes that retrieve such documents while potentially avoiding computing scores for most of the documents in the collection.

Such inexact top- retrieval is not necessarily, from the user's perspective, a bad thing. The top documents by the cosine measure are in any case not necessarily the best for the query: cosine similarity is only a proxy for the user's perceived relevance. In Sections 7.1.2 -7.1.6 below, we give heuristics using which we are likely to retrieve documents with cosine scores close to those of the top documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents. Having a large number of documents in contention also increases the selection cost in the final stage of culling the top documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme:

Find a set of documents that are contenders, where $K < \vert A\vert\ll N$ . does not necessarily contain the top-scoring documents for the query, but is likely to have many documents with scores near those of the top .
Return the top-scoring documents in .

From the descriptions of these ideas it will be clear that many of them require parameters to be tuned to the collection and application at hand; pointers to experience in setting these parameters may be found at the end of this chapter. It should also be noted that most of these heuristics are well-suited to free text queries, but not for Boolean or phrase queries.

Next: Index elimination Up: Efficient scoring and ranking Previous: Efficient scoring and ranking Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07