Thus far, we have focused on retrieving precisely the highest-scoring documents for a query. We now consider schemes by which we produce documents that are likely to be among the highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the documents we output, without materially altering the user's perceived relevance of the top results. Consequently, in most applications it suffices to retrieve documents whose scores are very close to those of the best. In the sections that follow we detail schemes that retrieve such documents while potentially avoiding computing scores for most of the documents in the collection.
Such inexact top- retrieval is not necessarily, from the user's perspective, a bad thing. The top documents by the cosine measure are in any case not necessarily the best for the query: cosine similarity is only a proxy for the user's perceived relevance. In Sections 7.1.2 -7.1.6 below, we give heuristics using which we are likely to retrieve documents with cosine scores close to those of the top documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents. Having a large number of documents in contention also increases the selection cost in the final stage of culling the top documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme: