Thus far, we have focused on retrieving precisely the highest-scoring documents for a query. We now consider schemes by which we produce
documents that are likely to be among the
highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the
documents we output, without materially altering the user's perceived relevance of the top
results. Consequently, in most applications it suffices to retrieve
documents whose scores are very close to those of the
best. In the sections that follow we detail schemes that retrieve
such documents while potentially avoiding computing scores for most of the
documents in the collection.
Such inexact top- retrieval is not necessarily, from the user's perspective, a bad thing. The top
documents by the cosine measure are in any case not necessarily the
best for the query: cosine similarity is only a proxy for the user's perceived relevance. In Sections 7.1.2 -7.1.6 below, we give heuristics using which we are likely to retrieve
documents with cosine scores close to those of the top
documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents. Having a large number of documents in contention also increases the selection cost in the final stage of culling the top
documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme: