Cluster pruning

Next: Components of an information Up: Efficient scoring and ranking Previous: Impact ordering Contents Index

Cluster pruning

In cluster pruning we have a preprocessing step during which we cluster the document vectors. Then at query time, we consider only documents in a small number of clusters as candidates for which we compute cosine scores. Specifically, the preprocessing step is as follows:

Pick $\sqrt{N}$ documents at random from the collection. Call these leaders.
For each document that is not a leader, we compute its nearest leader.

We refer to documents that are not leaders as followers. Intuitively, in the partition of the followers induced by the use of $\sqrt{N}$ randomly chosen leaders, the expected number of followers for each leader is $\approx N/\sqrt{N} = \sqrt{N}$ . Next, query processing proceeds as follows:

Given a query , find the leader that is closest to . This entails computing cosine similarities from to each of the $\sqrt{N}$ leaders.
The candidate set consists of together with its followers. We compute the cosine scores for all documents in this candidate set.

The use of randomly chosen leaders for clustering is fast and likely to reflect the distribution of the document vectors in the vector space: a region of the vector space that is dense in documents is likely to produce multiple leaders and thus a finer partition into sub-regions. This illustrated in Figure 7.3 .

**Figure 7.3:** Cluster pruning.
$\includegraphics[width=8cm]{clusterpruning.eps}$

Variations of cluster pruning introduce additional parameters and , both of which are positive integers. In the pre-processing step we attach each follower to its closest leaders, rather than a single closest leader. At query time we consider the leaders closest to the query . Clearly, the basic scheme above corresponds to the case . Further, increasing or increases the likelihood of finding documents that are more likely to be in the set of true top-scoring documents, at the expense of more computation. We reiterate this approach when describing clustering in Chapter 16 (page 16.1 ).

Exercises.

We suggested above (Figure 7.2 ) that the postings for static quality ordering be in decreasing order of . Why do we use the decreasing rather than the increasing order?
When discussing champion lists, we simply used the documents with the largest tf values to create the champion list for . But when considering global champion lists, we used idf as well, identifying documents with the largest values of $g(d)+\mbox{tf-idf}_{t,d}$ . Why do we differentiate between these two cases?
If we were to only have one-term queries, explain why the use of global champion lists with suffices for identifying the highest scoring documents. What is a simple modification to this idea if we were to only have -term queries for any fixed integer ?
Explain how the common global ordering by values in all high and low lists helps make the score computation efficient.
Consider again the data of Exercise 6.4.4 with nnn.atc for the query-dependent scoring. Suppose that we were given static quality scores of 1 for Doc1 and 2 for Doc2. Determine under Equation 35 what ranges of static quality score for Doc3 result in it being the first, second or third result for the query best car insurance.
Sketch the frequency-ordered postings for the data in Figure 6.9 .
Let the static quality scores for Doc1, Doc2 and Doc3 in Figure 6.11 be respectively 0.25, 0.5 and 1. Sketch the postings for impact ordering when each postings list is ordered by the sum of the static quality score and the Euclidean normalized tf values in Figure 6.11 .
The nearest-neighbor problem in the plane is the following: given a set of data points on the plane, we preprocess them into some data structure such that, given a query point , we seek the point in that is closest to in Euclidean distance. Clearly cluster pruning can be used as an approach to the nearest-neighbor problem in the plane, if we wished to avoid computing the distance from to every one of the query points. Devise a simple example on the plane so that with two leaders, the answer returned by cluster pruning is incorrect (it is not the data point closest to ).

Next: Components of an information Up: Efficient scoring and ranking Previous: Impact ordering Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07