next up previous contents index
Next: Impact ordering Up: Efficient scoring and ranking Previous: Champion lists   Contents   Index


Static quality scores and ordering

We now further develop the idea of champion lists, in the somewhat more general setting of static quality scores . In many search engines, we have available a measure of quality $g(d)$ for each document $d$ that is query-independent and thus static. This quality measure may be viewed as a number between zero and one. For instance, in the context of news stories on the web, $g(d)$ may be derived from the number of favorable reviews of the story by web surfers. otherindexing provides further discussion on this topic, as does Chapter 21 in the context of web search.

The net score for a document $d$ is some combination of $g(d)$ together with the query-dependent score induced (say) by (27). The precise combination may be determined by the learning methods of Section 6.1.2 , to be developed further in Section 15.4.1 ; but for the purposes of our exposition here, let us consider a simple sum:

\begin{displaymath}
\mbox{net-score}(q,d)= g(d)+\frac{\vec{V}(q)\cdot \vec{V}(d)}{\vert\vec{V}(q)\vert \vert\vec{V}(d)\vert}.
\end{displaymath} (35)

In this simple form, the static quality $g(d)$ and the query-dependent score from (24) have equal contributions, assuming each is between 0 and 1. Other relative weightings are possible; the effectiveness of our heuristics will depend on the specific relative weighting.

First, consider ordering the documents in the postings list for each term by decreasing value of $g(d)$. This allows us to perform the postings intersection algorithm of Figure 1.6 . In order to perform the intersection by a single pass through the postings of each query term, the algorithm of Figure 1.6 relied on the postings being ordered by document IDs. But in fact, we only required that all postings be ordered by a single common ordering; here we rely on the $g(d)$ values to provide this common ordering. This is illustrated in Figure 7.2 , where the postings are ordered in decreasing order of $g(d)$.

\includegraphics[width=13cm]{gd.eps} A static quality-ordered index.In this example we assume that Doc1, Doc2 and Doc3 respectively have static quality scores $g(1)=0.25, g(2)=0.5, g(3)=1$.

The first idea is a direct extension of champion lists: for a well-chosen value $r$, we maintain for each term $t$ a global champion list of the $r$ documents with the highest values for $g(d)+\mbox{tf-idf}_{t,d}$. The list itself is, like all the postings lists considered so far, sorted by a common order (either by document IDs or by static quality). Then at query time, we only compute the net scores (35) for documents in the union of these global champion lists. Intuitively, this has the effect of focusing on documents likely to have large net scores.

We conclude the discussion of global champion lists with one further idea. We maintain for each term $t$ two postings lists consisting of disjoint sets of documents, each sorted by $g(d)$ values. The first list, which we call high, contains the $m$ documents with the highest tf values for $t$. The second list, which we call low, contains all other documents containing $t$. When processing a query, we first scan only the high lists of the query terms, computing net scores for any document on the high lists of all (or more than a certain number of) query terms. If we obtain scores for $K$ documents in the process, we terminate. If not, we continue the scanning into the low lists, scoring documents in these postings lists. This idea is developed further in Section 7.2.1 .


next up previous contents index
Next: Impact ordering Up: Efficient scoring and ranking Previous: Champion lists   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07