

Query-term proximity

Especially for free text queries on the web (Chapter 19), users prefer a document in which most or all of the query terms appear close to each other, because this is evidence that the document has text focused on their query intent. Consider a query with two or more query terms $t_1, t_2, \ldots, t_k$. Let $\omega$ be the width of the smallest window in a document $d$ that contains all the query terms, measured in the number of words in the window. For instance, if the document consisted solely of the sentence The quality of mercy is not strained, the smallest window for the query strained mercy would be 4 (the span mercy is not strained). Intuitively, the smaller $\omega$ is, the better $d$ matches the query. When the document does not contain all of the query terms, we can set $\omega$ to some enormous number. We could also consider variants in which stop words are excluded when computing $\omega$. Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to the "soft conjunctive" semantics that Google and other web search engines evidently use.
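
To make the computation of $\omega$ concrete, here is a minimal sketch in Python of the standard sliding-window algorithm over a tokenized document. The function name smallest_window and the sentinel value used when a query term is missing are our own illustrative choices, not from the text.

    from collections import Counter

    def smallest_window(doc_tokens, query_terms, missing=10**9):
        """Width (in words) of the smallest window of doc_tokens that
        contains every term in query_terms; `missing` if no such window."""
        needed = set(query_terms)
        counts = Counter()   # query-term counts inside the current window
        covered = 0          # distinct query terms currently in the window
        best = missing
        left = 0
        for right, token in enumerate(doc_tokens):
            if token in needed:
                counts[token] += 1
                if counts[token] == 1:
                    covered += 1
            # Shrink from the left while the window still covers all terms.
            while covered == len(needed):
                best = min(best, right - left + 1)
                t = doc_tokens[left]
                if t in needed:
                    counts[t] -= 1
                    if counts[t] == 0:
                        covered -= 1
                left += 1
        return best

    doc = "the quality of mercy is not strained".split()
    print(smallest_window(doc, {"strained", "mercy"}))   # prints 4

Each token enters and leaves the window at most once, so the scan is linear in the document length; in a real engine the same idea would run over positional postings rather than raw token lists.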

How can we design such a proximity-weighted scoring function to depend on $\omega$? The simplest answer relies on a "hand coding" technique we introduce below in Section 7.2.3. A more scalable approach goes back to Section 6.1.2: we treat the integer $\omega$ as yet another feature in the scoring function, whose importance is assigned by machine learning, as developed further in Section 15.4.1.
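
As an illustration of the hand-coded variant, one might map $\omega$ to a bounded proximity feature and blend it linearly with the cosine score. The mapping and the weights below are hypothetical placeholders; in the machine-learned setting of Section 15.4.1 the weights would instead be fit from training data.

    def proximity_feature(omega, missing=10**9):
        """Map window width omega to [0, 1]: larger when the query
        terms are packed tightly, 0 when some query term is absent."""
        return 0.0 if omega >= missing else 1.0 / omega

    def proximity_weighted_score(cosine, omega, w_cos=0.7, w_prox=0.3):
        # Hand-coded linear blend of cosine similarity and proximity;
        # the weights 0.7 and 0.3 are illustrative assumptions only.
        return w_cos * cosine + w_prox * proximity_feature(omega)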

