Especially for free text queries on the web (Chapter 19 ), users prefer a document in which most or all of the query terms appear close to each other, because this is evidence that the document has text focused on their query intent. Consider a query with two or more query terms, . Let be the width of the smallest window in a document that contains all the query terms, measured in the number of words in the window. For instance, if the document were to simply consist of the sentence The quality of mercy is not strained, the smallest window for the query strained mercy would be 4. Intuitively, the smaller that is, the better that matches the query. In cases where the document does not contain all of the query terms, we can set to be some enormous number. We could also consider variants in which only words that are not stop words are considered in computing . Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to the ``soft conjunctive'' semantics that Google and other web search engines evidently use.
How can we design such a proximity-weighted scoring function to depend on ? The simplest answer relies on a ``hand coding'' technique we introduce below in Section 7.2.3 . A more scalable approach goes back to Section 6.1.2 - we treat the integer as yet another feature in the scoring function, whose importance is assigned by machine learning, as will be developed further in Section 15.4.1 .