
Tf-idf weighting

We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by


\begin{displaymath}
\mbox{tf-idf}_{t,d} = \mbox{tf}_{t,d} \times \mbox{idf}_t.
\end{displaymath} (22)
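
As a hypothetical numeric illustration, assume the logarithmic definition $\mbox{idf}_t = \log_{10}(N/\mbox{df}_t)$ from the preceding section, a collection of $N = 1000$ documents, and a term $t$ that occurs in $\mbox{df}_t = 10$ of them and three times in $d$ (all figures are invented for this sketch). Then

\begin{displaymath}
\mbox{tf-idf}_{t,d} = 3 \times \log_{10}\frac{1000}{10} = 3 \times 2 = 6.
\end{displaymath}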

In other words, $\mbox{tf-idf}_{t,d}$ assigns to term $t$ a weight in document $d$ that is

  1. highest when $t$ occurs many times within a small number of documents (thus lending high discriminating power to those documents);

  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

  3. lowest when the term occurs in virtually all documents.
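
The three cases above can be checked on a toy collection. The following is a minimal sketch, not a reference implementation: it assumes raw counts for $\mbox{tf}_{t,d}$, the base-10 definition $\mbox{idf}_t = \log_{10}(N/\mbox{df}_t)$, and whitespace tokenization. The function name `tf_idf_weights` and the four sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf-idf weight} dict per document.

    Assumptions (not fixed by the text): tf is the raw term count,
    idf_t = log10(N / df_t), and terms are split on whitespace.
    """
    n_docs = len(docs)
    term_counts = [Counter(doc.split()) for doc in docs]

    # df[t] = number of documents that contain t at least once
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())

    idf = {t: math.log10(n_docs / df_t) for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()}
            for counts in term_counts]


# Hypothetical toy collection.
docs = [
    "the car car car insurance",
    "the auto insurance",
    "the best car",
    "the weather",
]
weights = tf_idf_weights(docs)

for doc_id, doc_weights in enumerate(weights):
    print(doc_id, doc_weights)
```

In this collection, "car" occurs three times in the first document and in only two documents overall, so it receives the largest weight there; it occurs once in the third document, so its weight there is smaller; and "the" occurs in every document, so its idf, and hence its weight, is zero.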

At this point, we may view each document as a vector with one component for each term in the dictionary, where the weight of each component is given by (22). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document $d$ is the sum, over all query terms, of the number of times each query term occurs in $d$. We can refine this idea by summing not the number of occurrences of each query term $t$ in $d$, but the tf-idf weight of each query term in $d$:

\begin{displaymath}
\mbox{Score}(q,d)=\sum_{t\in q} \mbox{tf-idf}_{t,d}.
\end{displaymath} (23)
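
Equation 23 can be sketched in a few lines. The example below assumes each document is stored as a sparse mapping from term to tf-idf weight, so that terms absent from a document contribute zero; the name `overlap_score`, the document identifiers, and the weight values are all hypothetical and chosen only for illustration.

```python
def overlap_score(query_terms, doc_weights):
    """Sum the tf-idf weights of the query terms in one document (Equation 23).

    doc_weights maps term -> tf-idf weight; a term missing from the
    mapping does not occur in the document and contributes zero.
    """
    return sum(doc_weights.get(t, 0.0) for t in query_terms)


# Hypothetical sparse tf-idf vectors, one per document.
collection = {
    "d1": {"car": 0.9, "insurance": 0.3},
    "d2": {"auto": 0.6, "insurance": 0.3},
    "d3": {"car": 0.5, "best": 0.6},
}

query = ["car", "insurance"]

scores = {doc_id: overlap_score(query, w) for doc_id, w in collection.items()}

# Rank documents by decreasing score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Storing only the nonzero weights keeps each document vector small, since most dictionary terms do not occur in any given document.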

In Section 6.3 we will develop a more rigorous form of Equation 23.

Exercises.


© 2008 Cambridge University Press
2009-04-07