Tf-idf weighting

We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by

$$\mbox{tf-idf}_{t,d} = \mbox{tf}_{t,d} \times \mbox{idf}_t. \qquad (22)$$

In other words, $\mbox{tf-idf}_{t,d}$ assigns to term $t$ a weight in document $d$ that is

highest when $t$ occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
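
The three cases above can be checked on a small example. The following is a minimal sketch, not a reference implementation: it assumes tf is the raw count of a term in a document and idf is $\log_{10}(N/\mbox{df}_t)$, with $N$ the number of documents and $\mbox{df}_t$ the number of documents containing $t$. The toy collection and the function names `idf` and `tf_idf` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is a list of tokens.
docs = {
    "d1": ["car", "car", "car", "insurance", "the"],
    "d2": ["auto", "insurance", "the"],
    "d3": ["best", "the"],
    "d4": ["the", "the"],
}

def idf(term, docs):
    """Assumed idf: log10(N / df_t).

    Returning 0.0 for a term that occurs in no document is a choice made
    for this sketch only.
    """
    n_docs = len(docs)
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(n_docs / df) if df else 0.0

def tf_idf(term, doc_id, docs):
    """tf-idf_{t,d} = tf_{t,d} * idf_t, with tf taken as the raw count."""
    tf = Counter(docs[doc_id])[term]
    return tf * idf(term, docs)

for term in ["car", "insurance", "the"]:
    print(term, round(tf_idf(term, "d1", docs), 3))
# car 1.806        (frequent in d1, found in 1 of 4 documents)
# insurance 0.301  (occurs once in d1, found in 2 of 4 documents)
# the 0.0          (found in every document, so idf is 0)
```

In this collection the weights in `d1` fall in the order described: the term concentrated in one document scores highest, and the term present in every document scores zero.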

At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (22). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document $d$ is the sum, over all query terms, of the number of times each of the query terms occurs in $d$. We can refine this idea so that we add up not the number of occurrences of each query term $t$ in $d$, but instead the tf-idf weight of each term in $d$.

$$\mbox{Score}(q,d)=\sum_{t\in q} \mbox{tf-idf}_{t,d}. \qquad (23)$$

In Section 6.3 we will develop a more rigorous form of Equation 23.
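
As an illustration of the vector view and of Equation 23, here is a minimal sketch that stores each document as a sparse tf-idf vector and ranks documents by the summed weights of the query terms. It assumes raw counts for tf and $\log_{10}(N/\mbox{df}_t)$ for idf; the collection, the query, and the names `build_vectors` and `score` are hypothetical and serve only as an example.

```python
import math
from collections import Counter

def build_vectors(docs):
    """Map each document to a sparse tf-idf vector {term: weight}.

    Terms that do not occur in a document are not stored, which
    corresponds to a weight of zero for that component.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Assumed weighting: raw tf times log10(N / df_t).
        vectors[doc_id] = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf}
    return vectors

def score(query_terms, vector):
    """Score(q, d): sum of tf-idf weights over the distinct query terms."""
    return sum(vector.get(t, 0.0) for t in set(query_terms))

# Hypothetical toy collection and query.
docs = {
    "d1": ["car", "car", "insurance"],
    "d2": ["auto", "insurance", "insurance"],
    "d3": ["best", "car"],
}
vectors = build_vectors(docs)
query = ["car", "insurance"]

ranking = sorted(docs, key=lambda d: score(query, vectors[d]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Storing only the nonzero components keeps each vector proportional in size to the document rather than to the dictionary, and a query term missing from a document contributes nothing to the sum.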

Exercises.

Why is the idf of a term always finite?
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9.

Figure 6.9: Table of tf values for Exercise 6.2.2.
$\begin{tabular}{||l|r|r|r||} \hline ... \\ insurance & 0 & 33 & 29 \\ best & 14 & 0 & 17 \\ \hline \end{tabular}$

Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.
Can the tf-idf weight of a term in a document exceed 1?
How does the base of the logarithm in (21) affect the score calculation in (23)? How does the base of the logarithm affect the relative scores of two documents on a given query?
If the logarithm in (21) is computed base 2, suggest a simple approximation to the idf of a term.

© 2008 Cambridge University Press
2009-04-07