next up previous contents index
Next: Critiques and justifications of Up: Evaluation in information retrieval Previous: Evaluation of ranked retrieval   Contents   Index

Assessing relevance

To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system. These information needs are best designed by domain experts. Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs.

Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human beings. For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling , where relevance is assessed over a subset of the collection that is formed from the top $k$ documents returned by a number of different IR systems (usually the ones to be evaluated), and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.

Table 8.2: Calculating the kappa statistic.
    Judge 2 Relevance  
  Yes  No Total
Judge 1 Yes 300  20 320
Relevance No 10  70 80
  Total 310  90 400

Observed proportion of the times the judges agreed $P(A) = (300+70)/400 = 370/400 = 0.925$
Pooled marginals $P(nonrelevant) = (80+90)/(400+400) = 170/800 = 0.2125$
$P(relevant) = (320+310)/(400+400) = 630/800 = 0.7878$
Probability that the two judges agreed by chance $P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7878^2 = 0.665$
Kappa statistic $\kappa = (P(A) - P(E))/(1-P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776$

A human is not a device that reliably reports a gold standard judgment of relevance of a document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and variable. But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Nevertheless, it is interesting to consider and measure how much agreement between judges there is on relevance judgments. In the social sciences, a common measure for agreement between judges is the kappa statistic . It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.

\mbox{\emph{kappa}} = \frac{P(A) - P(E)}{1 - P(E)}
\end{displaymath} (46)

where $P(A)$ is the proportion of the times the judges agreed, and $P(E)$ is the proportion of the times they would be expected to agree by chance. There are choices in how the latter is estimated: if we simply say we are making a two-class decision and assume nothing more, then the expected chance agreement rate is 0.5. However, normally the class distribution assigned is skewed, and it is usual to use marginal statistics to calculate expected agreement.[*]There are still two ways to do it depending on whether one pools the marginal distribution across judges or uses the marginals for each judge separately; both forms have been used, but we present the pooled version because it is more conservative in the presence of systematic differences in assessments across judges. The calculations are shown in Table 8.2 . The kappa value will be 1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if they are worse than random. If there are more than two judges, it is normal to calculate an average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise cutoffs depend on the purposes for which the data will be used.

Interjudge agreement of relevance has been measured within the TREC evaluations and for medical IR collections. Using the above rules of thumb, the level of agreement normally falls in the range of ``fair'' (0.67-0.8). The fact that human agreement on a binary relevance judgment is quite modest is one reason for not requiring more fine-grained relevance labeling from the test set creator. To answer the question of whether IR evaluation results are valid despite the variation of individual assessors' judgments, people have experimented with evaluations taking one or the other of two judges' opinions as the gold standard. The choice can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.

next up previous contents index
Next: Critiques and justifications of Up: Evaluation in information retrieval Previous: Evaluation of ranked retrieval   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.