Next: Probabilistic information retrieval Up: XML retrieval Previous: References and further reading Contents Index

Exercises

Exercises.

Find a reasonably sized XML document collection (or a collection using a markup language different from XML like HTML) on the web and download it. Jon Bosak's XML edition of Shakespeare and of various religious works at http://www.ibiblio.org/bosak/ or the first 10,000 documents of the Wikipedia are good choices. Create three CAS topics of the type shown in Figure 10.3 that you would expect to do better than analogous CO topics. Explain why an XML retrieval system would be able to exploit the XML structure of the documents to achieve better retrieval results on the topics than an unstructured retrieval system.
For the collection and the topics in Exercise 10.7 , (i) are there pairs of elements and , with a subelement of such that both answer one of the topics? Find one case each where (ii) (iii) is the better answer to the query.
Implement the (i) SIMMERGE (ii) SIMNOMERGE algorithm in Section 10.3 and run it for the collection and the topics in Exercise 10.7 . (iii) Evaluate the results by assigning binary relevance judgments to the first five documents of the three retrieved lists for each algorithm. Which algorithm performs better?
Are all of the elements in Exercise 10.7 appropriate to be returned as hits to a user or are there elements (as in the example <b>definitely</b> on page 10.2 ) that you would exclude?
We discussed the tradeoff between accuracy of results and dimensionality of the vector space on page 10.3 . Give an example of an information need that we can answer correctly if we index all lexicalized subtrees, but cannot answer if we only index structural terms.
If we index all structural terms, what is the size of the index as a function of text size?
If we index all lexicalized subtrees, what is the size of the index as a function of text size?
Give an example of a query-document pair for which $\mbox{\textsc{SimNoMerge}}(q,d)$ is larger than 1.0.

Next: Probabilistic information retrieval Up: XML retrieval Previous: References and further reading Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07