Next: Exercises Up: XML retrieval Previous: Text-centric vs. data-centric XML Contents Index

References and further reading

There are many good introductions to XML, including (Harold and Means, 2004). Table 10.1 is inspired by a similar table in (van Rijsbergen, 1979). Section 10.4 follows the overview of INEX 2002 by Gövert and Kazai (2003), published in the proceedings of the meeting (Fuhr et al., 2003a). The proceedings of the four following INEX meetings were published as Fuhr et al. (2003b), Fuhr et al. (2005), Fuhr et al. (2006), and Fuhr et al. (2007). An uptodate overview article is Fuhr and Lalmas (2007). The results in Table 10.4 are from (Kamps et al., 2006). Chu-Carroll et al. (2006) also present evidence that XML queries increase precision compared with unstructured queries. Instead of coverage and relevance, INEX now evaluates on the related but different dimensions of exhaustivity and specificity (Lalmas and Tombros, 2007). Trotman et al. (2006) relate the tasks investigated at INEX to real world uses of structured retrieval such as structured book search on internet bookstore sites.

The structured document retrieval principle is due to Chiaramella et al. (1996). Figure 10.5 is from (Fuhr and Großjohann, 2004). Rahm and Bernstein (2001) give a survey of automatic schema matching that is applicable to XML. The vector-space based XML retrieval method in Section 10.3 is essentially IBM Haifa's JuruXML system as presented by Mass et al. (2003) and Carmel et al. (2003). Schlieder and Meuss (2002) and Grabs and Schek (2002) describe similar approaches. Carmel et al. (2003) represent queries as XML fragments . The trees that represent XML queries in this chapter are all XML fragments, but XML fragments also permit the operators , and phrase on content nodes.

We chose to present the vector space model for XML retrieval because it is simple and a natural extension of the unstructured vector space model in Chapter 6 . But many other unstructured retrieval methods have been applied to XML retrieval with at least as much success as the vector space model. These methods include language models (cf. Chapter 12 , e.g., Kamps et al. (2004), Ogilvie and Callan (2005), List et al. (2005)), systems that use a relational database as a backend (Theobald et al., 2008;2005, Mihajlovic et al., 2005), probabilistic weighting (Lu et al., 2007), and fusion (Larson, 2005). There is currently no consensus as to what the best approach to XML retrieval is.

Most early work on XML retrieval accomplished relevance ranking by focusing on individual terms, including their structural contexts, in query and document. As in unstructured information retrieval, there is a trend in more recent work to model relevance ranking as combining evidence from disparate measurements about the query, the document and their match. The combination function can be tuned manually (Sigurbjörnsson et al., 2004, Arvola et al., 2005) or trained using machine learning methods (Vittaut and Gallinari (2006), cf. mls).

An active area of XML retrieval research is focused retrieval (Trotman et al., 2007), which aims to avoid returning nested elements that share one or more common subelements (cf. discussion in Section 10.2 , page 10.2 ). There is evidence that users dislike redundancy caused by nested elements (Betsi et al., 2006). Focused retrieval requires evaluation measures that penalize redundant results lists (Kazai and Lalmas, 2006, Lalmas et al., 2007). Trotman and Geva (2006) argue that XML retrieval is a form of passage retrieval . In passage retrieval (Kaszkiel and Zobel, 1997, Salton et al., 1993, Hearst and Plaunt, 1993, Hearst, 1997, Zobel et al., 1995), the retrieval system returns short passages instead of documents in response to a user query. While element boundaries in XML documents are cues for identifying good segment boundaries between passages, the most relevant passage often does not coincide with an XML element.

In the last several years, the query format at INEX has been the NEXI standard proposed by Trotman and Sigurbjörnsson (2004). Figure 10.3 is from their paper. O'Keefe and Trotman (2004) give evidence that users cannot reliably distinguish the child and descendant axes. This justifies only permitting descendant axes in NEXI (and XML fragments). These structural constraints were only treated as ``hints'' in recent INEXes. Assessors can judge an element highly relevant even though it violates one of the structural constraints specified in a NEXI query.

An alternative to structured query languages like NEXI is a more sophisticated user interface for query formulation (Tannier and Geva, 2005, van Zwol, 2006, Woodley and Geva, 2006).

A broad overview of XML retrieval that covers database as well as IR approaches is given by Amer-Yahia and Lalmas (2006) and an extensive reference list on the topic can be found in (Amer-Yahia et al., 2005). Chapter 6 of Grossman and Frieder (2004) is a good introduction to structured text retrieval from a database perspective. The proposed standard for XQuery is available at http://www.w3.org/TR/xquery/ including an extension for full-text queries (Amer-Yahia et al., 2006): http://www.w3.org/TR/xquery-full-text/. Work that has looked at combining the relational database and the unstructured information retrieval approaches includes (Fuhr and Rölleke, 1997), (Navarro and Baeza-Yates, 1997), (Cohen, 1998), and (Chaudhuri et al., 2006).

Next: Exercises Up: XML retrieval Previous: Text-centric vs. data-centric XML Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07