In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall from page 10 the basic setting we assume in structured retrieval: the collection consists of structured documents and queries are either structured (as in Figure 10.3 ) or unstructured (e.g., summer holidays).
The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare's plays for Macbeth's castle, should we return the scene, the act or the entire play in Figure 10.2 ? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.
One criterion for selecting the most appropriate part of a document is the structured document retrieval principle :
Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query
title#"Macbeth"
applied to Figure 10.2 . The
title of the tragedy,
Macbeth, and the title of Act I, Scene
vii, Macbeth's castle, are both good hits because they
contain the matching term Macbeth.
But in this case, the title of the tragedy,
the higher node, is preferred. Deciding which level of the
tree is right for answering a query is difficult.
Parallel to the issue of which parts of a document to return
to the user
is the issue of which parts of a document to index.
In Section 2.1.2 (page ), we discussed the need for a
document unit or indexing unit in indexing and retrieval. In unstructured
retrieval, it is usually clear what the right document unit
is: files on your desktop, email messages, web pages on the
web etc. In structured retrieval,
there are a number of different approaches to defining
the indexing unit.
One approach is to group nodes into non-overlapping pseudodocuments as shown in Figure 10.5 . In the example, books, chapters and sections have been designated to be indexing units, but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudodocuments may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in Figure 10.5 merges three disparate elements, the class, author and title elements.
We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books or the play element for Shakespeare's works. We can then postprocess search results to find for each book or play the subelement that is the best hit. For example, the query Macbeth's castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii as the best-matching subelement. Unfortunately, this two-stage retrieval process fails to return the best subelement for many queries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it.
Instead of retrieving large units and identifying subelements (top down), we can also search all leaves, select the most relevant ones and then extend them to larger units in postprocessing (bottom up). For the query Macbeth's castle in Figure 10.1 , we would retrieve the title Macbeth's castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act or the play. This approach has a similar problem as the last one: The relevance of a leaf element is often not a good predictor of the relevance of elements it is contained in.
The least restrictive approach is to index all
elements. This is also problematic. Many
XML elements are not meaningful search results, e.g.,
typographical elements like
<b>definitely</b>
or an ISBN
number which cannot be interpreted without context.
Also, indexing all elements means that search results
will be highly redundant.
For the query Macbeth's castle and the document in Figure 10.1 ,
we would return all of
the
play, act, scene and title
elements on the path between the root node and
Macbeth's castle.
The leaf node would then occur four times in the result set,
once directly and three times as part of other elements.
We call elements that are contained within each other
nested .
Returning redundant nested elements in a list of
returned hits
is not very user-friendly.
Because of the redundancy caused by nested elements it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:
If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type. But as we discussed in the introduction, users often don't know what the name of an element in the collection is (Is the Vatican a country or a city?) or they may not know how to compose structured queries at all.
A challenge in XML retrieval related to nesting is that we may need to
distinguish different contexts of a term when we compute
term statistics for ranking, in particular inverse document
frequency ( idf ) statistics as defined in Section 6.2.1 (page ).
For example, the term Gates under the
node author is unrelated to an occurrence under a
content node like section if used to refer to the
plural of gate. It makes little
sense to compute a single document frequency for
Gates in this example.
One solution is to compute idf for
XML-contextterm pairs, e.g., to compute different
idf weights for author#"Gates"
and
section#"Gates"
. Unfortunately, this scheme will run
into sparse data problems - that is, many XML-context
pairs occur too rarely to reliably estimate df (see
Section 13.2 , page 13.2 , for a
discussion of sparseness).
A compromise is only to consider the parent node of the term
and not the rest of the path from the root to
to
distinguish contexts.
There are still conflations of
contexts that are harmful in this scheme. For instance, we
do not distinguish names of
authors and names of corporations if both have the
parent node name.
But most important distinctions, like the
example contrast
author#"Gates"
vs. section#"Gates"
, will be respected.
In many cases, several different XML schemas occur in a
collection since the XML documents in an IR application
often come from more than one source. This phenomenon is
called
schema heterogeneity
or
schema diversity and
presents yet
another challenge.
As illustrated in
Figure 10.6 comparable elements may have different
names: creator in vs. author in
. In other cases,
the structural
organization of the schemas may be different: Author names are
direct descendants of the node author in
,
but there are the intervening nodes firstname and
lastname in
. If we employ strict matching of
trees, then
will retrieve neither
nor
although both documents are relevant.
Some form
of approximate matching of element names in combination with
semi-automatic matching of different document structures can
help here. Human editing of correspondences of elements in
different schemas will usually do better than automatic
methods.
Schema heterogeneity is one reason for query-document
mismatches like and
. Another reason is
that users often are not familiar with the element names
and the structure of the schemas of collections
they search as mentioned. This poses a challenge for interface design in
XML retrieval. Ideally, the user interface should expose
the tree structure of the collection and allow users to
specify the elements they are querying. If we take this
approach, then
designing the
query interface in structured retrieval is more complex than a search box for
keyword queries in unstructured retrieval.
We can also support the user by interpreting all
parent-child relationships in queries as descendant
relationships with any number of intervening nodes
allowed. We call such queries
extended queries . The tree in Figure 10.3 and in Figure 10.6 are examples of
extended queries. We show edges that are interpreted as
descendant relationships as dashed arrows. In
, a
dashed arrow connects book and
Gates.
As a pseudo-XPath notation for
,
we adopt
book//#"Gates"
: a book that somewhere in its structure
contains the word Gates where the path from the
book node to Gates can be arbitrarily long.
The pseudo-XPath notation for the extended query that in addition
specifies that Gates occurs
in a section
of the book
is book//section//#"Gates"
.
It is convenient for users to be able to
issue such extended queries without having to specify the exact
structural configuration in which a query term should occur
- either because they do not care about the exact
configuration or because they do not know enough about the
schema of the collection to be able to specify it.
In Figure 10.7 , the user is looking for a
chapter entitled FFT (). Suppose there is no
such chapter in the collection, but that there are
references to books on FFT (
). A reference to a book on
FFT is not exactly what the user is looking for, but it is
better than returning nothing. Extended queries do not
help here. The extended query
also returns nothing. This is a case
where we may want to interpret the structural constraints
specified in the query as hints as opposed to as strict
conditions. As we will discuss in
Section 10.4 , users prefer a relaxed interpretation of
structural constraints: Elements that do not meet structural
constraints perfectly should be ranked lower, but they
should not be omitted from search results.