next up previous contents index
Next: General wildcard queries Up: Dictionaries and tolerant retrieval Previous: Search structures for dictionaries   Contents   Index


Wildcard queries

Wildcard queries are used in any of the following situations: (1) the user is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney); (2) the user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants (e.g., color vs. colour); (3) the user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming (e.g., judicial vs. judiciary, leading to the wildcard query judicia*); (4) the user is uncertain of the correct rendition of a foreign word or phrase (e.g., the query Universit* Stuttgart).

A query such as mon* is known as a trailing wildcard query , because the * symbol occurs only once, at the end of the search string. A search tree on the dictionary is a convenient way of handling trailing wildcard queries: we walk down the tree following the symbols m, o and n in turn, at which point we can enumerate the set $W$ of terms in the dictionary with the prefix mon. Finally, we use $\vert W\vert$ lookups on the standard inverted index to retrieve all documents containing any term in $W$.

But what about wildcard queries in which the * symbol is not constrained to be at the end of the search string? Before handling this general case, we mention a slight generalization of trailing wildcard queries. First, consider leading wildcard queries, or queries of the form *mon. Consider a reverse B-tree on the dictionary - one in which each root-to-leaf path of the B-tree corresponds to a term in the dictionary written backwards: thus, the term lemon would, in the B-tree, be represented by the path root-n-o-m-e-l. A walk down the reverse B-tree then enumerates all terms $R$ in the vocabulary with a given prefix.

In fact, using a regular B-tree together with a reverse B-tree, we can handle an even more general case: wildcard queries in which there is a single * symbol, such as se*mon. To do this, we use the regular B-tree to enumerate the set $W$ of dictionary terms beginning with the prefix se, then the reverse B-tree to enumerate the set $R$ of terms ending with the suffix mon. Next, we take the intersection $W\cap R$ of these two sets, to arrive at the set of terms that begin with the prefix se and end with the suffix mon. Finally, we use the standard inverted index to retrieve all documents containing any terms in this intersection. We can thus handle wildcard queries that contain a single * symbol using two B-trees, the normal B-tree and a reverse B-tree.



Subsections
next up previous contents index
Next: General wildcard queries Up: Dictionaries and tolerant retrieval Previous: Search structures for dictionaries   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07