This enumeration can be expensive if we find many corrections of the individual terms, since we could encounter a large number of combinations of alternatives. Several heuristics are used to trim this space. In the example above, as we expand the alternatives for flew and form, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users. For instance, we would retain flew from as an alternative to try and extend to a three-term corrected query, but perhaps not fled fore or flea form. In this example, the biword fled fore is likely to be rare compared to the biword flew from. Then, we only attempt to extend the list of top biwords (such as flew from), to corrections of Heathrow. As an alternative to using the biword statistics in the collection, we may use the logs of queries issued by users; these could of course include queries with spelling errors.

**Exercises.**

- If denotes the length of string , show that the edit distance between and is never more than
- Compute the edit distance between paris and
alice. Write down the array of distances
between all prefixes as computed by the algorithm in Figure 3.5 .
- Write pseudocode showing the details of computing on the fly the Jaccard coefficient while scanning the postings of the -gram index, as mentioned on page 3.3.4 .
- Compute the Jaccard coefficients between the query bord and each of the terms in Figure 3.7 that contain the bigram or.
- Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?
- For each of the prefixes of the query -- catched, catched in and catched in the -- we have a number of substitute prefixes arising from each term and its alternatives. Suppose that we were to retain only the top 10 of these substitute prefixes, as measured by its number of occurrences in the collection. We eliminate the rest from consideration for extension to longer prefixes: thus, if batched in is not one of the 10 most common 2-term queries in the collection, we do not consider any extension of batched in as possibly leading to a correction of catched in the rye. How many of the possible substitute prefixes are we eliminating at each phase?
- Are we guaranteed that retaining and extending only the 10 commonest substitute prefixes of catched in will lead to one of the 10 commonest substitute prefixes of catched in the?

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07