Edit distance

It is well-known how to compute the (weighted) edit distance between two strings in time , where denotes the length of a string . The idea is to use the dynamic programming algorithm in Figure 3.5 , where the characters in and are given in array form. The algorithm fills the (integer) entries in a matrix whose two dimensions equal the lengths of the two strings whose edit distances is being computed; the entry of the matrix will hold (after the algorithm is executed) the edit distance between the strings consisting of the first characters of and the first characters of . The central dynamic programming step is depicted in Lines 8-10 of Figure 3.5 , where the three quantities whose minimum is taken correspond to substituting a character in , inserting a character in and inserting a character in .

Figure 3.6 shows an example Levenshtein distance computation of Figure 3.5 . The typical cell has four entries formatted as a cell. The lower right entry in each cell is the of the other three, corresponding to the main dynamic programming step in Figure 3.5 . The other three entries are the three entries or 1 depending on whether and . The cells with numbers in italics depict the path by which we determine the Levenshtein distance.

The spelling correction problem however demands more than computing edit distance: given a set of strings (corresponding to terms in the vocabulary) and a query string , we seek the string(s) in of least edit distance from . We may view this as a decoding problem, in which the codewords (the strings in ) are prescribed in advance. The obvious way of doing this is to compute the edit distance from to each string in , before selecting the string(s) of minimum edit distance. This exhaustive search is inordinately expensive. Accordingly, a number of heuristics are used in practice to efficiently retrieve vocabulary terms likely to have low edit distance to the query term(s).

The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string; the hope would be that spelling errors do not occur in the first character of the query. A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $. Consider the set of all rotations of the query string . For each rotation from this set, we traverse the B-tree into the permuterm index, thereby retrieving all dictionary terms that have a rotation beginning with . For instance, if is mase and we consider the rotation , we would retrieve dictionary terms such as semantic and semaphore that do not have a small edit distance to . Unfortunately, we would miss more pertinent dictionary terms such as mare and mane. To address this, we refine this rotation scheme: for each rotation, we omit a suffix of characters before performing the B-tree traversal. This ensures that each term in the set of terms retrieved from the dictionary includes a ``long'' substring in common with . The value of could depend on the length of . Alternatively, we may set it to a fixed constant such as .

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07