next up previous contents index
Next: k-gram indexes for spelling Up: Spelling correction Previous: Forms of spelling correction   Contents   Index


Edit distance

Given two character strings $s_1$ and $s_2$, the edit distance between them is the minimum number of edit operations required to transform $s_1$ into $s_2$. Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance . For example, the edit distance between cat and dog is 3. In fact, the notion of edit distance can be generalized to allowing different weights for different kinds of edit operations, for instance a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way depending on the likelihood of letters substituting for each other is very effective in practice (see Section 3.4 for the separate issue of phonetic similarity). However, the remainder of our treatment here will focus on the case in which all edit operations have the same weight.

It is well-known how to compute the (weighted) edit distance between two strings in time $O(\vert s_1\vert\times\vert s_2\vert)$, where $\vert s_i\vert$ denotes the length of a string $s_i$. The idea is to use the dynamic programming algorithm in Figure 3.5 , where the characters in $s_1$ and $s_2$ are given in array form. The algorithm fills the (integer) entries in a matrix $m$ whose two dimensions equal the lengths of the two strings whose edit distances is being computed; the $(i,j)$ entry of the matrix will hold (after the algorithm is executed) the edit distance between the strings consisting of the first $i$ characters of $s_1$ and the first $j$ characters of $s_2$. The central dynamic programming step is depicted in Lines 8-10 of Figure 3.5 , where the three quantities whose minimum is taken correspond to substituting a character in $s_1$, inserting a character in $s_1$ and inserting a character in $s_2$.

Figure 3.5: Dynamic programming algorithm for computing the edit distance between strings $s_1$ and $s_2$.
\begin{figure}\begin{algorithm}{EditDistance}{s_1,s_2}
int\ m[i,j] = 0\\
\begin...
...d{FOR}\\
\RETURN{m[\vert s_1\vert,\vert s_2\vert]}
\end{algorithm}
\end{figure}

Figure 3.6 shows an example Levenshtein distance computation of Figure 3.5 . The typical cell $[i,j]$ has four entries formatted as a $2\times 2$ cell. The lower right entry in each cell is the $\min$ of the other three, corresponding to the main dynamic programming step in Figure 3.5 . The other three entries are the three entries $m[i-1,j-1] + 0$ or 1 depending on whether $s_1[i]=s_2[j], m[i-1, j] + 1$ and $m[i, j-1] + 1$. The cells with numbers in italics depict the path by which we determine the Levenshtein distance.

\begin{figure}
% latex2html id marker 3900
\begin{tabular}{ \vert\vert c \vert\v...
... The cells in italics determine the edit distance in this example.}
\end{figure}

The spelling correction problem however demands more than computing edit distance: given a set ${\cal S}$ of strings (corresponding to terms in the vocabulary) and a query string $q$, we seek the string(s) in $V$ of least edit distance from $q$. We may view this as a decoding problem, in which the codewords (the strings in $V$) are prescribed in advance. The obvious way of doing this is to compute the edit distance from $q$ to each string in $V$, before selecting the string(s) of minimum edit distance. This exhaustive search is inordinately expensive. Accordingly, a number of heuristics are used in practice to efficiently retrieve vocabulary terms likely to have low edit distance to the query term(s).

The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string; the hope would be that spelling errors do not occur in the first character of the query. A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $. Consider the set of all rotations of the query string $q$. For each rotation $r$ from this set, we traverse the B-tree into the permuterm index, thereby retrieving all dictionary terms that have a rotation beginning with $r$. For instance, if $q$ is mase and we consider the rotation $r=\mbox{\term{sema}}$, we would retrieve dictionary terms such as semantic and semaphore that do not have a small edit distance to $q$. Unfortunately, we would miss more pertinent dictionary terms such as mare and mane. To address this, we refine this rotation scheme: for each rotation, we omit a suffix of $\ell$ characters before performing the B-tree traversal. This ensures that each term in the set $R$ of terms retrieved from the dictionary includes a ``long'' substring in common with $q$. The value of $\ell$ could depend on the length of $q$. Alternatively, we may set it to a fixed constant such as $2$.


next up previous contents index
Next: k-gram indexes for spelling Up: Spelling correction Previous: Forms of spelling correction   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07