This document contains corrections and clarifications to
*Foundations of Statistical Natural Language Processing*
by Christopher D. Manning and Hinrich Schütze. The
home page for the book is at
`http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3391`.

Note: This list is divided into two sections: one for major content mistakes and clarifications, and one for typos and minor thinkos. You should check both for errors on a given page. Errors in pale gray were corrected in either the second or the sixth printing; that is, they are already corrected if you purchased the book in or after 2003.

page 40, line -9: Rather than "and arbitrary complements and unions", more precisely "a set which is closed under complement and countable union of its elements". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 77, (2.52): The numbering of the models here as zeroth, first,
and second order follows the terminology of Shannon's original
paper (1951). Note, however, that this terminology is *not*
the same as the order of a Markov chain introduced higher up on
the page: Shannon's second order model is a first order Markov
chain. Shannon's number rather corresponds to the number *n*
of the corresponding *n*-gram model (cf. p. 193). Moreover,
the second order figure is wrong. Second order should be 3.32
bits, third order 3.1 bits, and 2.8 bits is a fourth order (4-gram)
estimate that Cover and Thomas (1991) provide.
(Thanks to "Sharon A. Caraballo"
<caraball .... cs.georgetown.edu>)

page 129, line 12: Korean should not be in the list of languages that do not segment text into words at all, since Korean has made some use of "word" spacing for more than 100 years! (Thanks to Kihwang Lee <leekh .... cogsci.ed.ac.uk>) However, word identification can still be difficult. (Thanks to Tom Emerson <Tree .... basistech.com>)

page 165, unnumbered equation in middle of page:
the numbers in the numerator and denominator should be
5.591 x 10^{-7} and 3.615 x 10^{-7}
(Thanks to
Chao-Lin Liu <chaolin .... cs.nccu.edu.tw>)

page 168, line 3: insert "independent" after "two" -- unless we assume independence (even though it is somewhat unjustified here), there should also be a covariance term. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 171, footnote 5: The \phi^2 statistic is X^2 divided by N (not multiplied). This is correctly used as a measure of strength of association, whereas significance increases with sample size. (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)
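The corrected relation can be illustrated with a minimal Python sketch (the 2x2 counts below are invented for illustration, not taken from the book): dividing by N makes \phi^2 invariant to sample size, while X^2 grows with it.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson's X^2 for a 2x2 contingency table [[a, b], [c, d]],
    using the standard shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def phi_square(a, b, c, d):
    """phi^2 = X^2 / N (divided, not multiplied): a sample-size-independent
    measure of strength of association."""
    return chi_square_2x2(a, b, c, d) / (a + b + c + d)

# Doubling every count doubles X^2 (significance grows with sample size)
# but leaves phi^2 (strength of association) unchanged.
x1, p1 = chi_square_2x2(10, 20, 30, 40), phi_square(10, 20, 30, 40)
x2, p2 = chi_square_2x2(20, 40, 60, 80), phi_square(20, 40, 60, 80)
```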

page 196, line -13: Change "This will be V^{n-1}" to "This will be V", given the following major clarification: In Section 6.1, the number of 'bins' is used to refer to the number of possible values of the classificatory feature vectors, while (unfortunately) from Section 6.2 on, with this change, the term 'bins' and the letter B is used to refer to the number of values of the target feature. This is V for prediction of the next word, but V^n for predicting the frequency of n-grams. (Thanks to Tibor Kiss <tibor .... linguistics.ruhr-uni-bochum.de>)

page 202-203: While the whole corpus had 400,653 word types, the training corpus had only 273,266 word types. This smaller number should have been used as B in the calculation of a Laplace's law estimate of table 6.4 (whereas actually 400,653 was used). The result of this change is that f_{Lap}(0) = 0.000295, and then 99.96% of the probability mass is given to previously unseen bigrams (!). In such a model, note that we have used a (demonstrably wrong) closed vocabulary assumption, so despite this huge mass being given to unseen bigrams, none is being given to potential bigrams using vocabulary items outside the training set vocabulary (OOV = out of vocabulary items). (Thanks to Steve Renals <s.renals .... dcs.shef.ac.uk> and Gary Cottrell <gary .... cs.ucsd.edu>)
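A minimal sketch of the calculation involved, with illustrative numbers rather than the book's corpus counts, showing why the choice of B matters:

```python
def laplace_estimate(count, n_tokens, n_bins):
    """Laplace's law (add-one smoothing): P = (C + 1) / (N + B),
    where B is the number of bins the probability mass is spread over."""
    return (count + 1) / (n_tokens + n_bins)

# For bigram estimation over a vocabulary of V word types there are
# B = V**2 bins, so the expected frequency of an unseen bigram,
# f_Lap(0) = N * P_Lap(0), is quite sensitive to which V is used.
V = 1_000        # illustrative vocabulary size, not the book's corpus
N = 50_000       # illustrative token count
f_lap_0 = N * laplace_estimate(0, N, V ** 2)
```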

page 205, line 2-3: Correction: here it is said that there are 14589 word types, but the number given elsewhere in the chapter (and the actual number found on rechecking the data file) is 14585. Clarification: Here we directly smooth the conditional distributions, so there are only |V| = 14585 values for the bigram conditional distribution added into the denominator during smoothing, whereas on pp. 202-203, we were estimating bigram probabilities, and there are |V|^2 different bigrams. (Thanks to Hidetosi Sirai <sirai .... sccs.chukyo-u.ac.jp>, Mark Lewellen <lewellen .... erols.com>, and Gary Cottrell <gary .... cs.ucsd.edu>)

page 206, equation 6.10: The formula should be T_r/(N_rT), where T
is the number of tokens in the held out data. The condition on
the right should more clearly be "where
C_{1}(*w*_{1} ...
*w*_{n}) = *r*"
(Thanks to
Stephen Clark <stephenc .... cogsci.ed.ac.uk> and Hidetosi Sirai
<sirai .... sccs.chukyo-u.ac.jp>)
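The corrected formula can be sketched as follows (a toy unigram version with invented data; the book applies it to n-grams):

```python
from collections import Counter

def held_out_estimates(train_tokens, held_out_tokens):
    """Held-out estimator: the probability of an item seen r times in
    training is T_r / (N_r * T), where N_r is the number of types with
    training count r, T_r is the total held-out count of those types,
    and T is the number of tokens in the held-out data."""
    train_counts = Counter(train_tokens)
    held_counts = Counter(held_out_tokens)
    T = len(held_out_tokens)
    N_r = Counter(train_counts.values())     # N_r: how many types have count r
    T_r = Counter()                          # T_r: held-out mass of those types
    for w, r in train_counts.items():
        T_r[r] += held_counts[w]
    return {r: T_r[r] / (N_r[r] * T) for r in N_r}

est = held_out_estimates(["a", "a", "b", "c"], ["a", "b", "b", "d"])
```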

page 209, table 6.6: The totals shown are erroneously for only the first 10 scores, not all 11 (they should be 673 and 593), and then the remaining figures need to be recalculated (to be done). (Thanks to Klaus Hoermann <hoermann .... sfs.nphil.uni-tuebingen.de>)

page 209, line -10: The sums of squares indicate that the variances are close, not the totals, so this should read: 1,375.4 and 1,228.8 are close enough. (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)

page 211, line 2: Rather than "and then averages the two", more accurately it should be "and then does a weighted average of the two according to the proportion of words in N_r^0 versus N_r^1".

page 215, table 6.8: Clarification: The Good-Turing estimates here were smoothed, using the program available on the chapter website.

page 237-238, equation at bottom of p. 237 and figure 7.1, line 4:
The denominator on the right hand side
should really be Sum_{t} C(v_{t}, s_{k}) for
both these equations: i.e., the total number of words appearing in
*s _{k}* contexts.
(Thanks to Ruey-Lung, Hsiao
<b2506023 .... csie.ntu.edu.tw>)

page 254, figure 7.8. 2 (a): The equation for h_{ik} should include a prior term:

Z_k = Sum_{j} Sum_{c_i : v_j in c_i} h_{ik}. (Thanks to Michael Chang <mchang .... ie.cuhk.edu.hk>, Lillian Lee <llee .... cs.cornell.edu>, and Henrik Thesing <Henrik.Thesing .... ruhr-uni-bochum.de>)

page 324, line -2: This constraint should rather be: for all j', j'', b_{ij'k} = b_{ij''k}. (I.e., the emission probabilities are the same regardless of the To state.) (Thanks to Philip Neal Whitman <pwhitman .... ling.ohio-state.edu>)

page 326, line 3: "This process is referred to as decoding". This sentence should be moved to section 9.3.2. It is predicting the values of hidden variables (e.g., the Viterbi algorithm) which is referred to as decoding. If anything, this procedure is referred to as "evaluation" or "density estimation". (Thanks to Joao Ricardo Silva <i22094 .... caravela.di.fc.ul.pt>)

page 330, table 9.2: The entries for delta in the last
column are wrong.
They should be:

delta_CP(4) = 0.01323

delta_IP(4) = 0.00567

The score on the final line for *P*(*X*) should then also be 0.01323.
(Thanks to Michael Roettger
<roettm .... uni-muenster.de> and Bill MacCartney <wcmac .... cs.stanford.edu>)
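For readers rechecking table 9.2, the Viterbi variables delta can be recomputed with a generic implementation along these lines (the two-state HMM parameters below are invented for illustration, not the book's soft drink machine):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Compute delta_i(t), the probability of the best state sequence
    ending in state i at time t, and backtrace the best path."""
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] * trans_p[p][s])
            delta[t][s] = delta[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # pick the best final state, then follow the backpointers
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return delta, path[::-1]

states = ("A", "B")
start = {"A": 0.9, "B": 0.1}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
delta, path = viterbi(["x", "y", "y"], states, start, trans, emit)
```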

page 336, middle: the reestimated output B parameter for (IP, cola) should be 0.1463 (not 0.1363). (Thanks to Mike Jahr <mjahr .... cs.stanford.edu>)

page 347, equation 10.7: Repeat the argmax t_{1,n} after the final equals sign. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 349, table 10.4: for the word *bear*,
the entries for
*bear* as a preposition and a noun should be swapped, so that
*bear* appears 10 times as a noun and 0 times as a preposition.
This affects exercises 10.5 and 10.6.

pages 349-350: In figure 10.2 in the tagging algorithm, the words of
the input sentence of length *n* are referred to as
*w*_{2},...,*w*_{n+1} (*w*_{1} is
assumed to be '.'). But the text on page 349 and 350 regards the words
as being *w*_{1},...,*w*_{n}. It would have
been consistent if figure 10.2 and the unnumbered equations on
page 349 had put the beginning period in index position 0.
(Thanks to Karl-Michael Schneider
<schneide .... phil.uni-passau.de>)

page 395-396: The restriction in the sum to *g* ≠ *j* is
unnecessary and incorrect (it would mean that for a rule A -> B B,
the outside probability contribution of the left-hand B would not be
counted). Simply delete the "≠ *j*" condition in all 3 equations,
and delete from ", but restrict" until the colon.
(Thanks to Mark Johnson
<Mark_Johnson .... Brown.edu>)

page 488, bottom equation: denominator should be \sum_v z_{v,w_e}, and the line below should read "where the summation ranges over all French words v. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 507, equation 14.3: The v_i and w_i in the denominator of the fraction should be squared, and then under the assumption of normalized vectors, this denominator is just 1 and so the whole is the same as the dot product which should be v.w. [It is probably better to look at pp. 300-301 for this.] (Thanks to Mizuki Fujisawa <mfujisa .... jaist.ac.jp>)
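The corrected formula can be sketched as follows (vectors invented for illustration):

```python
import math

def cosine(v, w):
    """Cosine similarity: the components in the denominator are squared."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    norm_w = math.sqrt(sum(wi ** 2 for wi in w))
    return dot / (norm_v * norm_w)

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# For unit-length vectors the denominator is 1, so the cosine reduces
# to the dot product v . w.
v, w = [3.0, 4.0], [4.0, 3.0]
dot_normalized = sum(a * b for a, b in zip(normalize(v), normalize(w)))
```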

page 508, equation 14.6: The numerator should be: (s(c_i) + s(c_j)) . (s(c_i) + s(c_j)) - (|c_i| + |c_j|). (Thanks to Fernando Diaz <fern .... monkey.org>)

page 523, equation (14.20): the printed version errs by both omitting the priors, and wrongly trying to do a point probability of a continuous density function. It should be: (\pi_j n_j(x_i; \Theta))/(\sum_l \pi_l n_l(x_i; \Theta)) (Thanks to Lillian Lee <llee .... cs.cornell.edu>)
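The corrected expression is the usual EM "responsibility" computation; a one-dimensional sketch (parameters invented for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x (a density value, not a probability)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibility(x, priors, mus, sigmas, j):
    """P(cluster j | x) = pi_j n_j(x) / sum_l pi_l n_l(x): the density
    values are weighted by the priors pi and normalized over clusters,
    so the result is a proper posterior probability."""
    denom = sum(p * normal_pdf(x, m, s) for p, m, s in zip(priors, mus, sigmas))
    return priors[j] * normal_pdf(x, mus[j], sigmas[j]) / denom

priors, mus, sigmas = [0.6, 0.4], [0.0, 3.0], [1.0, 1.0]
resp = [responsibility(1.0, priors, mus, sigmas, j) for j in range(2)]
```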

page 553, section 15.3.5: The definition of
RIDF should read "the
difference between
the logs of actual * inverse* document frequency and
*inverse* document frequency predicted by Poisson."

pages 560-563: There are a number of mess-ups here. You might want to look at a (hopefully!) correct version of the pages (PDF file) [updated Dec 18, 2000]. (Thanks to Elizabeth Jessup <jessup .... dot.cs.colorado.edu> and the people below)

pages 560-561, figures 15.8 and 15.10: The
matrices *T* and *D* should be
transposed. In the text on line 9 of p. 561, the matrices should be
restricted to their first *k* < *n* columns, rather than
rows.
(Thanks to Fred Popowich
<popowich .... cs.sfu.ca>)

page 562-563, figure 15.11 and line 11: The matrix B is misdescribed. It should be B = S_{2x2} D^T_{2xd}. Entry (dim.1, d_3) should be -0.44. In equation (15.13) the final term should be replaced by: A^T A = ... = D S^T S D^T = (S D^T)^T (S D^T). You might want to look at a (hopefully!) correct version of the pages (PDF file) [updated Aug 2, 2006]. (Thanks to Graeme Hirst <gh .... cs.toronto.edu> and Sean Irvine <sean .... intelligenesis.net>)

page 562-563, table 15.9: Several entries in the matrix are not correct, look here for corrected entries. In the text, the corresponding similarities on line -2, page 562, should be (0.94, 0.93, 0.75), not (0.94, 0.93, 0.74), and on line 3, page 563, should be (0.94), not (0.88). (Thanks to Tadashi Yonezaki <yonezaki .... ieee.org>, Steven Lee <steven.lee .... gmail.com>, and Arthur Chan <archan .... cs.cmu.edu>)
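The corrected decomposition on these pages can be checked numerically; a sketch using an invented term-document matrix (NumPy's SVD, with T the term matrix and D^T the document matrix):

```python
import numpy as np

# Invented t x d term-document matrix, purely for illustration.
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 0., 1.]])

T, s, Dt = np.linalg.svd(A, full_matrices=False)   # A = T S D^T
S = np.diag(s)

# B = S_{k x k} D^T_{k x d}: the documents in the reduced k-dim space.
k = 2
B = S[:k, :k] @ Dt[:k]

# Since T has orthonormal columns, A^T A = (S D^T)^T (S D^T).
SDt = S @ Dt
```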

page 580 equation (16.1) and text below and page 582, table 16.4:
Although it is nowhere mentioned, logarithms in this section are
natural logarithms (calculated to the base *e*). Natural
logarithms are used to get the term weights shown at the bottom of page
580 and in table 16.3.
The entropies of this section are also thus not measured in bits but in "nats".
While measuring entropies in any base is perfectly
correct (if done consistently!), it's confusing since everywhere
else we have used entropies measured as logs to the base 2.
On page 582, in bits, H of node 1 = 0.881, H of node 2 = 0.518, H of
node 3 = 0.315.
(Thanks to Phil Shinn
<philshinn .... mediaone.net> and
Margalit Zabludowski <margalit .... lexiquest.com>)
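The conversion between nats and bits is just a change of logarithm base; a small sketch:

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a distribution; base 2 gives bits, base e gives nats."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def nats_to_bits(h_nats):
    """Divide by ln 2 to convert an entropy measured in nats into bits."""
    return h_nats / math.log(2)

h_bits = entropy([0.5, 0.5])                  # measured in bits
h_nats = entropy([0.5, 0.5], base=math.e)     # same distribution, in nats
```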

pages 589-596: Our implementation of maxent at the time the book was published was buggy. The classification results and feature weights shown, and some of the discussion based on them, are thus wrong (classification accuracy should be 96.2%). Some of the explanations could have been better too. You should look at a (hopefully!) correct revised version of the pages (Postscript file) or in PDF. (Many thanks to Martin Jansche <jansche .... ling.ohio-state.edu> for setting us straight on this.)

page 11, line -13: "changes" should read "change".

page 40, line -6: "written 2^F" should be "written 2^Omega" (Thanks to Kim, Sang-Beom <bewise .... nlp.korea.ac.kr>)

page 42, line -2: here and *passim* we use "symmetric" to describe a
*commutative* binary function (whereas the term "symmetric" is more
commonly applied to binary relations).
(Thanks to Tibor Kiss
<tibor .... linguistics.ruhr-uni-bochum.de>)

page 46, example 4: "\sum_1^6 x p(y)" should be "\sum_1^6 y p(y)". (Thanks to Emiel Krahmer <e.j.krahmer .... tue.nl>)

page 63, equation (2.29) should end with lowercase letters in log p(x,y). (Thanks to Jan Hajic <hajic .... peregrine.cs.jhu.edu>)

page 69, bottom two unnumbered equations: the probability distribution
(roman) p in H(p) refers to the channel probability shown in the middle of
figure 2.7, not to the distribution over the input.
That is, H(p(*y*|*x*)).
Using the notation of page 65,
the righthand sides could more clearly have been represented as
H(Y) - H(p, 1-p) and 1 - H(p, 1-p),
where the (italic) *p* is the parameter shown in figure 2.8.
(Thanks to Taher Haveliwala
<taherh .... CS.Stanford.EDU>)

page 84, line 6: The final consonant is /z/ after schwa, not /s/, at least in nearly all varieties of English. (Thanks to Brian Campbell <bpcam0 .... student.umass.edu>)

page 90, line 1: delete ")" after *-ed*.
(Thanks to Sean Irvine
<sean .... intelligenesis.net>)

page 97, (3.39) "P -> IN NP" should be "PP -> IN NP". (Thanks to Frank Wallace <faw7 .... hotmail.com>)

page 99, figure 3.1: The tag DT should be replaced with AT (4 times). (Thanks to Vander Alves <v.alves .... comp.lancs.ac.uk>)

page 102, line -10: "Many language" --> "Many languages". (Thanks to Sean Irvine <sean .... intelligenesis.net>)

page 119: "poetry can flaunt semantic expectations" --> "poetry can FLOUT semantic expectations". (Thanks to Graeme Hirst <gh .... cs.toronto.edu>)

page 135, figure 4.1 line 6: "does not normally occur word finally" --> "does not normally occur sentence finally" (Thanks to Graeme Hirst <gh .... cs.toronto.edu>)

page 139, example (4.9a): "C" should be "<" (Thanks to Chao-Lin Liu <chaolin .... cs.nccu.edu.tw>)

page 157, exercise 5.3: "stong" --> "strong" (Thanks to Tom Laureys <Tom.Laureys .... esat.kuleuven.ac.be>)

page 161, line 7: change "randomness" to "variability". (Thanks to John Tukey)

page 163, line -1: "tasks means the we can" --> "tasks means that we can". (Thanks to Sun Honglin <cshlsun .... comp.polyu.edu.hk>)

page 164, footnote 4: "degress" --> "degrees" (Thanks to Dino Azzano <dino.azzano .... gmail.com>)

page 169, table 5.8: The bottom right entry should be 14287173. This number should then also appear in place of 14287181 in the calculation below equation 5.7 on page 170, but the result remains approximately 1.55. (Thanks to Dae Hwan Kim <dhk .... kle.postech.ac.kr>)

pages 169-171: \chi^2 vs. X^2 was rather inconsistently used, but our intention was to distinguish the \chi^2 distribution from the X^2 test statistic, which has a \chi^2 distribution (asymptotically). Page 169, line -6: replace \chi^2 with X^2. Page 170, equation 5.7 and the line below equation 5.7: replace \chi^2 with X^2. Page 171 lines 5 and 6: replace \chi^2 with X^2.

page 173, equation 5.10: the *b*(.)'s here
should be in roman type as the same binomial distribution shown
in (5.9). Added: for the *New York Times* corpus,
N = 14,307,668. The b(.,.,.) terms in the equation should be
b(.;.,.) to be consistent with the usage above and elsewhere in
the book.
(Thanks to Mark-Jan Nederhof
<markjan .... let.rug.nl>)

page 175, line 1: reference to (5.11) should be to (5.10) (Thanks to Mark-Jan Nederhof <markjan .... let.rug.nl>)

page 180, line 5: "Chambre de communes" --> "Chambre des communes" (Thanks to Gwillim Law <glaw .... measinc.com>)

page 184, line 17: "for for" --> "for" (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 188, line -10: "International Bureau Machines" should read "International Business Machines"!

page 204, line 16: "two much" --> "too much". (Thanks to Mark Lewellen <lewellen .... erols.com> and Jeremy Pickens <jeremy .... cs.umass.edu>)

page 206, line 9: "bigrams" should be "*n*-grams".
(Thanks to
Chao-Lin Liu <chaolin .... cs.nccu.edu.tw>)

page 213, line 6: "estimate *A* and *b*" should be
"estimate *a* and *b*".
(Thanks to Tibor Kiss
<tibor .... linguistics.ruhr-uni-bochum.de>)

page 213, line 8: "log *N _{r}* =

page 242, footnote 4: "Lawrence" --> "Laurence" (Thanks to Graeme Hirst <gh .... cs.toronto.edu>)

page 273, line 3: "indicate" --> "indicates" (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 274, line 10: "error rate for cue f^j" --> "error rate for cue c^j" (Thanks to Hoojung CHUNG <hjchung .... nlp.korea.ac.kr>)

page 279, line -5: "with the spoon" --> "with a spoon" (Thanks to Hoojung CHUNG <hjchung .... nlp.korea.ac.kr>)

page 282, equation 8.21: "P(NA_{p}=0|*v*)" should be P(NA_{p}=0|*n*)
(Thanks to JooYoung Lee
<jylee .... nlp.korea.ac.kr>)

page 283, line 4: "equation (8.22)" --> "equation (8.21)"

page 284, line 5: "formula (8.22)" --> "formula (8.21)"

page 295, line -8: "misions" --> "missions" (Thanks to Hidetosi Sirai <sirai .... sccs.chukyo-u.ac.jp>)

page 299, line 2: "X intersected with Y" --> "|X intersected with Y|" (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)

page 301, line 5, 8.41: The first x is missing a vector. (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)

page 303, line -6: Replace both instances of "row" by "column". (Probability distributions can of course be constructed using either rows or columns, but the example immediately below uses columns.) (Thanks to Andy Kacsmar <andy .... DB.Stanford.EDU>)

page 313, line 5: "we we" --> "when we".

page 342, line 15: "the word to" --> "the infinitive marker to"

page 347, line 3: replace comma with colon in numerator (Thanks to Adam Przepiorkowski <adamp .... bach.ipipan.waw.pl>)

page 347, line -8: replace sentence in parentheses with: "To simplify our notation, we assume that t0 = PERIOD precedes the sentence."

page 347, equation 10.5: "P(t_n - 1 | t_{n-2})" should be "P(t_{n-1} | t_{n-2})". (Thanks to Lee, Do-Gil <dglee .... nlp.korea.ac.kr>)

page 348, Figure 10.1, line 8: replace comma with colon in numerator (Thanks to Adam Przepiorkowski <adamp .... bach.ipipan.waw.pl>)

page 348, exercise 10.5: "shows" --> "shown".

page 350, line -12: "9.3.2" --> "9.3". (Thanks to Jim Thanos <jim_thans .... hotmail.com>)

page 355: first unnumbered equation: "P(t^j|t^{j+1})" should be "P(t^j|t^{j-1})". (Thanks to Lee, Do-Gil <dglee .... nlp.korea.ac.kr>)

page 355, numerator of second unnumbered equation: reverse order of arguments of "C", replace comma with colon

page 356, under example 10.11: the references
to "*P*(*a*)", etc. should have the "a", "b", "c" in
roman, referring to the items in 10.11.

page 390, line -6 (unnumbered display): The node above *box*
should be labeled N^{0} to be consistent with (11.13) on
page 391.
(Thanks to Kemal Oflazer
<oflazer .... sabanciuniv.edu>)

page 400, unnumbered equation between (11.26) and (11.27): This
should begin with *E* as an expectation, not *P*.
(Thanks to Luc De Raedt
<Luc.DeRaedt .... cs.kuleuven.be>)

page 436, line 12: delete "unfortunately".

page 442, line 19: "see below" --> "see section 12.2.2".

page 444, line 11: add ")" after "cf. section 8.3".

page 454, final line above section 12.2.4: Delete ")" (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 458, (12.36j): Change "V" to "VBZ" (Thanks to Barbara Plank <bplank .... inf.unibz.it>)

page 475, line 5: "alignments" --> "alignment". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 481, middle: change L to m_A in the unnumbered arg max equation and twice in the line below: both L and m_A were referring to the number of beads in the aligned texts. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 502, figure 14.2, line 7: "C" --> "|C|"; line 9: insert ":" before "=". figure 14.3, line 8: insert ":" before "=". (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)

page 520, line 19: change subset or equal sign to proper subset sign: X cannot be equal to R^m since it is a finite subset. (Thanks to Stephen Cronen-Townsend <crotown .... cs.umass.edu>)

page 520, equation (14.17): the 2nd argument of the function on the left hand side should be mu_j, not m_j -- as in equation (14.18) below. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 520, equation (14.18): n should be n_j. (The indices on n here are just for identification, note). (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 521, table 14.5, header line 1: \mu_j should have a vector sign above it, and should be separated from \Sigma_j by a comma. (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 524, unnumbered equation defining Q: add extra open parenthesis before X, so one has "E(l((X,Z)|Theta)|X,Theta^k)". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 541, equation 15.3: move number to top of displayed equations.

page 542, line -9: add "relative" before "importance". (Thanks to John Tukey)

page 544, section 15.2.2, table 15.5: The entry for n in the column "Document frequency" should be: n (none) 1.

page 553, line -5 and unnumbered equation above it: For consistency,
with the Poisson distribution parameter, we should have RIDF_{i},
IDF_{i}, and df_{i}.
(Thanks to Piotr Gawrysiak
<pgawrysiak .... supermedia.pl>)

page 571, line 4: "Lewis and Jones" --> "Lewis and Sparck Jones". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 578, figure 16.1: The test for the bottom right node should be "vs ≥ 2". (Thanks to Barbara Plank <b.plank .... rug.nl>)

page 584, line 4: add a comma after "set". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 593, third bullet: "go to 2" should say "go to second bullet: Update the parameters ...". (In the 2nd printing the bullets are replaced with numbered items.) (Thanks to Martin Jansche <jansche .... ling.ohio-state.edu>)

page 599, first (unnumbered) displayed equation: add subscript "j" to "x-vector". (Here we are describing evaluating the decision procedure on the training instances so as to adjust the weights and threshold, but the same decision procedure is used on test instances.). (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 607, line 4: "Yang (1998)" --> "Yang (1999)".

page 628, line -10: "Ralph Grishman Claudia Gdaniec" --> "Claudia Gdaniec, Ralph Grishman". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 636, line 8: Lari and Young 1991 title should begin "Applications of stochastic context free grammars". (Thanks to Lillian Lee <llee .... cs.cornell.edu>)

page 654, middle: Correct the entry for Yang
(1998) to:

Yang, Yiming. 1999. An evaluation of statistical approaches to text
categorization. *Information Retrieval* 1: 69-90.

About a dozen index entries were added in the second printing.

Christopher Manning and Hinrich Schütze