Foundations of Statistical Natural Language Processing

Errata and Clarifications

This document contains corrections and clarifications to Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. The home page for the book is at

Note: It's divided into two sections: one for major content mistakes and clarifications, and the second for typos and minor thinko's. You should check both for errors on a page. Errors in pale gray were corrected in either the second or the sixth printing. That is, they are already corrected if you purchased the book in or after 2003.

Significant errors and clarifications

Chapter 2

page 40, line -9: Rather than "and arbitary complements and unions", more precisely "a set which is closed under complement and countable union of its elements". (Thanks to Lillian Lee <llee ....>)

page 77, (2.52): The numbering of the models here as zeroth, first, and second order follows the terminology of Shannon's original paper (1951). Note, however, that this terminology is not the same as the order of a Markov chain introduced higher up on the page: Shannon's second order model is a first order Markov chain. Shannon's number rather corresponds to the number n of the corresponding n-gram model (cf. p. 193). Moreover, the second order figure is wrong. Second order should be 3.32 bits, third order 3.1 bits, and 2.8 bits is a fourth order (4-gram) estimate that Cover and Thomas (1991) provide. (Thanks to "Sharon A. Caraballo" <caraball ....>)

Chapter 4

page 129, line 12: Korean should not be in the list of languages that do not segment text into words at all, since Korean has made some use of "word" spacing for more than 100 years! (Thanks to Kihwang Lee <leekh ....>) However, word identification can still be difficult. (Thanks to Tom Emerson <Tree ....>)

Chapter 5

page 165, unnumbered equation in middle of page: the numbers in the numerator and denominator should be 5.591 x 10-7 and 3.615 x 10-7 (Thanks to Chao-Lin Liu <chaolin ....>)

page 168, line 3: insert "independent" after "two" -- unless we assume independence (even though it is somewhat unjustified here), there should also be a covariance term. (Thanks to Lillian Lee <llee ....>)

page 171, footnote 5: The \phi^2 statistic is X^2 divided by N (not multiplied). This is correctly used as a measure of strength of association, whereas significance increases with sample size. (Thanks to Martin Jansche <jansche ....>)

Chapter 6

page 196, line -13: Change "This will be V^{n-1}" to "This will be V", given the following major clarification: In Section 6.1, the number of 'bins' is used to refer to the number of possible values of the classificatory feature vectors, while (unfortunately) from Section 6.2 on, with this change, the term 'bins' and the letter B is used to refer to the number of values of the target feature. This is V for prediction of the next word, but V^n for predicting the frequency of n-grams. (Thanks to Tibor Kiss <tibor ....>

page 202-203: While the whole corpus had 400,653 word types, the training corpus had only 273,266 word types. This smaller number should have been used as B in the calculation of a Laplace's law estimate of table 6.4 (whereas actually 400,653 was used). The result of this change is that f_{Lap}(0) = 0.000295, and then 99.96% of the probability mass is given to previously unseen bigrams (!). In such a model, note that we have used a (demonstrably wrong) closed vocabulary assumption, so despite this huge mass being given to unseen bigrams, none is being given to potential bigrams using vocabulary items outside the training set vocabulary (OOV = out of vocabulary items). (Thanks to Steve Renals <s.renals ....> and Gary Cottrell <gary ....>

page 205, line 2-3: Correction: here it is said that there are 14589 word types, but the number given elsewhere in the chapter (and the actual number found on rechecking the data file) is 14585. Clarification: Here we directly smooth the conditional distributions, so there are only |V| = 14585 values for the bigram conditional distribution added into the denominator during smoothing, whereas on pp. 202-203, we were estimating bigram probabilities, and there are |V|^2 different bigrams. (Thanks to Hidetosi Sirai <sirai ....>, Mark Lewellen <lewellen ....>, and Gary Cottrell <gary ....>

page 206, equation 6.10: The formula should be T_r/(N_rT), where T is the number of tokens in the held out data. The condition on the right should more clearly be "where C1(w1 ... wn) = r" (Thanks to Stephen Clark <stephenc ....> and Hidetosi Sirai <sirai ....>)

page 209, table 6.6: The totals shown are erroneously for only the first 10 scores, not all 11 (they should be 673 and 593), and then the remaining figures need to be recalculated (to be done). (Thanks to Klaus Hoermann <hoermann ....>)

page 209, line -10: The sums of squares indicate that the variances are close, not the totals, so this should read: 1,375.4 and 1,228.8 are close enough. (Thanks to Martin Jansche <jansche ....>)

page 211, line 2: Rather than "and then averages the two", more accurately it should be "and then does a weighted average of the two according to the proportion of words in N_r^0 versus N_r^1".

page 215, table 6.8: Clarification: The Good-Turing estimates here were smoothed, using the program available on the chapter website.

Chapter 7

page 237-238, equation at bottom of p. 237 and figure 7.1 line 4. The denominator on the right hand side should really be Sumt C(vt, sk) for both these equations: i.e., the total number of words appearing in sk contexts. (Thanks to Ruey-Lung, Hsiao <b2506023 ....>)

page 254, figure 7.8. 2 (a): The equation for h_{ik} should include a prior term:

h_{ik} = [ P(s_k) P(c_i | s_k) ] / [ \sum_{k=1}^{K} P(s_k) P( c_i | s_k) ]
2 (a): In the last line of the E-step, the product must be interpreted as j = 1, ..., J. This is a bit confusing, since we otherwise mainly iterate over the words in the context. It could alternatively have been expressed as Product vj in ci, and the exponent dropped. 2 (b): The M-step should be computed as follows. P(vj|sk) = [sum over (ci with vj in ci): hik] / Zk.
Zk = sum over j: [sum over (ci with vj in ci): hik].
(Thanks to Michael Chang <mchang ....>, Lillian Lee <llee ....>, and Henrik Thesing <Henrik.Thesing ....>)

Chapter 9

page 324, line -2: This constraint should rather be: for all j', j'', b_{ij'k} = b_{ij''k}. (I.e., the emission probabilities are the same regardless of the To state.) (Thanks to Philip Neal Whitman <pwhitman ....>)

page 326, line 3: "This process is referred to as decoding". This sentence should be moved to section 9.3.2. It is predicting the values of hidden variables (e.g., the Viterbi algorithm) which is referred to as decoding. If anything, this procedure is referred to as "evaluation" or "density estimation". (Thanks to Joao Ricardo Silva <i22094 ....>)

page 330, table 9.2: The entries for delta in the last column are wrong. They should be:
delta_CP(4) = 0.01323
delta_IP(4) = 0.00567
The score on the final line for P(X) should then also be 0.01323. (Thanks to Michael Roettger <roettm ....> and Bill MacCartney <wcmac ....>)

page 336, middle: the reestimated output B parameter for (IP, cola) should be 0.1463 (not 0.1363). (Thanks to Mike Jahr <mjahr ....>)

Chapter 10

page 347, equation 10.7: Repeat the argmax t_{1,n} after the final equals sign. (Thanks to Lillian Lee <llee ....>)

page 349, table 10.4: for the word bear, the entries for bear as a preposition and a noun should be swapped, so that bear appears 10 times as a noun and 0 times as a preposition. This affects exercises 10.5 and 10.6.

pages 349-350: In figure 10.2 in the tagging algorithm, the words of the input sentence of length n are referred to as w2,...,wn+1 (w1 is assumed to be '.'). But the text on page 349 and 350 regards the words as being w1,...,wn. It would have been consistent if figure 10.2 and the unnumbered equations on page 349 had put the beginning period in index position 0. (Thanks to Karl-Michael Schneider <schneide ....>)

Chapter 11

page 395-396: The restriction in the sum to g ne j is unnecessary and incorrect (it would mean that for a rule A -> B B that the outside probability contribution of the left hand B would not be counted). Simply delete the "ne j" condition in all 3 equations, and delete from ", but restrict" until the colon. (Thanks to Mark Johnson <Mark_Johnson ....>)

Chapter 13

page 488, bottom equation: denominator should be \sum_v z_{v,w_e}, and the line below should read "where the summation ranges over all French words v. (Thanks to Lillian Lee <llee ....>)

Chapter 14

page 507, equation 14.3: The v_i and w_i in the denominator of the fraction should be squared, and then under the assumption of normalized vectors, this denominator is just 1 and so the whole is the same as the dot product which should be v.w. [It is probably better to look at pp. 300-301 for this.] (Thanks to Mizuki Fujisawa <mfujisa ....>)

page 508, equation 14.6: The numerator should be: (s(ci) + s(cj))(s(ci) + s(cj))-(|ci|+|cj|). (Thanks to Fernando Diaz <fern ....>)

page 523, equation (14.20): the printed version errs by both omitting the priors, and wrongly trying to do a point probability of a continuous density function. It should be: (\pi_j n_j(x_i; \Theta))/(\sum_l \pi_l n_l(x_i; \Theta)) (Thanks to Lillian Lee <llee ....>)

Chapter 15

page 553, section 15.3.5: The definition of RIDF should read "the difference between the logs of actual inverse document frequency and inverse document frequency predicted by Poisson."

RIDF = IDF - log (1/(1-p(0;lambda-i))) = IDF + log(1-p(0;lambda-i))
The example values for RIDF should be 1.29 for insurance and 0.16 for try (whoops...).

pages 560-563: There are a number of mess-ups here. You might want to look at a (hopefully!) correct version of the pages (PDF file) [updated Dec 18, 2000]. (Thanks to Elizabeth Jessup <jessup ....> and the people below)

pages 560-561, figures 15.8 and 15.10: The matrices T and D should be transposed. In the text on line 9 of p. 561, the matrices should be restricted to their first k < n columns, rather than rows. (Thanks to Fred Popowich <popowich ....>)

page 562-563, figure 15.11 and line 11: The matrix B is misdescribed. It should be B = S_{2x2} D^T_{2xd}. Entry (dim.1, d_3) should be -0.44. In equation (15.13) the final term should be replaced by: A^T A = ... = D S^T S D^T = (S D^T)^T (S D^T). You might want to look at a (hopefully!) correct version of the pages (PDF file) [updated Aug 2, 2006]. (Thanks to Graeme Hirst <gh ....> and Sean Irvine <sean ....>)

page 562-563, table 15.9: Several entries in the matrix are not correct, look here for corrected entries. In the text, the corresponding similarities on line -2, page 562, should be (0.94, 0.93, 0.75), not (0.94, 0.93, 0.74), and on line 3, page 563, should be (0.94), not (0.88). (Thanks to Tadashi Yonezaki <yonezaki ....>, Steven Lee <steven.lee ....>, and Arthur Chan <archan ....>)

Chapter 16

page 580 equation (16.1) and text below and page 582, table 16.4: Although it is nowhere mentioned, logarithms in this section are natural logarithms (calculated to the base e). Natural logarithms are used to get the term weights shown at the bottom of page 580 and in table 16.3. The entropies of this section are also thus not measured in bits but in "nats". While measuring entropies in any base is perfectly correct (if done consistently!), it's confusing since everywhere else we have used entropies measured as logs to the base 2. On page 582, in bits, H of node 1 = 0.881, H of node 2 = 0.518, H of node 3 = 0.315. (Thanks to Phil Shinn <philshinn ....> and Margalit Zabludowski <margalit ....>)

pages 589-596: Our implementation of maxent at the time the book was published was buggy. The classification results and feature weights shown, and some of the discussion based on them, are thus wrong (classification accuracy should be 96.2%). Some of the explanations could have been better too. You should look at a (hopefully!) correct revised version of the pages (Postscript file) or in PDF. (Many thanks to Martin Jansche <jansche ....> for setting us straight on this.)

Typos and minor thinko's and fixups

Chapter 1

page 11, line -13: "changes" should read "change".

Chapter 2

page 40, line -6: "written 2^F" should be "written 2^Omega" (Thanks to Kim, Sang-Beom <bewise ....>

page 42, line -2: here and passim we use "symmetric" to describe a commutative binary function (whereas the term "symmetric" is more commonly applied to binary relations). (Thanks to Tibor Kiss <tibor ....>

page 46, example 4: "\sum_1^6 x p(y)" should be "\sum_1^6 y p(y)". (Thanks to Emiel Krahmer <e.j.krahmer ....>)

page 63, equation (2.29) should end with lowercase letters in log p(x,y). (Thanks to Jan Hajic <hajic ....>)

page 69, bottom two unnumbered equations: the probability distribution (roman) p in H(p) refers to the channel probability shown in the middle of figure 2.7, not to the distribution over the input. That is H(p(y|x)). Using the notation of page 65, the righthand sides could more clearly have been represented as H(Y) - H(p, 1-p) and 1 - H(p, 1-p), where (italic) p is the parameter shown in figure 2.8. (Thanks to Taher Haveliwala <taherh .... CS.Stanford.EDU>)

Chapter 3

page 84, line 6: The final consonant is /z/ after schwa, not /s/, at least in nearly all varieties of English. (Thanks to Brian Campbell <bpcam0 ....>)

page 90, line 1: delete ")" after -ed. (Thanks to Sean Irvine <sean ....>)

page 97, (3.39) "P -> IN NP" should be "PP -> IN NP". (Thanks to Frank Wallace <faw7 ....>)

page 99, figure 3.1: The tag DT should be replaced with AT (4 times). (Thanks to Vander Alves <v.alves ....>)

page 102, line -10: "Many language" --> "Many languages". (Thanks to Sean Irvine <sean ....>)

Chapter 4

page 119: "poetry can flaunt semantic expectations" --> "poetry can FLOUT semantic expectations". (Thanks to Graeme Hirst <gh ....>)

page 135, figure 4.1 line 6: "does not normally occur word finally" --> "does not normally occur sentence finally" (Thanks to Graeme Hirst <gh ....>)

page 139, example (4.9a): "&#x43;" should be "&#x3C;" (Thanks to Chao-Lin Liu <chaolin ....>)

Chapter 5

page 157, exercise 5.3: "stong" --> "strong" (Thanks to Tom Laureys <Tom.Laureys ....>)

page 161, line 7: change "randomness" to "variability". (Thanks to John Tukey)

page 163, line -1: "tasks means the we can" --> "tasks means that we can". (Thanks to Sun Honglin <cshlsun ....>)

page 164, footnote 4: "degress" --> "degrees" (Thanks to Dino Azzano <dino.azzano ....>

page 169, table 5.8: The bottom right entry should be 14287173. This number should then also appear in place of 14287181 in the calculation below equation 5.7 on page 170, but the result remains approximately 1.55. (Thanks to Dae Hwan Kim <dhk ....>)

pages 169-171: \chi^2 vs. X^2 was rather inconsistently used, but our intention was to distinguish the \chi^2 distribution from the X^2 test statistic, which has a \chi^2 distribution (asymptotically). Page 169, line -6: replace \chi^2 with X^2. Page 170, equation 5.7 and the line below equation 5.7: replace \chi^2 with X^2. Page 171 lines 5 and 6: replace \chi^2 with X^2.

page 173, equation 5.10: the b(.)'s here should be in roman type as the same binomial distribution shown in (5.9). Added that for the New York Times corpus that N = 14,307,668. The b(.,.,.) terms in the equation should be b(.;.,.) to be consistent with the usage above and elsewhere in the book. (Thanks to Mark-Jan Nederhof <markjan ....>)

page 175, line 1: reference to (5.11) should be to (5.10) (Thanks to Mark-Jan Nederhof <markjan ....>)

page 180, line 5: "Chambre de communes" --> "Chambre des communes" (Thanks to Gwillim Law <glaw ....>)

page 184, line 17: "for for" --> "for" (Thanks to Lillian Lee <llee ....>)

page 188, line -10: "International Bureau Machines" should read "International Business Machines"!

Chapter 6

page 204, line 16: "two much" --> "too much". (Thanks to Mark Lewellen <lewellen ....> and Jeremy Pickens <jeremy ....>

page 206, line 9: "bigrams" should be "n-grams". (Thanks to Chao-Lin Liu <chaolin ....>)

page 213, line 6: "estimate A and b" should be "estimate a and b". (Thanks to Tibor Kiss <tibor ....>

page 213, line 8: "log Nr = a + b log r" should be "log Nr = log a + b log r". (Thanks to Yiching Wu <w898702 ....>

Chapter 7

page 242, footnote 4: "Lawrence" --> "Laurence" (Thanks to Graeme Hirst <gh ....>)

Chapter 8

page 273, line 3: "indicate" --> "indicates" (Thanks to Lillian Lee <llee ....>)

page 274, line 10: "error rate for cue f^j" --> "error rate for cue c^j" (Thanks to Hoojung CHUNG <hjchung ....>)

page 279, line -5: "with the spoon" --> "with a spoon" (Thanks to Hoojung CHUNG <hjchung ....>)

page 282, equation 8.21: "P(NAp=0|v)" should be P(NAp=0|n) (Thanks to JooYoung Lee <jylee ....>)

page 283, line 4: "equation (8.22)" --> "equation (8.21)"

page 284, line 5: "formula (8.22)" --> "formula (8.21)"

page 295, line -8: "misions" --> "missions" (Thanks to Hidetosi Sirai <sirai ....>)

page 299, line 2: "X intersected with Y" --> "|X intersected with Y|" (Thanks to Martin Jansche <jansche ....>)

page 301, line 5, 8.41: The first x is missing a vector. (Thanks to Martin Jansche <jansche ....>)

page 303, line -6: Replace both instances of "row" by "column". (Probability distributions can of course be constructed using either rows or columns, but the example immediately below uses columns. (Thanks to Andy Kacsmar <andy .... DB.Stanford.EDU>)

page 313, line 5: "we we" --> "when we".

Chapter 10

page 342, line 15: "the word to" --> "the infinitive marker to"

page 347, line 3: replace comma with colon in numerator (Thanks to Adam Przepiorkowski <adamp ....>)

page 347, line -8: replace sentence in parentheses with: "To simplify our notation, we assume that t0 = PERIOD precedes the sentence."

page 347, equation 10.5: "P(t_n - 1 | t_{n-2})" should be "P(t_{n-1} | t_{n-2})". (Thanks to Lee, Do-Gil <dglee ....>)

page 348, Figure 10.1, line 8: replace comma with colon in numerator (Thanks to Adam Przepiorkowski <adamp ....>)

page 348, exercise 10.5: "shows" --> "shown".

page 350, line -12: "9.3.2" --> "9.3". (Thanks to Jim Thanos <jim_thans ....>)

page 355: first unnumbered equation: "P(t^j|t^{j+1)" should be "P(t^j|t^{j-1})". (Thanks to Lee, Do-Gil <dglee ....>)

page 355, numerator of second unnumbered equation: reverse order of arguments of "C", replace comma with colon

page 356, under example 10.11: the references to "P(a)", etc. should have the "a", "b", "c" in roman, referring to the items in 10.11.

Chapter 11

page 390, line -6 (unnumbered display): The node above box should be labeled N0 to be consistent with (11.13) on page 391. (Thanks to Kemal Oflazer <oflazer ....>)

page 400, unnumbered equation between (11.26) and (11.27): This should begin with E as an expectation, not P. (Thanks to Luc De Raedt <Luc.DeRaedt ....>)

Chapter 12

page 436, line 12: delete "unfortunately".

page 442, line 19: "see below" --> "see section 12.2.2".

page 444, line 11: add ")" after "cf. section 8.3".

page 454, final line above section 12.2.4: Delete ")" (Thanks to Lillian Lee <llee ....>)

page 458, (12.36j): Change "V" to "VBZ" (Thanks to Barbara Plank <bplank ....>)

Chapter 13

page 475, line 5: "alignments" --> "alignment". (Thanks to Lillian Lee <llee ....>)

page 481, middle: change L to m_A in the unnumber arg max equation and twice in the line below: both L and m_A were referring to the number of beads in the aligned texts. (Thanks to Lillian Lee <llee ....>)

Chapter 14

page 502, figure 14.2, line 7: "C" --> "|C|"; line 9: insert ":" before "=". figure 14.3, line 8: insert ":" before "=". (Thanks to Martin Jansche <jansche ....>)

page 520, line 19: change subset or equal sign to proper subset sign: X cannot be equal to R^m since it is a finite subset. (Thanks to Stephen Cronen-Townsend <crotown ....>)

page 520, equation (14.17): the 2nd argument of the function on the left hand side should be mu_j, not m_j -- as in equation (14.18) below. (Thanks to Lillian Lee <llee ....>)

page 520, equation (14.18): n should be n_j. (The indices on n here are just for identification, note). (Thanks to Lillian Lee <llee ....>)

page 521, table 14.5, header line 1: \mu_j should have a vector sign above it, and should be separated from \Sigma_j by a comma. (Thanks to Lillian Lee <llee ....>)

page 524, unnumbered equation defining Q: add extra open parenthesis before X, so one has "E(l((X,Z)|Theta)|X,Theta^k)". (Thanks to Lillian Lee <llee ....>)

Chapter 15

page 541, equation 15.3: move number to top of displayed equations.

page 542, line -9: add "relative" before "importance". (Thanks to John Tukey)

page 544, section 15.2.2, table 15.5: The entry for n in the column "Document frequency" should be: n (none)   1.

page 553, line -5 and unnumbered equation above it: For consistency, with the Poisson distribution parameter, we should have RIDFi, IDFi, and dfi. (Thanks to Piotr Gawrysiak <pgawrysiak ....>

page 571, line 4: "Lewis and Jones" --> "Lewis and Sparck Jones". (Thanks to Lillian Lee <llee ....>)

Chapter 16

pages 578, figure 16.1: The test for the bottom right node should be "vs ≥ 2". (Thanks to Barbara Plank <b.plank ....>)

page 584, line 4: add a comma after "set". (Thanks to Lillian Lee <llee ....>)

page 593, third bullet: "go to 2" should say "go to second bullet: Update the parameters ...". (In the 2nd printing the bullets are replaced with numbered items.) (Thanks to Martin Jansche <jansche ....>)

page 599, first (unnumbered) displayed equation: add subscript "j" to "x-vector". (Here we are describing evaluating the decision procedure on the training instances so as to adjust the weights and threshold, but the same decision procedure is used on test instances.). (Thanks to Lillian Lee <llee ....>)

page 607, line 4: "Yang (1998)" --> "Yang (1999)".


page 628, line -10: "Ralph Grishman Claudia Gdaniec" --> "Claudia Gdaniec, Ralph Grishman". (Thanks to Lillian Lee <llee ....>)

page 636, line 8: Lari and Young 1991 title should begin "Applications of stochastic context free grammars". (Thanks to Lillian Lee <llee ....>)

page 654, middle: Correct the entry for Yang (1998) to:
Yang, Yiming. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1: 69-90.


About a dozen index entries were added in the second printing

Christopher Manning and Hinrich Schütze -- last modified