Ling 289: Quantitative and Probabilistic Explanation in Linguistics
Fall 2007 Syllabus

Course Syllabus

(updated 2007/11/26)

This is a tentative syllabus and is subject to change (hit reload!).





Week 1




Monday, 24 Sep 07

Introduction: Motivation of probabilistic and statistical approaches.



Linguistics: What motivates probabilistic approaches and statistical methodology in linguistics? Problems of categoricity. The greater explanatory power of probabilistic models. Some examples.


R. H. Baayen. Forthcoming. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, chapter 1.

Peter Dalgaard. 2002. Introductory Statistics with R, chapter 7. [A good intro to R!]

Tree pattern matching programs and tregex/tgrep/R exercise

Tregex Patterns

Supplemental readings:

Steven Abney. 1996. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, MA: MIT Press. [online Stanford access]

Christopher D. Manning. 2003. Probabilistic Syntax. In Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds.), Probabilistic Linguistics. Cambridge, MA: MIT Press.



Wednesday, 26 Sep 07

Introduction to tregex/tgrep2 and R.

HW #1



John Goldsmith. 2007. Probability for linguists. MS, U. Chicago. pdf

Rice, John A. Mathematical Statistics and Data Analysis. 2nd edition. Duxbury Press, 1995, chapter 1.

Supplemental reading:

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Chapter 2, pp. 39-54, 60-68, 72-76.



Week 2



Monday, 1 Oct 07

Basic concepts in probability; statistics on contingency tables; the idea of hypothesis testing



Statistics: Probability intro: counting, basic probability laws, maximum likelihood; discrete and continuous distributions. Independence. Contingency table data: the chi-squared test and Fisher's exact test.


Dalgaard, Peter. 2002. Introductory Statistics with R. Springer.

Gruesomely detailed reference reading:

Agresti, Alan. 2002. Categorical Data Analysis. 2nd edition. Wiley. Ch. 1-3.



Wednesday, 3 Oct 07

Grammatical weight and ambiguity avoidance.

 HW #2

 HW #1


The principle of end weight; gradient phenomena.

Statistics: Exploratory Data Analysis (EDA). R Graphics. Box plots. Scatter plots.


Wasow, Thomas. 2002. Postverbal Behavior. CSLI Publications. Chapter 2.

R. H. Baayen. Forthcoming. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, chapter 2.



Week 3



Monday, 8 Oct 07

Continued discussion of end weight. Probabilistic grammars. Constructing models and examining their goodness of fit. Comparing models.




Suppes, Patrick. 1970. Probabilistic grammars for natural languages. Synthèse 22: 95-116. Here's a spreadsheet for Table 1 of Suppes.

Statistics: Bernoulli, binomial, and multinomial distributions. Samples and statistical inference, estimating parameters, the method of maximum likelihood, maximum likelihood for multinomial cell probabilities. Tests for proportions or odds. Likelihood ratios: log odds ratios and the G² test.

Some notes on contingency table statistics.

allege/assume vs. that/zero statistics spreadsheet.

Supplemental readings:

Roland and Jurafsky. Verb Sense and Verb Subcategorization Probabilities. CUNY 1998. 



Wednesday, 10 Oct 07

Continued discussion of Suppes. Linear regression models. Gradience in grammaticality. Magnitude estimation.

 HW #3

 HW #2

Linguistics: Magnitude Estimation for linguistic data

Bard, Ellen Gurman, Robertson, Dan, and Sorace, Antonella. 1996. Magnitude Estimation of Linguistic Acceptability. Language 72: 32-68. That link only works on campus. You can also get to it off campus by starting here and getting it from JSTOR or ProjectMuse, or you can get it using the class password here. Tyler.

Statistics: Mean, median, and variance. Linear regression: simple and multiple linear regression. The idea of building probabilistic models for linguistic explanation.

Keith Johnson's site/draft textbook. In particular, chapter 3 discusses the EMMA data set, and here is the chaindata.txt EMMA data set. Here are the kinds of commands I used for basic and multiple linear regressions, now including checking for linearity and homoscedasticity.

Supplemental readings:

Sorace, A. 2000. Gradients in auxiliary selection with intransitive verbs. Language 76: 859-890.

Keller, Frank and Antonella Sorace. 2003. Gradient Auxiliary Selection and Impersonal Passivization in German: An Experimental Investigation. Journal of Linguistics 39:1, 57-108.

Keller, Frank and Ash Asudeh. 2001. Constraints on Linguistic Coreference: Structural vs. Pragmatic Factors. In Johanna D. Moore and Keith Stenning, eds., Proceedings of the 23rd Annual Conference of the Cognitive Science Society, 483-488. Mahwah, NJ: Lawrence Erlbaum.



Week 4



Monday, 15 Oct 07

Logistic regression models of systemic choice



Statistics: Logistic regression

R. H. Baayen. Forthcoming. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, chapter 6. [The ratings data set that he begins the chapter with was also discussed in Sections 2.2-2.4 and 4.3-4.4 of the book. You may need to look back at those sections to familiarize yourself with what the data is about.]

Supplemental readings:

Sankoff, D. 1988. Variable rules. In U. Ammon, N. Dittmar, and K. J. Mattheier (eds.), Sociolinguistics: An International Handbook of the Science of Language and Society. Vol. 2, pp. 984-997. Berlin: Walter de Gruyter.

Fred L. Ramsey and Daniel W. Schafer. 1997. The Statistical Sleuth: A Course in Methods of Data Analysis. Belmont, CA: Duxbury Press, chapter 20, pp. 564-583.

Labov, William. 1969. Contraction, deletion and inherent variability of the English copula. Language 45, 715-62, extract.

Linguistics: A case study on a sociolinguistics dataset (Cedergren's 1973 study of final /s/ deletion in Panamanian Spanish)



Wednesday, 17 Oct 07

Logistic regression modeling for syntax

 HW #4

 HW #3

The prehistory and constraint interactions.


Active vs. passive variation. Modeling the choice with logistic regression (a.k.a. Varbrul)

E. Judith Weiner and William Labov. 1983. Constraints on the agentless passive. Journal of Linguistics 19: 29-58. Nic.

Homework 4:

Homework 4 instructions

Cedergren's Panamanian Spanish Varbrul instructions

Cedergren's Panamanian Spanish data, in a format suitable for use in R:
cedegren <- read.table("cedegren.txt", header=TRUE)

Cedergren's Panamanian Spanish data in long format, suitable for use in R with the Design package:
ced.long <- read.table("cedegren-long.txt", header=TRUE)

Linguistics and statistics:

Robert Sigley. 2003. The importance of interaction effects. Language Variation and Change.


(HW#4: see items under the column with subjects and readings...)


Week 5



Monday, 22 Oct 07

Chris is away






Wednesday, 24 Oct 07

Chris is away






Week 6



Monday, 29 Oct 07

Logistic regression and maximum entropy models. Stochastic Phonotactics.


 HW #4

Statistics: Generalized linear models. The connection between logistic regression and maximum entropy models.

Linguistics: Stochastic phonotactics.

Bruce Hayes and Colin Wilson. To appear. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry. Draft of Aug 2007.


Talk to Chris about final project!

Wednesday, 31 Oct 07

Chris is away (again!)






Week 7



Monday, 5 Nov 07

Logistic regression in great detail. Part I.



Statistics: More on logistic regression
Logistic regression notes, worked through on the Cedergren data
Miscellaneous R notes



Wednesday, 7 Nov 07

Logistic regression in great detail. Part II.


Project outline

Statistics: Logistic regression: interaction terms, G² log likelihood ratios, summary and long form data, etc.
Same readings as Monday.



Week 8



Monday, 12 Nov 07

ANOVA models and constituent ordering revisited.



Statistics: Analysis of Variance (ANOVA)


Arnold, Jennifer, Thomas Wasow, Ash Asudeh, and Peter Alrenga. 2004. Avoiding Attachment Ambiguities: The Role of Constituent Ordering. Journal of Memory and Language 51(1): 55-70. Sven.



Wednesday, 14 Nov 07

Mixed effects models.



Mainly statistics:

T. Florian Jaeger. 2007. Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. Draft, 2007. [local copy]

Frank Harrell. 2004-2007. Problems Caused by Categorizing Continuous Variables. (See also the linked Java applet. It's neat.)



Monday, 19 Nov 07







Wednesday, 21 Nov 07







Week 9



Monday, 26 Nov 07

More on mixed effects models.



Baayen, R.H., Davidson, D.J. and Bates, D.M. (submitted). Mixed-effects modeling with crossed random effects for subjects and items. MS, 2007.

Herbert H. Clark. 1973. The Language-as-Fixed-Effect Fallacy: A Critique of Language Statistics in Psychological Research. Journal of Verbal Learning and Verbal Behavior 12: 335-359.

R. H. Baayen. Forthcoming. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, sections 7.0-7.3.



Wednesday, 28 Nov 07

Exploratory data analysis, logistic regression and logistic mixed effects models for syntax




Generalized Linear Mixed Models notes

Classification accuracy. Evaluating model fit.

R. H. Baayen. Forthcoming. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, section 7.4.


Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Harald Baayen. 2007. Predicting the Dative Alternation. In Cognitive Foundations of Interpretation, ed. by G. Bouma, I. Kraemer, and J. Zwarts. Amsterdam: Royal Netherlands Academy of Science, pp. 69-94. 33 pages.



Week 10



Monday, 3 Dec 07

The s-genitive and the of-genitive in English. Determining systemic choices: Optimality Theory, Stochastic Optimality Theory, and logistic regression. Model comparisons.




Gerhard Jäger and Anette Rosenbach. 2004. The winner takes it all - almost. Cumulativity in grammatical variation. MS, University of Potsdam and University of Düsseldorf. Rob.

Rob's handout on Jäger and Rosenbach

Supplemental readings:

Stochastic Optimality Theory: the introduction of Paul Boersma and Bruce Hayes. 2001. Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32: 45-86.

Joan Bresnan, Shipra Dingare, and Christopher D. Manning. Soft Constraints Mirror Hard Constraints: Voice and Person in English and Lummi. Proceedings of the LFG01 Conference, pp. 13-32, Hong Kong.

Lin, Ying. 2005. Learning Stochastic OT: a Bayesian approach using Data Augmentation and Gibbs sampling. ACL 05. [This person really worked out the learning of Stochastic OT grammars properly. His web page is here.]

Anette Rosenbach. 2003. Aspects of iconicity and economy in the choice between the s-genitive and the of-genitive in English. In B. Mondorf and G. Rohdenburg (eds), Determinants of Grammatical Variation in English. Mouton de Gruyter, pp. 379-411.

Anette Rosenbach. 2005. Animacy versus weight as determinants of grammatical variation in English. Language 81(3): 613-644.

Altenberg, Bengt. 1982. The Genitive v. the of-Construction: A study of syntactic variation in 17th century English. Lund Studies in English 62. Lund: Gleerup.

Benedikt Szmrecsanyi and Lars Hinrichs. Submitted. Probabilistic determinants of genitive variation in spoken and written English: a multivariate comparison across time, space, and genres.

Per Boberg. 2007. The inflected genitive and the of-construction: A comparative corpus study of written East African, Indian, American and British English.



Wednesday, 5 Dec 07

Empirical probabilistic syntax. Back from the precipice.



Linguistics: Empiricism in linguistics.

Goldsmith, John. 2007. Towards a new empiricism: 1.6. MS, U. Chicago. Uriel.

Statistics: Bayes' Rule, Bayesian statistics, and the MDL principle.



Week 11 (i.e., we won't get to this!)











Model comparisons: Decision tree or so-called "analogic" models.




Ernestus, Mirjam Theresia Constantia, and Harald R. Baayen. 2003. Predicting the Unpredictable: Interpreting Neutralized Segments in Dutch. Language 79(1).



The End


Final paper