LSA 354
Seven Lectures on Statistical Parsing
LSA 2007 Linguistic Institute

Chris Manning, Office: Gates 158. Office hours: Thu 10-12.
T/F 8:00-9:45 AM. (I, of course, apologize for the time, which isn't the one I would have chosen either.)
Hewlett 101. (That's the SEQ Teaching Center. Not the huge room.)
There is no assigned text. However, if you want some good background to get you started look at: Otherwise, we'll make good use of the lecture handouts and free, online conference papers.
Over the last decade, statistical parsing has transformed our ability to produce automatic, high-accuracy parses of arbitrary human language text. This course aims to teach from the basics up to the state-of-the-art in this domain. It will begin by reviewing the phenomena that motivated statistical approaches to parsing, context-free grammars (CFGs), and probabilistic CFGs. Next it will present basic parsing algorithms, concentrating on generalized CKY and A* parsing algorithms, and discuss treebanks, their design and nature, and the methods of building and evaluating parsers based on them. The course will then turn to the well-known and successful Collins and Charniak generative parsing models of the late 1990s, and discuss issues such as smoothing, head lexicalization, engineering for efficiency, and what kinds of information parsers use and need. Finally, we will turn to discriminative methods of parsing, and discuss both parse re-ranking techniques and the direct construction of discriminative parsers.
Prerequisites: Reasonable familiarity and competence with mathematical notation, probability, algorithmic thinking, and programming. (We're just going to dive into statistical parsing assuming that you already know about empirical computational linguistics, probabilities, and algorithms. If you should be taking LSA 325, I hope you are!)
Required Work
If enrolled for credit, you need to complete a project for this class! You should do one of the following:
  1. The Stanford cs224n parsing assignment. This is highly recommended if you don't have previous experience writing a statistical parser. But it's plenty of work. The project is to be completed in Java -- we provide a fair bit of starter code.
  2. Writing a parser within NLTK (in Python). One obvious hole in NLTK's coverage is that there is no dependency parser. But there are various other options, such as extending the bottom up PCFG chart parser to be an A* parser.
  3. A written problem set.
  4. Adapt an existing parser to work with another language or treebank (this may be very little or considerable work), and then to do some substantive experimentation on how differences in the annotation scheme or language affect parser performance, and what could be done to improve it, and to write some stuff up about that.
  5. Another project with the agreement of the instructor.

You should also look through the first two pages of the Stanford cs224n parsing assignment for a bit of introductory information on the computing environment at Stanford: machines you can log in to, where the Penn Treebank lives, etc.


Topic and Readings

Fri July 6

Overview of Course, Parsing and Statistical Parsing [ppt] [pdf]

Motivation for statistical parsing, recursive phrase structure. Attachment decisions and probabilities. The Penn Treebank. Top-down, and bottom-up parsing.

Tue July 10
Assignments announced.

PCFGs and the CKY algorithm [ppt] [pdf]

PCFGs. Grammar transformations. Recursive parsing and memoization. Dynamic programming for parsing: the CKY algorithm.

Fri July 13
[Black Friday!]

Generalized CKY parsing and unlexicalized parsing [ppt] [pdf]

Unaries and empties. Parser evaluation. Improving the context-freedom assumptions of grammars: accurate unlexicalized parsing.

Tue July 17

Search in parsing and lexicalized probabilistic parsing [ppt] [pdf]

Beam parsing. Agenda-based (chart) parsing. A* parsing. Lexicalized probabilistic context-free grammars: The Charniak (1997) model.

Fri July 20

Treebanks and statistical parsing [ppt] [pdf]

Lexicalized parsing: Collins (1997/1999). The status of information in treebanks.

Tue July 24
Assignments due

Multilingual parsing and dependency parsing [ppt] [pdf]

Fri July 27

Discriminative parsing [ppt] [pdf]

An introduction to discriminative parsing. Features in discriminative parsers, presented using some of Mark Johnson's slides on discriminative reranking.

Musical finale.