LSA 354
Seven Lectures on Statistical Parsing
LSA 2007 Linguistic Institute

COURSE INFORMATION
Instructor
Chris Manning, manning@cs.stanford.edu Office: Gates 158. Office hours: Thu 10-12.
Time
T/F 8:00-9:45 AM. (I, of course, apologize for the time, which isn't the one I would have chosen either.)
Location
Hewlett 101. (That's the SEQ Teaching Center. Not the huge room.)
Reading
There is no assigned text. However, if you want some good background to get you started look at: Otherwise, we'll make good use of the lecture handouts and free, online conference papers.
Description
Over the last decade, statistical parsing has transformed our ability to produce automatic, high-accuracy parses of arbitrary human language text. This course aims to teach from the basics up to the state-of-the-art in this domain. It will begin by reviewing the phenomena that motivated statistical approaches to parsing, context-free grammars (CFGs), and probabilistic CFGs. Next it will present basic parsing algorithms, concentrating on generalized CKY and A* parsing algorithms, and discuss treebanks, their design and nature, and the methods of building and evaluating parsers based on them. The course will then turn to the well-known and successful Collins and Charniak generative parsing models of the late 1990s, and discuss issues such as smoothing, head lexicalization, engineering for efficiency, and what kinds of information parsers use and need. Finally, we will turn to discriminative methods of parsing, and discuss both parse re-ranking techniques and the direct construction of discriminative parsers.
Prerequisites
Prerequisites: Reasonable familiarity and competence with mathematical notation, probability, algorithmic thinking, and programming. (We're just going to dive into statistical parsing assuming that you already know about empirical computational linguistics, probabilities, and algorithms. If you should be taking LSA 325, I hope you are!)
Required Work
If enrolled for credit, you need to complete a project for this class! You should do one of the following:
  1. The Stanford cs224n parsing assignment. This is highly recommended if you don't have previous experience writing a statistical parser. But it's plenty of work. The project is to be completed in Java -- we provide a fair bit of starter code.
  2. Writing a parser within NLTK (in Python). One obvious hole in NLTK's coverage is that there is no dependency parser. But there are various other options, such as extending the bottom up PCFG chart parser to be an A* parser.
  3. A written problem set.
  4. Adapt an existing parser to work with another language or treebank (this may be very little or considerable work), and then to do some substantive experimentation on how differences in the annotation scheme or language affect parser performance, and what could be done to improve it, and to write some stuff up about that.
  5. Another project with the agreement of the instructor.

You should also look through the first two pages of the Stanford cs224n parsing assignment for a bit of introductory information on the computing environment at Stanford: machines you can log in to, where the Penn Treebank lives, etc.


SCHEDULE
Date
HW
Lec

Topic and Readings

Fri July 6
1.

Overview of Course, Parsing and Statistical Parsing [ppt] [pdf]

Motivation for statistical parsing, recursive phrase structure. Attachment decisions and probabilities. The Penn Treebank. Top-down, and bottom-up parsing.

Tue July 10
Assignments announced.
2.

PCFGs and the CKY algorithm [ppt] [pdf]

PCFGs. Grammar transformations. Recursive parsing and memoization. Dynamic programming for parsing: the CKY algorithm.

Fri July 13
[Black Friday!]
3.

Generalized CKY parsing and unlexicalized parsing [ppt] [pdf]

Unaries and empties. Parser evaluation. Improving the context-freedom assumptions of grammars: accurate unlexicalized parsing.

Tue July 17
4.

Search in parsing and lexicalized probabilistic parsing [ppt] [pdf]

Beam parsing. Agenda-based (chart) parsing. A* parsing. Lexicalized probabilistic context-free grammars: The Charniak (1997) model.

Fri July 20
5.

Treebanks and statistical parsing [ppt] [pdf]

Lexicalized parsing: Collins (1997/1999). The status of information in treebanks.

Tue July 24
Assignments due
6.

Multilingual parsing and dependency parsing [ppt] [pdf]

Fri July 27
7.

Discriminative parsing [ppt] [pdf]

An introduction to discriminative parsing. Features in discriminative parsers, presented using some of Mark Johnson's slides on discriminative reranking.

Musical finale.