Unsupervised Parsing and Grammar Induction

Overview

Unsupervised parsing is the task of inducing syntactic structure from text, producing parse trees for input sentences as well as a grammar (rules and their probabilities) that can be used to parse previously unseen data. Although parsing is used in practically every NLP system, supervised parsers are available for only a handful of languages and genres, and can be brittle out of domain. Unsupervised parsing is thus an important research direction. This kind of structure learning is a hard basic research problem: after decades of effort, accuracies are still unacceptably low, at least by the metrics used to evaluate supervised parsers. Grammar induction thus has all the markings of a classic artificial intelligence challenge: a task that nearly all humans master early on, but at which computers cannot (yet) compete. Consequently, advances in this field could facilitate progress towards strong AI and yield understanding of how computers, and possibly humans, can learn in the absence of explicit feedback.

Bootstrapping: “Baby Steps”

One way to guide a language learning process is by starting from very short sentences, whose parse trees can be easily guessed, then gradually incorporating longer inputs. The intuition here is that (ideally) sentences of length one would be mostly root verbs (e.g., Hurry! or Stop!); length two would expose basic predicate-argument structure (e.g., Sit down!, Excerpts follow: or It happens.) or adjective/determiner-noun interactions (e.g., Korean agency., The editor. or No way!); and so on, building up to more sophisticated constructions at longer lengths. The idea of scaffolding on data complexity can be traced back to J.L. Elman's (1993) seminal work on training neural networks — “starting small” — and even to B.F. Skinner's (1938) conditioning procedure called “shaping,” in which animals are trained to perform simple subtasks before moving on to complete, complex feats.
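
The length-based schedule described above can be sketched roughly as follows. This is only an illustration of the scaffolding idea, not the training procedure from the papers listed below: the names `length_curriculum` and `baby_steps`, and the `train` callback (which stands in for whatever learner is used, e.g. a few rounds of EM warm-started from the previous model), are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Iterator, List, Optional, Sequence, Tuple, TypeVar

Sentence = Sequence[str]
Model = TypeVar("Model")


def length_curriculum(
    sentences: Sequence[Sentence], max_len: int
) -> Iterator[Tuple[int, List[Sentence]]]:
    """Yield (cutoff, batch) pairs, where batch holds every sentence
    whose length is between 1 and the current cutoff (inclusive)."""
    by_len = defaultdict(list)
    for sent in sentences:
        if 1 <= len(sent) <= max_len:
            by_len[len(sent)].append(sent)
    batch: List[Sentence] = []
    for cutoff in range(1, max_len + 1):
        batch.extend(by_len.get(cutoff, []))
        yield cutoff, list(batch)


def baby_steps(
    sentences: Sequence[Sentence],
    max_len: int,
    train: Callable[[Optional[Model], List[Sentence]], Model],
    init: Optional[Model] = None,
) -> Optional[Model]:
    """Train on progressively longer inputs, passing the model from
    one stage in as the starting point for the next (hypothetical API)."""
    model = init
    for _cutoff, batch in length_curriculum(sentences, max_len):
        if batch:  # skip stages for which no sentences are available yet
            model = train(model, batch)
    return model
```

Passing the previous stage's model into `train` is what makes this a bootstrapping scheme: each stage only has to extend a grammar that already accounts for the shorter sentences.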

Divide-and-Conquer: Markup

The space of possible hidden structures of longer sentences can be constrained using various annotations that naturally occur in text. For instance, web markup strongly correlates with constituents — particularly nominals, such as the headings, entity names and titles mentioned on this page: delimited noun phrases include the bolded bigrams unsupervised parsing and grammar induction; an example of a hyper-linked verb phrase is starting small. It turns out that these signals are sufficiently accurate to be used as partial bracketings, which can in turn guide unsupervised learning of probabilistic grammars, as suggested in the influential paper of F. Pereira and Y. Schabes (1992). Other written cues — such as capitalization and punctuation — also align well with syntactic structure in many languages, simplifying the grammar induction challenge.
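
To make the notion of a partial bracketing concrete, here is a minimal sketch of the usual compatibility test: a candidate constituent is allowed only if it does not cross any bracket derived from markup. The half-open token spans, the function names and the toy sentence are assumptions made for illustration; how such constraints are actually enforced during learning is described in the papers below.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def compatible_spans(n: int, brackets: Iterable[Span]) -> List[Span]:
    """All spans over n tokens that cross none of the given brackets."""
    brackets = list(brackets)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
        if not any(crosses((i, j), b) for b in brackets)
    ]


# Toy example: markup delimits "unsupervised parsing" (tokens 2 and 3).
tokens = ["we", "study", "unsupervised", "parsing", "today"]
allowed = compatible_spans(len(tokens), [(2, 4)])
# Spans such as (1, 3), "study unsupervised", are ruled out because
# they would split the marked-up phrase.
```

Even a single bracket prunes the search space: a parser restricted to `allowed` can no longer propose any constituent that breaks up the marked-up phrase.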

A motivational excerpt from The Wall Street Journal (notice how every single punctuation mark happens to line up with constituent boundaries): [SBAR Although it probably has reduced the level of expenditures for some purchasers], [NP utilization management] — [PP like most other cost containment strategies] — [VP doesn't appear to have altered the long-term rate of increase in health-care costs], [NP the Institute of Medicine], [NP an affiliate of the National Academy of Sciences], [VP concluded after a two-year study].
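
As a rough illustration of how punctuation could be turned into candidate brackets, the sketch below splits a tokenized sentence at punctuation marks and returns the spans of the fragments in between. The punctuation inventory, the tokenization (dashes written as "--") and the function name are assumptions made for the example, and the sentence is a shortened stand-in for the excerpt above rather than a quotation.

```python
from typing import FrozenSet, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)

# Assumed punctuation inventory; a real tokenizer may use other symbols.
PUNCT: FrozenSet[str] = frozenset({",", ".", ";", ":", "!", "?", "--", "(", ")"})


def punctuation_fragments(
    tokens: Sequence[str], punct: FrozenSet[str] = PUNCT
) -> List[Span]:
    """Return the spans of maximal punctuation-free token runs."""
    spans: List[Span] = []
    start = None
    for idx, tok in enumerate(tokens):
        if tok in punct:
            if start is not None:  # close the fragment in progress
                spans.append((start, idx))
                start = None
        elif start is None:  # first word after punctuation (or at the start)
            start = idx
    if start is not None:  # sentence did not end in punctuation
        spans.append((start, len(tokens)))
    return spans


tokens = ("Although it helped , utilization management -- like most "
          "strategies -- failed , the Institute concluded .").split()
print(punctuation_fragments(tokens))
# prints [(0, 3), (4, 6), (7, 10), (11, 12), (13, 16)]
```

Fragments found this way are only hints: punctuation does not always coincide with constituent boundaries, so such spans are better treated as soft evidence than as guaranteed brackets.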

Students

Faculty

Alumni

Theses

V.I. Spitkovsky. Grammar Induction and Parsing with Dependency-and-Boundary Models. Computer Science, December 2013 [pdf, bib; slides]
D. Klein. The Unsupervised Learning of Natural Language Structure. Computer Science, March 2005 [pdf, bib]

Papers

V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction. EMNLP 2013 [pdf, bib; slides]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models. ICGI 2012 [pdf, bib; slides]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Three Dependency-and-Boundary Models for Grammar Induction. EMNLP-CoNLL 2012 [pdf, bib; poster]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Capitalization Cues Improve Dependency Grammar Induction. NAACL HLT: WILS 2012 [pdf, bib; slides]
V.I. Spitkovsky, H. Alshawi, A.X. Chang, and D. Jurafsky. Unsupervised Dependency Parsing without Gold Part-of-Speech Tags. EMNLP 2011 [pdf, bib; poster, data]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction. EMNLP 2011 [pdf, bib; poster]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Punctuation: Making a Point in Unsupervised Dependency Parsing. CoNLL-2011 [pdf, bib; slides]
V.I. Spitkovsky, H. Alshawi, D. Jurafsky, and C.D. Manning. Viterbi Training Improves Unsupervised Dependency Parsing. CoNLL-2010 [pdf, bib; slides]
V.I. Spitkovsky, D. Jurafsky, and H. Alshawi. Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing. ACL 2010 [pdf, bib; slides, data]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. From Baby Steps to Leapfrog: How “Less is More” in Unsupervised Dependency Parsing. NAACL HLT 2010 [pdf, bib; slides, data]
V.I. Spitkovsky, H. Alshawi, and D. Jurafsky. Baby Steps: How “Less is More” in Unsupervised Dependency Parsing. NIPS: GRLL 2009 [pdf, bib; poster, data]
D. Klein and C.D. Manning. Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency. ACL 2004 [ps, pdf]
D. Klein and C.D. Manning. A Generative Constituent-Context Model for Improved Grammar Induction. ACL 2002 [ps, pdf]
D. Klein and C.D. Manning. Natural Language Grammar Induction using a Constituent-Context Model. NIPS 2001 [ps, pdf]
D. Klein and C.D. Manning. Distributional Phrase Structure Induction. CoNLL-2001 [ps, pdf]

Contact Information

Please feel free to e-mail vals -at- <university> -dot- edu with any comments or questions.