Unsupervised parsing is the task of inducing syntactic structure from text, producing both parse trees for input sentences and a grammar — rules and their probabilities — that can be used to parse previously unseen data. Although parsing is used in practically every NLP system, supervised parsers are limited to a handful of languages and genres, and can be brittle out-of-domain. Unsupervised parsing is thus an important research direction. This kind of structure learning is a hard basic research problem: after decades of effort, objective evaluation numbers are still unacceptably low, at least according to supervised metrics. Grammar induction thus has all the markings of a classic artificial intelligence challenge — a task that nearly all humans master early on, but at which computers cannot (yet) compete. Consequently, advances in this field could facilitate progress towards strong AI and yield insight into how computers — and possibly humans — can learn in the absence of explicit feedback.
One way to guide a language-learning process is to start from very short sentences, whose parse trees can be easily guessed, then gradually incorporate longer inputs. The intuition here is that (ideally) sentences of length one would be mostly root verbs (e.g., Hurry! or Stop!); those of length two would expose basic predicate-argument structure (e.g., Sit down!, Excerpts follow: or It happens.) or adjective/determiner-noun interactions (e.g., Korean agency., The editor. or No way!); and so on, building up to more sophisticated constructions at longer lengths. The idea of scaffolding on data complexity can be traced back to J.L. Elman's (1993) seminal work on training neural networks — starting small — and even to B.F. Skinner's (1938) conditioning procedure called shaping, in which animals are trained to perform simple subtasks before moving on to complete, complex feats.
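As a rough sketch of what such a length-based curriculum might look like in code (the names below, including the re-estimation routine, are hypothetical placeholders rather than any actual training procedure), one can simply grow the training set by raising a cap on sentence length between re-estimation stages:

```python
# A minimal sketch, assuming a generic EM-style re-estimation routine:
# `init_model` and `reestimate` are placeholders, not real library calls.
# Train on the shortest sentences first, then gradually admit longer ones,
# warm-starting each stage from the previous stage's model.

def length_stages(corpus, start=1, stop=45):
    """Yield growing training sets, capped at increasing sentence lengths."""
    for cap in range(start, stop + 1):
        yield [sentence for sentence in corpus if len(sentence) <= cap]

def train_starting_small(corpus, init_model, reestimate):
    """Run the hypothetical `reestimate` (e.g., inside-outside EM) stage by stage."""
    model = init_model
    for batch in length_stages(corpus):
        if batch:  # skip stages that add no new sentences yet
            model = reestimate(model, batch)
    return model
```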
The space of possible hidden structures of longer sentences can be constrained using various annotations that naturally occur in text. For instance, web markup strongly correlates with constituents — particularly nominals, such as the headings, entity names and titles mentioned on this page: delimited noun phrases include the bolded bigrams unsupervised parsing and grammar induction; an example of a hyper-linked verb phrase is starting small. It turns out that these signals are sufficiently accurate to be used as partial bracketings, which can in turn guide unsupervised learning of probabilistic grammars, as suggested in the influential paper of F. Pereira and Y. Schabes (1992). Other written cues — such as capitalization and punctuation — also align well with syntactic structure in many languages, simplifying the grammar induction challenge.
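To make the constraint concrete, here is a small illustrative check (hypothetical function names; half-open word-index spans) of the consistency notion behind such partial bracketings: a candidate constituent is admissible only if it does not cross any bracket derived from markup or punctuation, in the spirit of Pereira and Schabes (1992).

```python
def crosses(span, bracket):
    """True iff two half-open spans (i, j) and (k, l) overlap
    without either one containing the other."""
    (i, j), (k, l) = span, bracket
    return i < k < j < l or k < i < l < j

def admissible(span, partial_brackets):
    """A candidate constituent is kept only if it crosses no known bracket."""
    return not any(crosses(span, b) for b in partial_brackets)

# Example: given a markup-derived bracket over words 3..6, the span 1..4
# crosses it and would be pruned, while 3..5 nests inside it and is allowed.
assert not admissible((1, 4), [(3, 6)])
assert admissible((3, 5), [(3, 6)])
```

Such a filter could, for instance, be dropped into the inner loop of a chart parser or inside-outside re-estimation, pruning spans that conflict with the observed brackets while leaving the rest of the search space intact.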
A motivational excerpt from The Wall Street Journal (notice how every single punctuation mark happens to line up with constituent boundaries): [SBAR Although it probably has reduced the level of expenditures for some purchasers], [NP utilization management] — [PP like most other cost containment strategies] — [VP doesn't appear to have altered the long-term rate of increase in health-care costs], [NP the Institute of Medicine], [NP an affiliate of the National Academy of Sciences], [VP concluded after a two-year study].
Please feel free to e-mail vals -at- <university> -dot- edu with any comments or questions.