The Unsupervised Language Learning Project Page
Overview
Humans are able to acquire
linguistic knowledge in a more or less unsupervised manner. Although
machines lack the contextual situation of a human learner, as well as
whatever innate knowledge humans might have, much of the structure of
natural language is distributionally detectable. The more linguistic
structure that can be automatically learned, the less need there is
for large marked-up corpora, which are costly in both time and
expertise.
The primary focus is on grammar induction, which aims to find the
hierarchical structure of natural language. Grammar search methods
have met with little success, and simple distributional approaches
that work for part-of-speech induction do not directly apply. For
example, differentiating noun phrases, verb phrases, and prepositional
phrases requires discovering the three clusters in the left plot --
which seems easy enough. However, deciding which sequences are units
at all requires telling apart the red and blue clusters on the right --
which is much harder.
[Figure: two cluster plots. Left: "Labeling constituents is easy." Right: "Finding constituents is hard."]
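To make "distributionally detectable" a little more concrete, here is a minimal sketch of the kind of evidence involved: for every contiguous span of a part-of-speech sequence, record its yield (the tags inside the span) and its context (the tags immediately outside it). The function names, the boundary symbol, and the toy sentences are assumptions made for illustration; this is not the code or the exact feature set used in the papers below.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical sentence-boundary symbol (an assumption)


def span_signatures(tags):
    """Generate (yield, context) pairs for every contiguous span of tags.

    The yield is the tuple of tags inside the span; the context is the
    pair of tags just to the left and just to the right of it.
    """
    n = len(tags)
    padded = [BOUNDARY] + list(tags) + [BOUNDARY]
    for i in range(n):
        for j in range(i + 1, n + 1):
            span_yield = tuple(tags[i:j])
            # tags[i - 1] is padded[i]; tags[j] is padded[j + 1]
            context = (padded[i], padded[j + 1])
            yield span_yield, context


def context_counts(corpus):
    """Map each yield to a Counter over the contexts it occurs in."""
    counts = defaultdict(Counter)
    for tags in corpus:
        for span_yield, context in span_signatures(tags):
            counts[span_yield][context] += 1
    return counts


# Toy tagged sentences (made up for this example).
corpus = [
    ["DT", "NN", "VBD", "DT", "NN"],
    ["NN", "VBD"],
]
counts = context_counts(corpus)
dt_nn_contexts = counts[("DT", "NN")]
```

Counts like these are enough to cluster spans that are already known to be constituents. The hard part is that the same counts are collected for every span, and most spans are not constituents at all.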
However, using a constituent-context model, which essentially allows
distributional clustering in the presence of no-overlap constraints,
we can successfully recover a substantial amount of hierarchical
structure, even with just a few thousand training sentences. Our system gives the best published results for unsupervised parsing of the ATIS corpus, and, in particular, is the first system to beat right-branching structures in F1.

Induced trees are substantially better than even the supervised right-branching baseline.
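For readers unfamiliar with the baseline, the following sketch shows what a right-branching structure looks like as a set of brackets, and how an unlabeled bracket F1 can be computed against a gold tree. The span convention (half-open word indices, multi-word spans only, whole-sentence span included) and the toy gold tree are assumptions for illustration; the evaluation conventions in the papers may differ in detail.

```python
def right_branching_brackets(n):
    """Multi-word spans (i, j) of a fully right-branching tree over n words.

    Spans are half-open word-index ranges, so (0, n) is the whole sentence.
    """
    return {(i, n) for i in range(n - 1)}


def unlabeled_f1(predicted, gold):
    """Harmonic mean of bracket precision and recall, ignoring labels."""
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy gold bracketing for a five-word sentence such as
# "the dog chased the cat": [[the dog] [chased [the cat]]]
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
baseline = right_branching_brackets(5)
score = unlabeled_f1(baseline, gold)
```

In this toy case the baseline recovers three of the four gold brackets, which illustrates why right-branching is a hard baseline to beat for English: it needs no training data and still matches much of the verb-phrase structure.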
Publications
- Dan Klein and Christopher D. Manning, "A Generative
Constituent-Context Model for Improved Grammar Induction",
Proceedings of the 40th Annual Meeting of the ACL, 2002.
[ps] [pdf] [bib]
- Dan Klein and Christopher D. Manning, "Natural Language Grammar
Induction Using a Constituent-Context Model", Advances in Neural
Information Processing Systems 14 (NIPS-2001), 2001.
[ps] [pdf] [bib]
- Dan Klein and Christopher D. Manning, "Distributional Phrase
Structure Induction", Proceedings of the Fifth Conference on
Natural Language Learning (CoNLL-2001), 2001.
[ps] [pdf] [bib]
Contact Information
Comments about the project page? Feel free to email Dan. I'm still looking for a good acronym. If anyone has any ideas on how to turn "Unsupervised Language Learning" into "PHOENIX", let me know!