The Unsupervised Language Learning Project Page



Overview

Humans are able to acquire linguistic knowledge in a more or less unsupervised manner. Although machines lack the contextual situation of a human learner, as well as whatever innate knowledge humans might have, much of the structure of natural language is distributionally detectable. The more linguistic structure that can be automatically learned, the less need there is for large marked-up corpora, which are costly in both time and expertise.

The primary focus is on grammar induction, which aims to find the hierarchical structure of natural language. Grammar search methods have met with little success, and simple distributional approaches that work for part-of-speech induction do not directly apply. For example, differentiating noun phrases, verb phrases, and prepositional phrases requires discovering the three clusters in the left plot -- which seems easy enough. However, deciding which sequences are units at all requires telling apart the red and blue clusters on the left -- which is much harder.

Labeling constituents is easy. Finding constituents is hard.

However, using a constituent-context model, which essentially allows distributional clustering in the presence of no-overlap constraints, we can successfully recover a substantial amount of hierarchical structure, even with just a few thousand training sentences. Our system gives the best published results for unsupervised parsing of the ATIS corpus, and, in particular, is the first system to beat right-branching structures in F1.


Induced trees are substantially better than even the supervised right-branching baseline.
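
To make the comparison against right-branching structures concrete, here is a minimal sketch of how an unlabeled bracketing F1 score can be computed against a right-branching baseline. It is an illustration only, not the project's evaluation code: the function names, the half-open span convention, and the example sentence with its gold bracketing are all assumptions made for this sketch. Published evaluations also differ on details such as whether the whole-sentence span or single-word spans are counted.

```python
from typing import Set, Tuple

# Assumption for this sketch: a bracket is a half-open token interval [start, end).
Span = Tuple[int, int]


def right_branching_spans(n: int) -> Set[Span]:
    """Brackets of a purely right-branching binary tree over n tokens.

    Every suffix of length >= 2 is a constituent: (0, n), (1, n), ..., (n-2, n).
    """
    return {(i, n) for i in range(n - 1)}


def unlabeled_f1(predicted: Set[Span], gold: Set[Span]) -> float:
    """Harmonic mean of unlabeled bracket precision and recall."""
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: "the flight leaves from Boston" with an assumed gold
# bracketing [[the flight] [leaves [from Boston]]].
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
baseline = right_branching_spans(5)

# The baseline recovers 3 of its 4 brackets and 3 of the 4 gold brackets.
print(unlabeled_f1(baseline, gold))  # 0.75
```

An induced tree would be scored the same way by passing its set of spans in place of `baseline`. Right-branching is a strong baseline for English because English phrases tend to extend to the right.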


Publications


Contact Information

Comments about the project page? Feel free to email Dan. I'm still looking for a good acronym. If anyone has any ideas on how to turn "Unsupervised Language Learning" into "PHOENIX", let me know!