At the bottom you can find a blurb on my research which I wrote when I first came to Stanford (1999). It's not totally irrelevant, but I never get around to updating it, so it's better to look at talks and papers to see what I'm actually doing....
I am interested in new students wanting to work in the area of Natural Language Processing. However, to cut down on my email load, I should add some more information:
In my Natural Language Processing research, I am interested in pursuing the following interrelated questions: How can the complex, but often insightful, models of linguistic theory be combined with statistical approaches? Knowledge is better than guessing, but at the moment, most statistical approaches use crude lexical statistics as a substitute for linguistic knowledge. My main current project is exploring linguistically sophisticated models of natural language understanding, and their applications. The centerpiece of this research is to create NLP models that have the robustness and trainability of probabilistic models, while incorporating the linguistic sophistication of more traditional linguistic models. A particularly interesting recent paper on this topic is one by Mark Johnson et al. from ACL '99. One context for this work is cooperative research with Ivan Sag, Ann Copestake, and Dan Flickinger on the LinGO project.
Another aim is to identify useful linguistic features, representations, and techniques. My work on probabilistic left-corner models (Manning and Carpenter 1997) is an example of taking an approach of (psycho-)linguistic interest and combining it with statistical techniques. The vast majority of work in statistical NLP consists of competitive evaluations of different learning methods on the same data representations. There is very little work on producing different data representations that improve performance, and this is an area that I am particularly interested in.
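To give a rough flavor of the left-corner idea: a derivation can be viewed as a sequence of shift, project, and attach decisions, each with a conditional probability, and the probability of the derivation is their product. The sketch below is a toy illustration with invented categories and numbers, not the actual model of Manning and Carpenter (1997).

    from functools import reduce

    # Hypothetical probabilities for left-corner parser decisions,
    # P(action | goal category, left-corner category). All numbers invented.
    action_probs = {
        ("shift",   "S",  None): 0.9,   # shift the next word while the goal is S
        ("project", "S",  "NP"): 0.6,   # project S -> NP VP once an NP left corner is built
        ("attach",  "VP", "VP"): 0.7,   # attach a completed VP to the waiting VP goal
    }

    def derivation_prob(decisions):
        """Probability of a derivation = product of its decision probabilities."""
        return reduce(lambda p, d: p * action_probs[d], decisions, 1.0)

    # One hypothetical derivation fragment
    print(derivation_prob([("shift", "S", None),
                           ("project", "S", "NP"),
                           ("attach", "VP", "VP")]))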
For instance, I worked in the past on acquiring categorical subcategorization frames from raw text (Manning 1993). What I would like to pursue now is probabilistic subcategorization frames, and their use in parsing. Such an approach is quite consonant with ideas that appear in linguistic papers of mine (Manning and Sag 1998, 1999; Manning, Sag and Iida 1999), where adjuncts are freely added to argument structure lists, except that probabilities are now placed on such events. A related idea that I would like to pursue is to build into the model probabilistic conditioning based not on surface-structure roles like subject and object, but on a fine-grained semantic classification of the type that has been extensively explored in lexical semantics. Most current statistical models are exclusively syntactic, working with the distributions of words and phrases. This approach would start to introduce a basic predicate-argument style semantics into the system. Not only would this provide a more fine-grained classification of arguments, but such a parser could also systematically recognize and exploit expected dependencies based on the same types of arguments appearing in different structural configurations (due to argument structure alternations).
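As a minimal sketch of what a probabilistic subcategorization lexicon might look like (the verbs, frames, and probabilities below are invented purely for illustration):

    # Hypothetical probabilistic subcategorization lexicon: each verb has a
    # distribution over argument-structure frames; adjuncts are added to the
    # list as separate probabilistic events. All numbers are invented.
    subcat = {
        "give": {("NP", "NP"): 0.55, ("NP", "PP-to"): 0.40, ("NP",): 0.05},
        "ring": {("NP",): 0.70, ("NP", "Prt-up"): 0.30},
    }
    p_add_adjunct = 0.15   # probability of extending an argument list by one adjunct

    def frame_prob(verb, frame, n_adjuncts=0):
        """Score a verb with a frame, treating each adjunct addition as an event."""
        base = subcat[verb].get(frame, 1e-6)           # tiny mass for unseen frames
        return base * (p_add_adjunct ** n_adjuncts) * (1 - p_add_adjunct)

    print(frame_prob("give", ("NP", "PP-to"), n_adjuncts=1))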
Natural language parsers by themselves are just a tool. Real people want them to do something. I'd like to work on various uses of broad-coverage parsing. One neat project is determining distinctive constructions in different world Englishes (American, British, Australian, etc.). This isn't just a matter of finding words that Americans use but Australians don't (for example), since most of the interesting stuff is more subtle than that. E.g., one would want to be able to learn that Australians say things like "I'll ring her up" while an American says "I'll give her a call", despite the fact that all these words are used frequently in both varieties of English.
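One simple way such distinctive constructions might be surfaced is to compare how often a parsed construction occurs in two corpora using a log-likelihood ratio statistic; the sketch below uses invented counts and is only meant to illustrate the idea.

    import math

    def log_likelihood(k1, n1, k2, n2):
        """Dunning's G^2 for k1-of-n1 occurrences in corpus 1 vs. k2-of-n2 in corpus 2."""
        total_k, total_n = k1 + k2, n1 + n2
        g2 = 0.0
        for k, n in ((k1, n1), (k2, n2)):
            for obs, col_total in ((k, total_k), (n - k, total_n - total_k)):
                expected = n * col_total / total_n
                if obs > 0:
                    g2 += obs * math.log(obs / expected)
        return 2 * g2

    # Hypothetical counts of "ring NP up" in 10M words of Australian vs. American text
    print(log_likelihood(k1=150, n1=10_000_000, k2=12, n2=10_000_000))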
Recently, the move to online documents in general, and the web in particular, has led to a great deal of interest in topics like "text data mining" or "the management and exploitation of semi-structured information", for example at the Stanford InfoLab (of which I am a member). People from a database background use the term "semi-structured information" as a codeword for material in human languages, e.g., English. Studies suggest that in most corporations, around 70% of information is stored in human language text, rather than in structured forms like databases: email, memoranda, reports, operating handbooks, policy and position papers, etc. There is thus an enormous need for better, more widely applicable applied NLP technology. The long history of NLP failing to do better than dumb information retrieval techniques is worth remembering, but I suspect that there are productive domains for research in places where some textual understanding is required, and so I'm hoping to employ robust parsing techniques in such domains. (One such possibility would be answering questions about current news events. The goal of something working better than Ask Jeeves seems attainable....)
I'm also very interested in the implications of statistical approaches for linguistic theory. In recent linguistics there has been a split between formal approaches (based on logical or formal language tools), which develop rich theoretical accounts but divorce themselves from the richness of real language data, and the concerns of people in sociolinguistics, language acquisition, and historical linguistics, who deal with real, messy data, but often without formal tools. Probabilistic approaches to grammar offer good prospects for providing formal tools for dealing with real language use. While the tools of statistical NLP give some ideas on how to approach these questions, the issues of concern are rather different. From a linguistic perspective, a central question is why languages are almost categorical.
Information extraction: The newest system does information extraction from real estate for sale advertisements, yielding structured database records. This system is in production use for News Interactive, an arm of News Corporation, and there is also a demo of it handling real estate for sale.
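As a toy illustration of the kind of mapping involved (the field names, patterns, and ad text below are invented; the production system is considerably more sophisticated than a handful of regular expressions):

    import re

    # A made-up real estate ad and some invented extraction patterns.
    AD = "CARLTON. Renovated 3BR terrace, $425,000. Phone 9555 1234."

    PATTERNS = {
        "suburb":   re.compile(r"^([A-Z]+)\."),
        "bedrooms": re.compile(r"(\d+)\s*BR", re.I),
        "price":    re.compile(r"\$([\d,]+)"),
        "phone":    re.compile(r"(\d{4}\s\d{4})"),
    }

    def extract(ad_text):
        """Return a dict of whatever fields could be found in the ad."""
        record = {}
        for field, pattern in PATTERNS.items():
            match = pattern.search(ad_text)
            if match:
                record[field] = match.group(1)
        return record

    print(extract(AD))   # {'suburb': 'CARLTON', 'bedrooms': '3', ...}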
Dictionary visualization: Kirrkirr is a Java environment for the visualization and exploration of dictionaries, currently in use for Warlpiri (an Australian Aboriginal language). The dictionary is stored in XML, with formatting done by custom graphics code and XSL(T). We're at present trialling it in Warlpiri schools.
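To give a flavor of the data side, here is a made-up XML entry (not the real Warlpiri dictionary markup) and a few lines showing how a tool could read it:

    import xml.etree.ElementTree as ET

    # An invented dictionary entry; the real Warlpiri markup is much richer.
    ENTRY = """
    <entry headword="maliki">
      <pos>noun</pos>
      <gloss>dog</gloss>
      <xref>warnapari</xref>
    </entry>
    """

    entry = ET.fromstring(ENTRY.strip())
    print(entry.get("headword"), "-", entry.findtext("gloss"))
    for xref in entry.findall("xref"):      # cross-references could drive a network view
        print("  linked to:", xref.text)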
Linguistic theory and linguistic typology: NLP programs can be no better than the linguistic theories that underlie them, and so I continue to be interested in linguistic theories, language typology, and possibilities for implementation. I've been particularly interested in topics including ergativity, argument structure, causatives, Austronesian languages, and serial verbs.