NLP Reading Group

Time: Friday 2pm.

Regular Place: Gates 200

If you would like to be on the mailing list for the seminar, contact Christopher Manning manning@cs.stanford.edu.

Present Schedule:

Friday, Apr 6, 2pm, Gates 200 Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank. Dan Klein and Christopher Manning. Available at: http://nlp.stanford.edu/~manning/papers/exhaustive-parsing.ps This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank, using the Treebank's own CFG grammar. We show how parser performance is dramatically affected by rule representation choices, and analyze the value of top-down vs. bottom-up strategies. We discuss grammatical saturation, including analysis of the strongly connected components of the phrasal nonterminals in the Treebank, and present evidence that as sentence length increases, the effective grammar size increases as regions of the grammar are unlocked, yielding super-cubic observed time behavior in some configurations.

Friday, Apr 13, 2pm, Gates 200 Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. By David Yarowsky, Grace Ngai, and Richard Wicentowski. Proceedings of Human Language Technology 2001, pp. 109-116. Presented by Chris Culy. This paper isn't available online AFAIK. Copies are available in the box on top of the file cabinet opposite Gates Room 416. The paper discusses automatically inducing monolingual POS taggers, NP identifiers, and named entity recognizers for some language, by exploiting English resources, bilingual corpora, and alignments induced from them.

Friday, Apr 20, 2pm, Gates 200 Franz Josef Och, Hermann Ney. "Improved Statistical Alignment Models". Proc. of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 440-447, Hongkong, China, October 2000. Available at: http://www-i6.informatik.rwth-aachen.de/Colleagues/och/ACL00.ps Presentation by Tolga Ilhan. In this paper, we present and compare various single-word based alignment models for statistical machine translation. We discuss the five IBM alignment models, the Hidden-Markov alignment model, smoothing techniques and various modifications. We present different methods to combine alignments. As evaluation criterion we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We show that models with a first-order dependence and a fertility model lead to significantly better results than the simple models IBM-1 or IBM-2, which are not able to go beyond zero-order dependencies.
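
For readers new to the baseline this abstract compares against, here is a minimal sketch of IBM Model 1 (the zero-order model) trained with EM, followed by the Viterbi alignment it yields. It is not the authors' code. The function names, the toy corpus, the uniform initialization, and the omission of the NULL word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)], an approximation of P(f | e), with EM.

    pairs is a list of (e_tokens, f_tokens) sentence pairs.
    Simplification (assumption): no NULL word on the e side.
    """
    f_vocab = {f for _, f_sent in pairs for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform start

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        for e_sent, f_sent in pairs:
            for f in f_sent:
                # E-step: share one unit of mass for f among the e words.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per e word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def viterbi_alignment(t, e_sent, f_sent):
    """For each f word, return the index of its most probable e word."""
    return [
        max(range(len(e_sent)), key=lambda i: t[(f, e_sent[i])])
        for f in f_sent
    ]


# Hypothetical toy corpus, for illustration only.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm1(corpus)
print(viterbi_alignment(t, ["the", "house"], ["das", "Haus"]))
```

Because each target word picks its source word independently, the model cannot capture the first-order dependence or fertility effects that the abstract credits for the better results.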

Friday, April 27, 2pm, Gates 200 Daniel Gildea and Daniel Jurafsky: "Automatic Labeling of Semantic Roles". ACL 2000, Hong Kong. Available at: http://www.icsi.berkeley.edu/~gildea/gildea-acl00.ps Presentation by John Bear. We present a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame. Various lexical and syntactic features are derived from parse trees and used to derive statistical classifiers from hand-annotated training data.

Friday, May 4, 2pm, Gates 200 Philip Resnik: "Mining the Web for Bilingual Text" ACL 1999, College Park, Maryland. Available at: http://umiacs.umd.edu/~resnik/pubs/acl99.ps.gz Presentation by Will Lewis. Abstract: STRAND (Resnik, 1998) is a language independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end-product is an automatically acquired parallel corpus comprising 2491 English-French document pairs, approximately 1.5 million words per language.

Friday, May 11, 2pm, Gates 200 C. Andersen, D. Traum, K. Purang, D. Purushothaman and D. Perlis. "Mixed Initiative Dialogue and Intelligence via Active Logic" In Proceedings of the AAAI'99 Workshop on Mixed-Initiative Intelligence. 1999. Presentation by Oliver Lemon. Available at: http://www.cs.umd.edu/projects/active/papers/mixinit.ps

Friday, May 18, 2pm, Gates 200 Chris Callison-Burch, Raymond S. Flournoy. A Program for Automatically Selecting the Best Output from Multiple Machine Translation Engines. Presentation by Chris Callison-Burch. Available at: http://nlp.stanford.edu/nlp/AmikaiMT.ps or http://nlp.stanford.edu/nlp/AmikaiMT.rtf This paper describes a program that automatically selects the best translation from a set of translations produced by multiple commercial machine translation engines. The program is simplified by assuming that the most fluent item in the set is the best translation. Fluency is determined using a trigram language model. Results are provided illustrating how well the program performs for human ranked data as compared to each of its constituent engines.
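
The selection idea in this abstract can be sketched roughly as follows: score each candidate translation with a trigram language model and keep the highest-scoring one. This is an illustrative sketch and not the authors' program. The class and function names, the add-one smoothing, the per-trigram length normalization, and the toy training sentences are all assumptions; the abstract does not say how the model was trained or smoothed.

```python
import math
from collections import Counter


def trigrams(tokens):
    """Trigrams of a sentence padded with start and end markers."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


class TrigramModel:
    """Add-one smoothed trigram model (the smoothing is an assumption)."""

    def __init__(self, sentences):
        self.tri = Counter()   # trigram counts
        self.bi = Counter()    # counts of the two-word history
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = sentence.split()
            self.vocab.update(tokens)
            for gram in trigrams(tokens):
                self.tri[gram] += 1
                self.bi[gram[:2]] += 1

    def fluency(self, sentence):
        """Average log-probability per trigram; higher is more fluent."""
        grams = trigrams(sentence.split())
        v = len(self.vocab) + 1  # one extra slot for unseen words
        total = sum(
            math.log((self.tri[g] + 1) / (self.bi[g[:2]] + v))
            for g in grams
        )
        return total / len(grams)


def select_most_fluent(model, candidates):
    """Return the candidate that the language model scores highest."""
    return max(candidates, key=model.fluency)


# Hypothetical training text and engine outputs, for illustration only.
model = TrigramModel(["the cat sat on the mat", "the dog sat on the rug"])
outputs = ["cat the mat on sat the", "the cat sat on the mat"]
print(select_most_fluent(model, outputs))
```

Dividing by the number of trigrams keeps short outputs from winning merely because they contain fewer factors; whether the original program normalized for length is not stated in the abstract.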

Friday, May 25, 2pm, Gates 200 J. Dowding, B.A. Hockey, M.J. Gawron, and C. Culy. "Practical Issues in Compiling Typed Unification Grammars for Speech Recognition". To be presented at ACL 2001. Presentation by John Dowding. Available at: http://nlp.stanford.edu/nlp/jdowding.pdf Current alternatives for language modeling are statistical techniques based on large amounts of training data, and hand-crafted context-free or finite-state grammars that are difficult to build and maintain. One way to address the problems of the grammar-based approach is to compile recognition grammars from grammars written in a more expressive formalism. While theoretically straight-forward, the compilation process can exceed memory and time bounds, and might not always result in accurate and efficient speech recognition. We will describe and evaluate two approaches to this compilation problem. We will also describe and evaluate additional techniques to reduce the structural ambiguity of the language model.

Friday, Jun 1, 2pm, Gates 200 Presentation by Ellen Campana. Zenzi Griffin and Kathryn Bock. What the Eyes Say about Speaking. Psychological Science. Available at: http://nlp.stanford.edu/nlp/griff_ps.pdf Michael Tanenhaus, Michael Spivey-Knowlton, Kathleen Eberhard and Julie Sedivy. Integration of Visual and Linguistic Information in Spoken Language Comprehension. Science, 268: 1632-1634, 1995. Available at: http://nlp.stanford.edu/nlp/science.pdf

Future possible worlds

The future is wide open! Make some suggestions!

The following papers have been suggested at some point but not scheduled. But some of them are getting a little old now. You're encouraged to send votes for or against these papers, or to suggest other papers to manning@cs.stanford.edu.

Probabilistic Parse Selection Based on Semantic Cooccurrences. Eirik Hektoen. IWPT '97.

Learning Information Extraction Rules for Semi-structured and Free Text. Stephen Soderland. Machine Learning 1999? [or in press?] http://www.cs.washington.edu/homes/soderlan/WHISK.ps

Supervised Grammar Induction using Training Data with Limited Constituent Information. Rebecca Hwa. http://xxx.lanl.gov/abs/cs.CL/9905001 (NB: this link is still one step from the actual paper!)

The past

1999

8 October: Roark and Johnson ACL '99 proceedings Efficient probabilistic top-down and left-corner parsing. http://www.cog.brown.edu/~roark/tdp.ps Led by Martin Kay

15 October: Bruce and Wiebe: Computational Linguistics 25(2), 1999. Decomposing Modeling in Natural Language Processing. (There is a draft of this at http://www.cs.nmsu.edu/~wiebe/pubs/papers/bruceWiebeCL99.ps but it's not identical to the published version) Led by Chris Manning

22 October: Narayanan and Jurafsky CogSci '98 Bayesian Models of Human Sentence Processing http://www.colorado.edu/linguistics/jurafsky/srini.ps Led by Chris Culy

29 October: An Adaptive Conversational Interface for (Destination) Advice, a talk by Cindi Thompson A paper on an earlier version of the system can be retrieved from http://www-csli.stanford.edu/~cthomp/apa-cia99.ps.gz

5 November: Human-Computer Conversation. Yorick Wilks and Roberta Catizone. http://xxx.lanl.gov/abs/cs.CL/9906027 (NB: this link is still one step from the actual paper!) Chris Culy's notes.

12 November: This is a rest day, since CSLI NLP demos are scheduled until 2:30 Friday.

19 November: The Relationship between the frequency and the processing complexity of linguistic structure Edward Gibson, Carson T. Schütze and Ariel Salomon Journal of Psycholinguistic Research 25(1), 1996. (Not available online, AFAIK.) Led by Ivan Sag.

26 November: Rest day. (Thanksgiving.)

3 December: Jan Jannink. Extracting semantic features from dictionaries and large text corpora

2000

14 January: Riloff, E. and Jones, R. Learning Dictionaries for Information Extraction Using Multi-level Boot-strapping (AAAI 1999) led by Chris Manning

21 January: Burnside, Strasberg, and Rubin. Automated Indexing of Mammography Reports Using Linear Least Squares Fit. Talk by Howard Strasberg and Daniel Rubin. NB: in room Gates 159 this week!

28 January: Dan Boley. Unsupervised Clustering: A Fast Scalable Method for Large Datasets, by D. Boley, V. Borst, Univ. of Minn. CS TR-99-029. ftp://ftp.cs.umn.edu/dept/users/boley/reports/Computer99.ps.gz. More details and some more examples can be found in another longer paper: Hierarchical Taxonomies using Divisive Partitioning, by D. L. Boley, Univ. of Minn. CS TR-98-012. ftp://ftp.cs.umn.edu/dept/users/boley/reports/taxonomy.ps.gz. Several other papers on this topic are included on the web page: ftp://ftp.cs.umn.edu/dept/users/boley/reports/Annotated.html

4 February: Francis Bond, Kentaro Ogura, & Satoru Ikehara (1995) "Possessive pronouns as determiners in Japanese-to-English machine translation." http://xxx.lanl.gov/abs/cmp-lg/9601006. [Note that this link is still one link away from the actual paper.] Discusses the problem of generating possessive pronouns in Japanese-to-English machine translation.

11 February. Chelba and Jelinek. Presented by Dan Klein.

18 Feb. Hearst: TextTiling. Presented by Raymonde Guindon.

25 Feb. Chelba and Jelinek part 2. Presented by Dan Klein.

3 March: freeloaded off SCLA/KSL Haym Hirsh talk on WHIRL for text categorization.

10 March: Howard Strasberg. What's Related? Generalizing Approaches to related articles in medicine

Fri Apr 21: Place: Gates 400 <== NOTE DIFFERENT ROOM!
Title: Learning Parse and Translation Decisions From Examples With Rich Context - Ulf Hermjakob, Raymond J. Mooney Proceedings of ACL/EACL'97 http://www.arxiv.org/abs/cmp-lg/9706002. Note that that link is still one step from the actual paper.

Wed Apr 26: Not an nlp-reading event, but Eugene Charniak will talk at the Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision on The Statistical Natural Language Processing Revolution. See: http://robotics.Stanford.EDU/ba-colloquium/spring00/abst-charniak.html

Fri Apr 28: Guido Minnen: Automatic article selection (talk on work in progress)

Fri May 5: Medical Text Processing - Explanation of an Information Flow Model Prof.Dr.med.Wolfgang Giere Center for Medical Informatics J.W.Goethe Univ Med Ctr. Frankfurt am Main

Friday, May 12 Time: 2pm. Place: Gates 200 Title: Empirical Assessment of Semantic Interpretation - Martin Romacker and Udo Hahn Proceedings of the 1st NAACL (2000) Presenter: Dan Klein

Friday, May 19 Time: 2pm. Place: Gates 200 Title: Compiling Language Models from a Linguistically Motivated Unification Grammar - Manny Rayner, Beth Ann Hockey, Frankie James, Elizabeth Owen Bratt, Sharon Goldwater, and Jean Mark Gawron To appear in Coling 2000. Presenter: John Dowding

Friday, June 23 Time: 2pm. Place: Gates 200 Title: "Bootstrapping Syntax and Recursion using Alignment-Based Learning," by M. van Zaanen. ICML 2000. Presenter: Cindi Thompson

Friday, Sep 29, 2pm, Gates 200. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. Kristina Toutanova and Christopher D. Manning. To appear in the Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong. Available online: http://nlp.stanford.edu/~manning/papers/emnlp2000.ps

Friday, Oct 6, 2pm, Gates 200. An Improved Parser for Data-Oriented Lexical-Functional Analysis by Rens Bod. Discusses the evaluation of various variants of the data-oriented parsing model applied to lexical functional grammar, and proposes new better models than previously existing ones. Proceedings of ACL 2000. Available online: http://nlp.stanford.edu/nlp/bod-acl00.ps

Friday, Oct 13, 2pm, Gates 200. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. Proposes a Bayesian model for unsupervised learning of verb selectional preferences, and compares it with earlier approaches. By Massimiliano Ciaramita and Mark Johnson. Proceedings of Coling 2000. Discussion led by Dan Klein. Available online (this link is still one step from the paper): http://xxx.lanl.gov/abs/cs.CL/0008020

Friday, Oct 20, 2pm, Gates 200. "Unsupervised Discovery of Scenario-Level Patterns for Information Extraction" by Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Discussion led by John Bear. It appears on p. 282 in the proceedings of the NAACL/ANLP 2000 conference that took place in Seattle. As far as we know, it isn't available online. Starting tomorrow, there should be copies in the box on the filing cabinet opposite Gates room 416.

Friday, Oct 27, 2pm, Gates 498. Tree-gram Parsing: Lexical Dependencies and Structural Relations. Kahlil Sima'an. ACL 2000. Argues for the complementarity of lexical dependencies and structural relations in parsing, and focusses on the latter. Discussion led by Tolga Ilhan. Available online: http://nlp.stanford.edu/nlp/Simaan.pdf

Friday, Nov 3: none (Chris away)

Friday, Nov 10: none (CSLI IAP meeting)

Friday, Nov 17, 2pm, Gates 200 Discussion led by John Dowding. I will talk about the recent AAAI Fall Symposium on Building Dialogue Systems for Tutorial Applications. I will give some background and context setting, and give an overview of the systems described at the symposium. Then, I want to look at the language analysis component of one system, AutoTutor, developed at the University of Memphis and Rhodes College, and their decision to use Latent Semantic Analysis (LSA). A description and evaluation of LSA in their system is given at: http://mnemosyne.csl.psyc.memphis.edu/home/graesser/PSOTKA.htm The HTML formatting for that document is a bit off, but it prints out OK for me. For background on LSA, see: http://www2.hu-berlin.de/linguistik/institut/syntax/mind/landauer.htm

Friday, Nov 24: none (Thanksgiving)

Friday, Dec 1: Mark Seidenberg and Maryellen MacDonald: A Probabilistic Constraints Approach to Language Acquisition and Processing Discussion led by Tom Wasow.

2001

Friday, Jan 12, 2pm, Gates 200. Including Biological Literature Improves Homology Search by Jeffrey T. Chang, Soumya Raychaudhuri, and Russ B. Altman. Presented by Jeff Chang. Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision. Available online: http://www.smi.stanford.edu/projects/helix/psb01/chang.pdf

Friday, Jan 19, 2pm, Gates 200. The Semantics of temporal IN. A presentation by David Bree. This paper presents a method for extracting the temporal information from IN phrases. IN is the preposition most frequently used to convey temporal information and so a solution to its semantics will be a good guide to the semantics of other prepositions that are used temporally. The method takes the form of a set of rules and heuristics. It is derived from an analysis of a one-million word corpus of English texts.

Friday, Jan 26, 2pm, Gates 200 Paper presentation by Roger Levy. Dekai Wu, Hongsing Wong: Machine Translation with a Stochastic Grammatical Channel. COLING-ACL 1998: 1408-1415

Friday, Feb 2, 2pm, Gates 200 Paper presentation by Bob Moore (Microsoft Research). Eugene Charniak: A Maximum-Entropy-Inspired Parser. Proceedings of NAACL-2000. Presents the latest and best results in broad-coverage statistical parsing, in particular using a new "maximum-entropy inspired" estimation method. It's available online at http://nlp.stanford.edu/nlp/shortMeP.ps or from Eugene Charniak's home page: http://www.cs.brown.edu/people/ec/

Friday, Feb 9, 2pm, Gates 200 Anne Bracy. Presenting: "Dialogue modelling and management in a multi-modal robot interface" by Oliver Lemon, Alexander Gruenstein, Anne Bracy, Stanley Peters. Abstract: We explain the dialogue modelling and management techniques used in a multi-modal interface for conversations with autonomous agents. This setting presents important challenges for dialogue system engineers. We describe some general problems, and their solution in a conversational interface to a robot helicopter (or UAV). Its main innovation is a dialogue manager which implements a dynamic information state model of dialogue. Available at: http://www-csli.stanford.edu/~lemon/diasys.ps.gz.

Friday, Feb 16: none (AAAS meetings)

Friday, Feb 23: none (lack of volunteers)

Friday, Mar 2, 2pm, Gates 200 Annotating the CallHome Japanese speech corpus. John Fry, Stanford Linguistics Dept. The CallHome Japanese (CHJ) corpus consists of transcripts and digitized speech data for 120 spontaneous telephone conversations between native speakers of Japanese (LDC 1996). The CHJ corpus was originally collected for research on large-vocabulary speech recognition. For the last two years, however, my colleagues and I have been annotating the CHJ transcripts with lexical, prosodic, and semantic information in order to make the corpus a useful resource for other areas of linguistic research such as ellipsis, topicalization, and speaker disfluencies. Our annotations include: (i) the pronunciation and POS of each word in the corpus; (ii) acoustic and prosodic data like duration, fundamental frequency, and energy; (iii) the semantic properties (e.g., human, inanimate, etc.) of 46,000 nouns; and (iv) verb senses and basic predicate-argument structure for 24,000 verbs. Some of these annotations could be generated automatically, while others relied on the judgments of native speakers. After describing how the annotations were (or in some cases, are still being) carried out, I will discuss potential research applications for the annotated corpus.

Friday, Mar 9, 2pm, Gates 200 Paper presentation by Kathryn Campbell-Kibler: Cosmin Popovici and Paolo Baggia: Language Modelling for Task-Oriented Domains. Proceedings of EUROSPEECH'97, Rhodes, Greece, vol. 3, pp. 1459-1462. This paper is focused on the language modelling for task-oriented domains and presents an accurate analysis of the utterances acquired by the Dialogos spoken dialogue system. Dialogos allows access to the Italian Railways timetable by using the telephone over the public network. The language modelling aspects of specificity and behaviour to rare events are studied. A technique for getting a language model more robust, based on sentences generated by grammars, is presented. Experimental results show the benefit of the proposed technique. The increment of performance between language models created using grammars and usual ones, is higher when the amount of training material is limited. Therefore this technique can give an advantage especially for the development of language models in a new domain. Available at: http://xxx.lanl.gov/abs/cmp-lg/9711007 [this link is still one step from the actual paper]

Friday, Mar 16, 2pm, Gates 200 Challenges in Adapting an Interlingua for Bidirectional English-Italian Translation. Proceedings of AMTA 2000. Violetta Cavalli-Sforza (San Francisco State University) Krzysztof Czuba, Teruko Mitamura, Eric Nyberg (Language Technologies Institute, Carnegie Mellon University). Paper presented by Violetta Cavalli-Sforza. We describe our experience in adapting an existing high-quality, interlingual, unidirectional machine translation system to a new domain and bidirectional translation for a new language pair (English and Italian). We focus on the interlingua design changes which were necessary to achieve high quality output in view of the language mismatches between English and Italian. The representation we propose contains features that are interpreted differently, depending on the translation direction. This decision simplified the process of creating the interlingua for individual sentences, and allows the system to defer mapping of language-specific features (such as tense and aspect), which are realized when the target syntactic feature structure is created. We also describe a set of problems we encountered in translating modal verbs, and discuss the representation of modality in our interlingua. Paper is available at: http://www.lti.cs.cmu.edu/Research/Kant/. This link is still one step from the paper: It's paper number 27.

Christopher Manning -- <manning@cs.stanford.edu>
Last modified: Thu May 31 13:10:16 PDT 2001