Linguistics LNGS 3008 -- Computational Linguistics

First Semester, 1999
Department of Linguistics, University of Sydney
Christopher Manning

Lecturer: Chris Manning
Transient Building, 243B Phone: 9351-7516
Email: cmanning@mail.usyd.edu.au
Office Hours: TBA. See the sign on the door.

Lecture times: Tue 2-4 (13 weeks)
Lecture location: Transient 202. However, in later weeks, we will sometimes spend the second hour of class in a computer lab.
Credit points: 4

Overview

This course will be a general introduction to the foundations of, and selected topics in, computational linguistics, covering both standard rule-based approaches to understanding and producing human languages, and more recent approaches to practical language engineering.

Topics include: (i) Introduction and history; (ii) Parsing and generation with phrase structure grammars, tabular and chart parsing, feature grammars, grammatical topics such as gaps, movement, and semantics; (iii) Corpora and text processing including markup, regular expression searching, collocations, concordances, clustering, and corpus-based parsing and disambiguation; (iv) An introduction to problems and approaches in speech recognition and production, information retrieval and extraction, machine translation, and understanding and generating conversational natural language.
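As a small taste of the text-processing material in topic (iii), here is a sketch of a concordance (key word in context) search built on regular expressions. The function name, the sample text, and the output formatting are my own illustration, not part of any course software:

```python
import re

def concordance(text, pattern, width=20):
    """Return KWIC (key word in context) lines for a regex pattern."""
    lines = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify left context, left-justify right context,
        # so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group()}]{right:<{width}}")
    return lines

sample = "The cat sat on the mat. A black cat crossed the road."
for line in concordance(sample, r"\bcat\b"):
    print(line)
```

Real concordancing tools add sorting on left or right context, but the core operation is just this: find each match and slice out a fixed window around it.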

Aims

Computational Linguistics vs. Computer Applications in Linguistics: Clarification

Computer Applications in Linguistics is a course that teaches some of the kinds of software packages often used in linguistics, and in humanities computing more generally, for purposes such as finding key words in online texts, building dictionaries of languages (such as when doing fieldwork), and analyzing spectrograms. It is a "tools" course suitable for second year students, which teaches basic familiarity with computer tools for doing linguistic work.

Computational Linguistics is a third/fourth year course. It is devoted to understanding the kinds of algorithms that people use to get computers to understand and produce spoken and written human language: parsing algorithms, dialogue systems, and so on.

Prerequisites

Unofficial assumed prerequisites (i.e., not what you will find in the handbook, but you can get my permission if you have done them): LNGS 2002 (Syntax), AND general competence with computers of the kind that could be gained in LNGS 2007, OR COMP 2[09]03 (Languages and Logic), OR a credit grade or better in both LNGS 1001 and COMP 1[09]02.

I will expect all students to have a certain willingness and enthusiasm to spend time learning how to use various computer programs. Students are expected to be familiar with the basic concepts of phonology, morphology, and syntax, and to have basic familiarity with computers. Assignments and project work will be chosen both to suit and to extend different students' areas of competence coming into the course -- i.e., it is expected that some people will know more about linguistics and others more about computer science. You do not have to know how to program to do this course; but if you do, you will be expected to do some programming, and if you don't, you will be expected to learn some basic scripting, as well as to write grammars, lexicons, and so on.

Assessment

Assessment will be based on assignments and project work. You will be expected to do five (hopefully not too onerous) assignments. The lowest mark will be dropped, and the rest will contribute equally to the final mark. Assignments not handed in on time will be penalized unless an extension is negotiated (for medical or religious reasons, etc.). The assignments will count for 60% (15% each) and the project for 40%. The project will be chosen jointly by student and lecturer and will allow extended work on one topic. It should involve something more than just library research, but this might range from writing a program to doing some kind of corpus-based study or an evaluation of existing systems.

Advice on Assignments -- Read this section carefully!

I will try to make the assignments not too long and gruesome. However, it's important to realize that you can easily get stumped by a computer problem (whether hardware, general software, or the specific software used by this class), and if you're thrashing around trying to solve such a problem without really knowing what's happening, you can easily waste hours. Therefore, it is important to start work on each assignment early, so that if any problems develop you can ask for help early. (As well as asking me, it's fine to ask other students for help if you are having difficulty understanding how to install or use the software, or things like that.)

Also, computer systems commonly crash or freeze, floppies get lost, and so on. You should save what you are working on regularly, and always keep a backup copy, preferably in a different place. When working on programs and grammars, a common situation is that while trying to move from something that 90% works to something that completely works, you instead move to something that is completely broken. To avoid much wasted time and premature hair loss, it's really important that you save lots of half-working solutions, so that you can go back and look at them for ideas, or return to them completely later. Give them all different names ("Ass3 - Almost works" or "Ass3 - 17/9 21:18" or whatever).

Textbook

The textbook is James Allen, Natural Language Understanding (2nd edition, Benjamin/Cummings, 1995). In addition, I will make available various readings, handouts, and software guides.

For your reference, two other well-known texts on Computational Linguistics/Natural Language Processing are the following. They are more oriented toward code and programs than even Allen.

Prereading

Allen, Ch. 1. (and Ch. 2 if you don't know much linguistics!)

Syllabus

The course begins with simple formal models of language, showing how they can be used to model simple linguistic phenomena, and how tools based on these formalisms can give practical help to a linguist in searching and using text corpora. It then progresses through the richer world of context-free grammars and feature (attribute-value) grammars: methods for representing them and parsing with them, including how to handle features of natural language ranging from long-distance movement to semantic interpretation. Finally, there will be a number of lectures giving overviews of other topics.
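To give a concrete flavour of tabular (chart) parsing with context-free grammars, here is a minimal CYK recognizer over a toy grammar. The grammar, lexicon, and function name are my own illustration (the course itself uses PC-PATR, not Python), and the grammar is restricted to Chomsky normal form to keep the table-filling loop simple:

```python
from itertools import product

# Toy CFG in Chomsky normal form: every rule is A -> B C (binary)
# or a lexical entry word -> categories.
grammar = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexicon = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "chased": {"V"},
}

def cyk_recognize(words):
    """Tabular parsing: table[i][j] holds categories spanning words[i..j]."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                 # length-1 spans
        table[i][i] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # each split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= grammar.get((b, c), set())
    return "S" in table[0][n - 1]

print(cyk_recognize("the dog chased a cat".split()))
```

The point of the table is that each span is analysed only once, however many larger analyses it takes part in; that is the idea behind all the chart-parsing methods we will look at.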

  1. 2 Mar. Bureaucracy. Introduction to the field of computational linguistics. History. What does it mean to be doing computational linguistics? Why is it difficult? Here are some web resources. You're expected to play with them!
  2. 9 Mar. Formal languages. The Chomsky hierarchy. Finite state (regular) languages: their properties, and their use in describing (certain kinds of) morphology. Getting formal.
  3. 16 Mar. Phrase structure. Context-free grammars. Top-down and bottom-up parsing of context-free grammars. Starting to write context-free grammars using PC-PATR. PC-PATR has a good old-fashioned command line and text output interface. However, it has the advantage of running on all of Mac, Windows, and Unix, so I'm sticking with it for the moment.
  4. 23 Mar. Tabular and chart-parsing. From monadic to complex categories: PATR.
  5. 30 Mar. Advanced grammars: gaps, movement. Incorporating simple semantics. Structural ambiguity.
  6. 13 Apr. Techniques for encoding and resolving ambiguities. Psycholinguistic evidence. Statistical methods. Efficient Parsing.
  7. 20 Apr. From syntax to semantics (meaning). Here is my sample PATR semantic grammar. The line termination properties of PC and Mac files differ, and PC-PATR seems to like for them to be right, so here they are as Mac files: lexicon and grammar; and as PC files: lexicon and grammar. And here is the sample grammar doing the previous assignment. As Mac files: demo.grm and demo.lex, and as Windows files: demo-pc.grm and demo-pc.lex.
  8. 27 Apr. Corpus Linguistics. Collocations. Zipf's Law, word clustering, and other topics. Regular expressions.
  9. 4 May. Ambiguity resolution, especially based on corpus evidence. Word sense disambiguation.
  10. 11 May. Discourse structure and conversational agents. What is required? How primitive are the agents that exist now?
  11. 18 May. Information Retrieval and Extraction. Web search engines and web robots. How can NLP help this kind of work?
  12. 25 May. Machine translation (including parallel texts, glossing, and alignment). A magazine introduction to Machine Translation from The Atlantic Monthly. Final assignment planner: the planner with Mac line endings -- save this to the disk as source -- or here is the planner with Unix line endings. (NB: After you have been altering the grammar in SICStus Prolog, make sure you save the modified grammar! -- under the File menu.)
  13. 1 Jun. Speech recognition.
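To illustrate the corpus-linguistics material of week 8, here is a small sketch of checking Zipf's law on a toy text: if we rank word types by frequency, rank times frequency should stay roughly constant. The helper name and the sample text are my own, and real corpus work would of course use far larger texts:

```python
import re
from collections import Counter

def zipf_table(text, top=5):
    """Rank word types by frequency; under Zipf's law,
    rank * frequency is roughly constant."""
    words = re.findall(r"[a-z']+", text.lower())   # crude tokenizer
    counts = Counter(words)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat and the dog watched the rain")
for rank, word, freq, const in zipf_table(text):
    print(f"{rank:>2}  {word:<8} {freq:>3} {const:>4}")
```

On a toy text like this the "constant" is very noisy; the regularity only emerges clearly over corpora of thousands of word types, which is part of what makes corpus size matter.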

http://www.sultry.arts.usyd.edu.au/cmanning/courses/compling/
Christopher Manning -- <cmanning@mail.usyd.edu.au> -- 3 June 1999