Linguistics LNGS 3008 -- Computational Linguistics

First Semester, 1999
Department of Linguistics, University of Sydney
Christopher Manning

Lecturer: Chris Manning
Transient Building, 243B Phone: 9351-7516
Email: cmanning@mail.usyd.edu.au
Office Hours: TBA. See the sign on the door.

Lecture times: Tue 2-4 (13 weeks)
Lecture location: Transient 202. However, in later weeks, we will sometimes spend the second hour of class in a computer lab.
Credit points: 4

Overview

This course will be a general introduction to the foundations of, and selected topics in, computational linguistics, covering both standard rule-based approaches to understanding and producing human languages, and more recent approaches to practical language engineering.

Topics include: (i) Introduction and history; (ii) Parsing and generation with phrase structure grammars, tabular and chart parsing, feature grammars, grammatical topics such as gaps, movement, and semantics; (iii) Corpora and text processing including markup, regular expression searching, collocations, concordances, clustering, and corpus-based parsing and disambiguation; (iv) An introduction to problems and approaches in speech recognition and production, information retrieval and extraction, machine translation, and understanding and generating conversational natural language.
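As a small taste of the text-processing material in topic (iii), here is a sketch of a concordance (key word in context) search built on regular expressions. The function name, the sample text, and the output formatting are my own illustration, not part of any course software:

```python
import re

def concordance(text, pattern, width=20):
    """Return KWIC (key word in context) lines for a regex pattern."""
    lines = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify left context, left-justify right context,
        # so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group()}]{right:<{width}}")
    return lines

sample = "The cat sat on the mat. A black cat crossed the road."
for line in concordance(sample, r"\bcat\b"):
    print(line)
```

Real concordancing tools add sorting on left or right context, but the core operation is just this: find each match and slice out a fixed window around it.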

Aims

Computational Linguistics vs. Computer Applications in Linguistics: Clarification

Computer Applications in Linguistics is a course that teaches some of the kinds of software packages often used in linguistics, and in humanities computing more generally, for purposes such as finding key words in online texts, building dictionaries of languages (such as when doing fieldwork), and analyzing spectrograms. It is a "tools" course suitable for second year students, which teaches basic familiarity with computer tools for doing linguistic work.

Computational Linguistics is a third/fourth year course. It is devoted to understanding the kinds of algorithms that people use to get computers to understand and produce spoken and written human language: parsing algorithms, dialogue systems, and so on.

Prerequisites

Unofficial assumed prerequisites (i.e., not what you will find in the handbook, but you can get my permission if you have done them): LNGS 2002 (Syntax), AND general competence with computers of the kind that could be gained in LNGS 2007, OR COMP 2[09]03 (Languages and Logic), OR a credit grade or better in both LNGS 1001 and COMP 1[09]02.

I will expect all students to have a certain willingness and enthusiasm to spend time learning how to use various computer programs. Students are expected to be familiar with the basic concepts of phonology, morphology, and syntax, and to have basic familiarity with computers. Assignments and project work will be chosen both to suit and to extend different students' areas of competence coming into the course -- i.e., it is expected that some people will know more about linguistics and others more about computer science. You do not have to know how to program to do this course; but if you do, you will be expected to do some programming, and if you don't, you will be expected to learn some basic scripting, as well as to write grammars, lexicons, and so on.

Assessment

Assessment will be based on assignments and project work. You will be expected to do five (hopefully not too onerous) assignments. The lowest mark will be dropped, and the rest will contribute equally to the final mark. Assignments not handed in on time will be penalized unless an extension is negotiated (for medical or religious reasons, etc.). The assignments will count for 60% (15% each) and the project for 40%. The project will be chosen jointly by student and lecturer and will allow extended work on one topic. It should involve something more than just library research, but this might range from writing a program to doing some kind of corpus-based study or an evaluation of existing systems.

Advice on Assignments -- Read this section carefully!

I will try to make the assignments not too long and gruesome. However, it's important to realize that you can easily get stumped by a computer problem (whether hardware, general software, or the specific software used by this class), and if you're thrashing around trying to solve such a problem without really knowing what's happening, you can easily waste hours. Therefore, it is important to start work on each assignment early, so that if any problems develop you can ask for help early. (As well as asking me, it's fine to ask other students for help if you are having difficulty understanding how to install or use the software, or things like that.)

Also, computer systems commonly crash or freeze, floppies get lost, and so on. You should save what you are working on regularly, and always keep a backup copy, preferably in a different place. When working on programs and grammars, a common situation is that while trying to move from something that 90% works to something that completely works, you instead move to something that is completely broken. To avoid much wasted time and premature hair loss, it's really important that you save lots of half-working solutions, so that you can go back and look at them for ideas, or return to them completely later. Give them all different names ("Ass3 - Almost works" or "Ass3 - 17/9 21:18" or whatever).

Textbook

The textbook is James Allen, Natural Language Understanding (2nd edition, Benjamin/Cummings, 1995). In addition, I will make available various readings, handouts, and software guides.

For your reference, two other well-known texts on Computational Linguistics/Natural Language Processing are the following. They are more oriented toward code and programs than even Allen.

Prereading

Allen, Ch. 1. (and Ch. 2 if you don't know much linguistics!)

Syllabus

The course begins with simple formal models of language, showing how they can be used to model simple linguistic phenomena, and how tools based on these formalisms can give practical help to a linguist in searching and using text corpora. It then progresses through the richer world of context-free grammars and feature (attribute-value) grammars: methods for representing them and parsing with them, including how to handle features of natural language ranging from long-distance movement to semantic interpretation. Finally, there will be a number of lectures giving overviews of other topics.
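To give a concrete flavour of tabular (chart) parsing with context-free grammars, here is a minimal CYK recognizer over a toy grammar. The grammar, lexicon, and function name are my own illustration (the course itself uses PC-PATR, not Python), and the grammar is restricted to Chomsky normal form to keep the table-filling loop simple:

```python
from itertools import product

# Toy CFG in Chomsky normal form: every rule is A -> B C (binary)
# or a lexical entry word -> categories.
grammar = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexicon = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "chased": {"V"},
}

def cyk_recognize(words):
    """Tabular parsing: table[i][j] holds categories spanning words[i..j]."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                 # length-1 spans
        table[i][i] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # each split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= grammar.get((b, c), set())
    return "S" in table[0][n - 1]

print(cyk_recognize("the dog chased a cat".split()))
```

The point of the table is that each span is analysed only once, however many larger analyses it takes part in; that is the idea behind all the chart-parsing methods we will look at.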

  1. 2 Mar. Bureaucracy. Introduction to the field of computational linguistics. History. What does it mean to be doing computational linguistics? Why is it difficult? Here are some web resources. You're expected to play with them!
  2. 9 Mar. Formal languages. The Chomsky hierarchy. Finite state (regular) languages: their properties, and their use in describing (certain kinds of) morphology. Getting formal.
  3. 16 Mar. Phrase structure. Context-free grammars. Top-down and bottom-up parsing of context-free grammars. Starting to write context-free grammars using PC-PATR. PC-PATR has a good old-fashioned command line and text output interface. However, it has the advantage of running on all of Mac, Windows, and Unix, so I'm sticking with it for the moment.
  4. 23 Mar. Tabular and chart-parsing. From monadic to complex categories: PATR.
  5. 30 Mar. Advanced grammars: gaps, movement. Incorporating simple semantics. Structural ambiguity.
  6. 13 Apr. Techniques for encoding and resolving ambiguities. Psycholinguistic evidence. Statistical methods. Efficient Parsing.
  7. 20 Apr. From syntax to semantics (meaning). Here is my sample PATR semantic grammar. The line termination properties of PC and Mac files differ, and PC-PATR seems to like for them to be right, so here they are as Mac files: lexicon and grammar; and as PC files: lexicon and grammar. And here is the sample grammar doing the previous assignment. As Mac files: demo.grm and demo.lex, and as Windows files: demo-pc.grm and demo-pc.lex.
  8. 27 Apr. Corpus Linguistics. Collocations. Zipf's Law, word clustering, and other topics. Regular expressions.
  9. 4 May. Ambiguity resolution, especially based on corpus evidence. Word sense disambiguation.
  10. 11 May. Discourse structure and conversational agents. What is required? How primitive are the agents that exist now?
  11. 18 May. Information Retrieval and Extraction. Web search engines and web robots. How can NLP help this kind of work?
  12. 25 May. Machine translation (including parallel texts, glossing, and alignment). A magazine introduction to Machine Translation from The Atlantic Monthly. Final assignment planner: the planner with Mac line endings -- save this to the disk as source -- or here is the planner with Unix line endings. (NB: After you have been altering the grammar in SICStus Prolog, make sure you save the modified grammar! -- under the File menu.)
  13. 1 Jun. Speech recognition.
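To illustrate the corpus-linguistics material of week 8, here is a small sketch of checking Zipf's law on a toy text: if we rank word types by frequency, rank times frequency should stay roughly constant. The helper name and the sample text are my own, and real corpus work would of course use far larger texts:

```python
import re
from collections import Counter

def zipf_table(text, top=5):
    """Rank word types by frequency; under Zipf's law,
    rank * frequency is roughly constant."""
    words = re.findall(r"[a-z']+", text.lower())   # crude tokenizer
    counts = Counter(words)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat and the dog watched the rain")
for rank, word, freq, const in zipf_table(text):
    print(f"{rank:>2}  {word:<8} {freq:>3} {const:>4}")
```

On a toy text like this the "constant" is very noisy; the regularity only emerges clearly over corpora of thousands of word types, which is part of what makes corpus size matter.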

http://www.sultry.arts.usyd.edu.au/cmanning/courses/compling/
Christopher Manning -- <cmanning@mail.usyd.edu.au> -- 3 June 1999