The Stanford Parser: A statistical parser

About

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.

This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.

The main technical ideas behind how these parsers work appear in these papers:

Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.

The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Alternatively, the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either way, the result is a high-performing statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser.
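
As a rough illustration of what a probabilistic context-free grammar parser computes, the following toy sketch finds the most probable tree for a five-word sentence with Viterbi CKY. It is not the parser's own algorithm (the factored model described above uses A* search), and the grammar, the probabilities, and the class name ToyCky are all invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy Viterbi CKY over a hand-made PCFG in Chomsky normal form.
// Everything here (rules, probabilities, names) is hypothetical.
class ToyCky {
    /** A binary rule parent -> left right with its probability. */
    static final class Rule {
        final String parent, left, right;
        final double prob;
        Rule(String parent, String left, String right, double prob) {
            this.parent = parent; this.left = left; this.right = right; this.prob = prob;
        }
    }

    /** The best analysis found for one category over one span. */
    static final class Item {
        final double prob;
        final String tree;
        Item(double prob, String tree) { this.prob = prob; this.tree = tree; }
    }

    static final List<Rule> RULES = new ArrayList<>();
    // word -> category -> probability of the category producing that word
    static final Map<String, Map<String, Double>> LEXICON = new HashMap<>();

    static {
        RULES.add(new Rule("S", "NP", "VP", 1.0));
        RULES.add(new Rule("VP", "V", "NP", 0.7));
        RULES.add(new Rule("VP", "VP", "PP", 0.3));
        RULES.add(new Rule("NP", "NP", "PP", 0.4));
        RULES.add(new Rule("PP", "P", "NP", 1.0));
        lex("people", "NP", 0.2);
        lex("stars", "NP", 0.2);
        lex("telescopes", "NP", 0.2);
        lex("saw", "V", 1.0);
        lex("with", "P", 1.0);
    }

    static void lex(String word, String category, double prob) {
        LEXICON.computeIfAbsent(word, w -> new HashMap<>()).put(category, prob);
    }

    /** Returns the most probable S covering all words, or null if there is none. */
    static Item parse(String[] words) {
        int n = words.length;
        if (n == 0) return null;
        // chart[i][j] maps a category to its best item over words i .. j-1.
        @SuppressWarnings("unchecked")
        Map<String, Item>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j <= n; j++) chart[i][j] = new HashMap<>();
        }
        // Width-1 spans come straight from the lexicon.
        for (int i = 0; i < n; i++) {
            Map<String, Double> cats = LEXICON.getOrDefault(words[i], new HashMap<>());
            for (Map.Entry<String, Double> e : cats.entrySet()) {
                String tree = "(" + e.getKey() + " " + words[i] + ")";
                chart[i][i + 1].put(e.getKey(), new Item(e.getValue(), tree));
            }
        }
        // Wider spans combine two adjacent narrower spans, keeping the best score.
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len <= n; i++) {
                int j = i + len;
                for (int k = i + 1; k < j; k++) {
                    for (Rule r : RULES) {
                        Item l = chart[i][k].get(r.left);
                        Item rt = chart[k][j].get(r.right);
                        if (l == null || rt == null) continue;
                        double p = r.prob * l.prob * rt.prob;
                        Item best = chart[i][j].get(r.parent);
                        if (best == null || p > best.prob) {
                            String tree = "(" + r.parent + " " + l.tree + " " + rt.tree + ")";
                            chart[i][j].put(r.parent, new Item(p, tree));
                        }
                    }
                }
            }
        }
        return chart[0][n].get("S");
    }

    public static void main(String[] args) {
        Item best = parse("people saw stars with telescopes".split(" "));
        System.out.println(best == null ? "no parse" : best.tree);
    }
}
```

With these invented numbers the sentence has two analyses, and the one attaching "with telescopes" to "stars" wins because NP -> NP PP (0.4) outscores VP -> VP PP (0.3); its probability is about 0.00224. A real grammar is estimated from hand-parsed sentences rather than written by hand.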

As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser is included, based on:

Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003.

A German parser based on the Negra corpus and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian and Bulgarian.

The parser provides Stanford typed dependencies (or grammatical relations) output as well as phrase structure trees, invoked by using the -outputFormat typedDependenciesCollapsed option. These are produced using hand-written tregex patterns as described in:

Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.

This style of output is available only for English and Chinese.
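
Each typed dependency is printed on its own line as relation(governor-index, dependent-index), for example det(rain-3, The-1) in the sample output further down. The sketch below shows one way to read such a line back into its parts. It does not use the parser's Java API: the class name TypedDependencyLine is hypothetical, and the line format is inferred only from that sample output, so other versions or options may print something different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one line of typed dependency output into fields.
class TypedDependencyLine {
    // Assumed shape: relation(governor-index, dependent-index)
    private static final Pattern LINE =
        Pattern.compile("^(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)$");

    final String relation;
    final String governor;
    final int governorIndex;   // 1-based word position in the sentence
    final String dependent;
    final int dependentIndex;  // 1-based word position in the sentence

    private TypedDependencyLine(String relation, String governor, int governorIndex,
                                String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    /** Parses one line; throws IllegalArgumentException if it has another shape. */
    static TypedDependencyLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a typed dependency: " + line);
        }
        return new TypedDependencyLine(
            m.group(1),
            m.group(2), Integer.parseInt(m.group(3)),
            m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        TypedDependencyLine d = parse("prep_in(recorded-5, India-7)");
        System.out.println(d.relation + ": " + d.governor + "[" + d.governorIndex + "] -> "
            + d.dependent + "[" + d.dependentIndex + "]");
    }
}
```

Because the word groups are matched greedily up to the last hyphen followed by digits, words that themselves contain hyphens are kept whole.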

The current version of the parser requires Java 5 (JDK 1.5). You can also download an old version of the parser (version 1.4) that runs under JDK 1.4, but we may be reluctant to answer questions particular to it. The parser also requires plenty of memory: about 100 MB to run as a PCFG parser on sentences up to 40 words in length, and typically around 500 MB to parse similarly long, typical-of-newswire sentences using the factored model.
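
If you launch the parser from your own Java code, it can help to check the heap the JVM was given before loading a grammar. The following is a minimal sketch using only the standard Runtime API. The class name HeapCheck and the method largestModelFor are hypothetical, and the thresholds simply restate the rough figures above rather than hard limits.

```java
// Hypothetical helper: compares the JVM's maximum heap with the rough
// memory figures quoted in the text (about 100 MB PCFG, about 500 MB factored).
class HeapCheck {
    static final long PCFG_MB = 100;
    static final long FACTORED_MB = 500;

    /** Names the most demanding model that a heap of the given size should fit. */
    static String largestModelFor(long maxHeapMb) {
        if (maxHeapMb >= FACTORED_MB) return "factored";
        if (maxHeapMb >= PCFG_MB) return "pcfg";
        return "none";
    }

    public static void main(String[] args) {
        // maxMemory() reports the heap limit in bytes (set with -mx or -Xmx).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMb + " MB, enough for: "
            + largestModelFor(maxHeapMb));
    }
}
```

On some JVMs maxMemory() can report slightly less than the value passed on the command line, so treat a result just under a threshold as borderline rather than as a failure.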

The parser is available for download, licensed under the GNU GPL. (Note that this is the full GPL - which allows its use for research purposes or other free software projects but does not allow its incorporation into proprietary software, even in part or in translation; see GPL FAQ.) Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API. (Commercial licensing of the parser is also available. Please enquire.)

The download is a 60 MB gzipped tar file (mainly consisting of included grammar data files). If you unpack the tar file, you should have everything needed. Note: we use GNU tar. People seem to have had problems unpacking the file with plain old Unix tar (for reasons we don't understand). Unpacking works fine with GNU tar and with most archive programs on other platforms, such as WinZip or 7-Zip on Windows. Simple scripts are included to invoke the parser on a Unix or Windows system. On any other system, you only need to configure the classpath in the same way.

Questions about the parser?

  1. Take a look at the Javadoc lexparser package documentation and LexicalizedParser class documentation. (Point your web browser at the index.html file in the included javadoc directory and navigate to those items.)
  2. Look at the parser FAQ for answers to common questions.
  3. Please send any other questions or feedback, or extensions and bugfixes to parser-user@lists.stanford.edu.

Mailing lists

We have 3 mailing lists for the parser, each @lists.stanford.edu:

  1. parser-user This is the best list to post to in order to ask questions, make announcements, or for discussion among parser users. Join the list by emailing parser-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
  2. parser-announce This list will be used only to announce new parser versions, so it will be very low volume (expect 1-3 messages a year). Join the list by emailing parser-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
  3. parser-support This list goes only to the parser maintainers. It's a good address for licensing questions, etc. For general use and support questions, you're better off joining and using parser-user. You cannot join parser-support, but you can mail questions to parser-support@lists.stanford.edu.

Download

Download Stanford Parser version 1.6 [recommended!]

Download Stanford Parser version 1.5.1
Download Stanford Parser version 1.4 (runs under Java 1.4)

Extensions: Packages by others using the parser

Stanford parser grammatical relation browser. GUI, especially focusing on grammatical relations (typed dependencies), including an editor. By Bernard Bou.

Ruby wrapper to the Stanford Natural Language Parser. By Bill McNeill.

GATE plug-in. By Adam Funk.


Release history


Version 1.6    2007-08-18  Added Arabic, k-best PCFG parsing; improved English grammatical relations
Version 1.5.1  2006-06-11  Improved English and Chinese grammatical relations; fixed UTF-8 handling
Version 1.5    2005-07-21  Added grammatical relations output; fixed bugs introduced in 1.4
Version 1.4    2004-03-24  Made PCFG faster again (by FSA minimization); added German support
Version 1.3    2003-09-06  Made parser over twice as fast; added tokenization options
Version 1.2    2003-07-20  Halved PCFG memory usage; added support for Chinese
Version 1.1    2003-03-25  Improved parsing speed; included GUI, improved PCFG grammar
Version 1.0    2002-12-05  Initial release

Sample input and output

The parser can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. For example, consider the text:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

The following output shows part-of-speech tagged text, then a context-free phrase structure grammar representation, and finally a typed dependency representation. All of these are different views of the output of the parser.

The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP
shut/VBD down/RP the/DT financial/JJ hub/NN of/IN Mumbai/NNP ,/,
snapped/VBD communication/NN lines/NNS ,/, closed/VBD airports/NNS
and/CC forced/VBD thousands/NNS of/IN people/NNS to/TO sleep/VB in/IN
their/PRP$ offices/NNS or/CC walk/VB home/NN during/IN the/DT night/NN
,/, officials/NNS said/VBD today/NN ./. 

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)
ccomp(said-40, shut-8)
prt(shut-8, down-9)
det(hub-12, the-10)
amod(hub-12, financial-11)
dobj(shut-8, hub-12)
prep_of(hub-12, Mumbai-14)
conj_and(shut-8, snapped-16)
nn(lines-18, communication-17)
dobj(snapped-16, lines-18)
conj_and(shut-8, closed-20)
dobj(closed-20, airports-21)
conj_and(shut-8, forced-23)
dobj(forced-23, thousands-24)
prep_of(thousands-24, people-26)
aux(sleep-28, to-27)
xcomp(forced-23, sleep-28)
poss(offices-31, their-30)
prep_in(sleep-28, offices-31)
conj_or(sleep-28, walk-33)
dobj(walk-33, home-34)
det(night-37, the-36)
prep_during(walk-33, night-37)
nsubj(said-40, officials-39)
tmod(said-40, today-41)

This output was generated with the command:

java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependenciesCollapsed" englishPCFG.ser.gz mumbai.txt
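
The bracketed phrase structure output shown above is straightforward to post-process. The sketch below reads one bracketed tree into a small node structure and recovers the words at its leaves. It is independent of the parser's own classes: the names PennTreeReader and Node are hypothetical, and it assumes that literal parentheses in the text never appear unescaped inside the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for one bracketed tree such as (S (NP (NNS officials)) ...).
class PennTreeReader {
    /** A node with children is a labelled phrase; a node without children is a word. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private PennTreeReader(String text) {
        // Pad brackets with spaces so a whitespace split yields "(", ")", labels and words.
        String padded = text.replace("(", " ( ").replace(")", " ) ").trim();
        for (String t : padded.split("\\s+")) tokens.add(t);
    }

    /** Reads exactly one tree from the text. */
    static Node read(String text) {
        PennTreeReader reader = new PennTreeReader(text);
        Node root = reader.node();
        if (reader.pos != reader.tokens.size()) {
            throw new IllegalArgumentException("Trailing input after tree");
        }
        return root;
    }

    private Node node() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("Unexpected end of input");
        String t = tokens.get(pos++);
        if (!t.equals("(")) return new Node(t);       // a bare token is a leaf word
        if (pos >= tokens.size()) throw new IllegalArgumentException("Missing label");
        Node n = new Node(tokens.get(pos++));         // the token after "(" is the label
        while (pos < tokens.size() && !tokens.get(pos).equals(")")) {
            n.children.add(node());
        }
        if (pos >= tokens.size()) throw new IllegalArgumentException("Missing closing bracket");
        pos++;                                        // consume ")"
        return n;
    }

    /** Collects the leaf words from left to right. */
    static List<String> leaves(Node n) {
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }

    private static void collect(Node n, List<String> out) {
        if (n.isLeaf()) { out.add(n.label); return; }
        for (Node child : n.children) collect(child, out);
    }

    public static void main(String[] args) {
        Node tree = read("(ROOT (S (NP (NNS officials)) (VP (VBD said) (NP-TMP (NN today))) (. .)))");
        System.out.println(String.join(" ", leaves(tree)));
    }
}
```

Since whitespace is ignored between tokens, the same reader accepts the indented multi-line layout of the sample tree as well as a tree printed on a single line.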