Stanford Parser FAQ

Questions

  1. How can I unpack the gzipped tar file?
  2. Is there technical documentation for the parser?
  3. What is the inventory of tags, phrasal categories, and typed dependencies in your parser?
  4. Can I train the parser?
  5. How do I force the parser to use my sentence delimitations?
  6. How can I provide the correct tokenization of my sentence to the parser?
  7. Can I give the parser part-of-speech (POS) tagged input and force the parser to use those tags?
  8. Is it possible to pre-annotate the corpus with phrasal boundaries and labels which the parser has to use?
  9. Can I obtain multiple parse trees for a single input sentence?
  10. I don't [understand/like/agree with] the parse tree that is assigned to my sentence. Can you [explain/fix] it?
  11. Why does the parser accept incorrect/ungrammatical sentences?
  12. How much memory do I need to parse long sentences?
  13. What does an UnsupportedClassVersionError mean?
  14. How can I obtain just the results of the POS tagger for each word in a sentence?
  15. Can I just get your typed dependencies (grammatical relations) output from the trees produced by another parser?
  16. How can something be the subject of another thing when neither is a verb?
  17. Can I just use your tokenizers for other purposes?
  18. How can I parse my gigabytes of text more quickly?
  19. Can you give me some help in getting started parsing Chinese?
  20. Can you give me some help in getting started parsing Arabic?
  21. Can I just use the parser as a vanilla PCFG parser?
  22. Can you give me complete documentation of command-line options/public APIs/included grammars/...?
  23. What output formats can I get with the -outputFormat and -outputFormatOptions options?
  24. Can I have the parser run as a filter (that is, parse stuff typed in)?

Please send any other questions or feedback, or extensions and bugfixes to parser-support@lists.stanford.edu.


Questions with answers

  1. How can I unpack the gzipped tar file?

    On Unix, try using GNU tar, if you're not already. (If you're using Linux, you're almost certainly using GNU tar.) For some reason we don't understand, it doesn't seem to unpack with classic Unix tar. Make sure you specify the -z option if you are not gunzipping it in advance: tar -xzf filename.

    On Windows, it unpacks fine with most common tools, such as WinZip or 7-Zip. The latter is open source. (As of Sep 2007, WinRAR doesn't work: it apparently does not handle tar files correctly.)

    On the Mac, just double-click it to unpack. The default unarchiver (BOMArchiveHelper) works fine. To make it easier to run the parser from the GUI by double-clicking, you should rename lexparser-gui.csh to lexparser-gui.command.

    If it won't unpack, you normally have either a corrupted download (try downloading it again) or there is some configuration error on your system, which we can't help with. The download should be 60,783,166 bytes with an MD5 checksum of 6a4929a2d4e93697ea9d688ec63e3d6a (for version 1.6).
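
    If you would rather check the download from Java than with a command-line tool, the size and MD5 comparison described above can be sketched as follows. This is only an illustration using the standard java.security.MessageDigest class; the class name DownloadCheck and its method names are made up for this example and are not part of the parser distribution.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not shipped with the parser.
class DownloadCheck {
  private static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required of every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  /** Returns the MD5 digest of the given bytes as lowercase hex. */
  static String md5Hex(byte[] data) {
    return toHex(newMd5().digest(data));
  }

  /** Streams a file through MD5 so a large archive need not fit in memory. */
  static String md5HexOfFile(Path file) throws IOException {
    MessageDigest md = newMd5();
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    return toHex(md.digest());
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    Path file = Paths.get(args[0]);  // path to the downloaded archive
    String expected = args[1];       // checksum published for your version
    System.out.println("size: " + Files.size(file) + " bytes");
    boolean ok = md5HexOfFile(file).equalsIgnoreCase(expected);
    System.out.println(ok ? "MD5 matches" : "MD5 MISMATCH: try downloading again");
  }
}
```

    Run it with the archive path and the expected checksum as the two arguments.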

  2. Is there technical documentation for the parser?

    There is considerable Javadoc documentation included in the javadoc/ directory of the distribution. You should start by looking at the javadoc for the parser.lexparser package and the LexicalizedParser class.

    (The documentation appearing on the nlp.stanford.edu website refers to code under development and is not necessarily consistent with the released version of the parser.) If you're interested in the theory and algorithms behind how the parser works, look at the research papers listed.

  3. What is the inventory of tags, phrasal categories, and typed dependencies in your parser?

    For part-of-speech tags and phrasal categories, this depends on the language and treebank on which the parser was trained (and was decided by the treebank producers not us). The parser can be used for English, Chinese, Arabic, or German (among other languages). For part of speech and phrasal categories, here are relevant links:

    Please read the documentation for each of these corpora to learn about their tagsets and phrasal categories. You can often also find additional documentation resources by doing web searches.

    The typed dependency (grammatical relations) output available for English and Chinese was defined by us. For English, there is an introduction in the paper:

    Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.

    Further information (definitions and examples of nearly all the grammatical relations) appears in the included Javadoc documentation. Look at the EnglishGrammaticalRelations and ChineseGrammaticalRelations classes. (To do this, with a web browser Open File on the index.html file in the javadoc folder of the parser distribution, and then click on the given class names in the bottom-left scroll list.) At some point we may produce better user-level documentation of these relations, but this is what is available currently.

    A corpus of English biomedical texts, with hand-corrected annotations in a slight variant of the Stanford typed dependency format, is available from The BioInfer project.

  4. Can I train the parser?

    Yes, you can train a parser. You will need a collection of syntactically annotated data such as the Penn Treebank to train the parser. If your trees are not in the same format as the currently supported treebanks, you may need to write classes to read in the trees, etc. Read the Javadocs for the main method of the LexicalizedParser class, particularly the -train option, to find out about the command options for training parsers. The supplied file makeSerialized.csh shows exactly what options we used to train the parsers that are included in the distribution. If you want to train the parser on a new language and/or treebank format, you can (and people have done so), but you need to spend a while learning about the code, especially if you wish to develop language-specific features. Start by trying to train a plain PCFG on the data, and then look at the TreebankLangParserParams class for how to do language-specific processing.

  5. How do I force the parser to use my sentence delimitations? I want to give the parser a list of sentences, one per line, to parse.

    Use the -sentences option. If you want to give the parser one sentence per line, include the option -sentences newline in your invocation of LexicalizedParser.

  6. How can I provide the correct tokenization of my sentence to the parser?

    From the commandline, if you give the option -tokenized, then the parser will assume white-space separated tokens, and use your tokenization as is. Of course, parsing will suffer unless your tokenization accurately matches the tokenization of the underlying treebank, for instance Penn Treebank tokenization. A common occurrence is that your text is already correctly tokenized but does not escape characters the way the Penn Treebank does (turning parentheses into -LRB- and -RRB-, and putting a backslash in front of forward slashes and asterisks - presumably a holdover from Lisp). In this case, you can use the -tokenized option but also add the flag:

    -escaper edu.stanford.nlp.process.PTBEscapingProcessor

    If calling the parser within your own program, the main parse methods take a List of words, which should already be correctly tokenized and escaped before calling the parser. You don't need to, and cannot, give the -tokenized option. If you have untokenized text, it needs to be tokenized before parsing. You may use the parse method that takes a String argument to have this done for you, or you may be able to make use of classes in the process package, such as DocumentPreprocessor and PTBTokenizer, for tokenization, much as the main method of the parser does. Or you may want to use your own tokenizer.
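
    To make the escaping conventions mentioned above concrete, here is a minimal sketch that applies only the rules named in this answer: parentheses become -LRB- and -RRB-, and a backslash is put in front of forward slashes and asterisks. It is an illustration written for this FAQ, not the code of PTBEscapingProcessor, which may handle more cases; the class name PtbEscapeSketch is hypothetical, and it assumes that parentheses arrive as separate tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the parser's own escaper may cover further cases.
class PtbEscapeSketch {
  /** Escapes one already-tokenized word using the conventions described above. */
  static String escapeToken(String token) {
    if (token.equals("(")) return "-LRB-";
    if (token.equals(")")) return "-RRB-";
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      // Leave the character alone if the input already carries the backslash.
      boolean alreadyEscaped = i > 0 && token.charAt(i - 1) == '\\';
      if ((c == '/' || c == '*') && !alreadyEscaped) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  /** Escapes every token of a sentence, returning a new list. */
  static List<String> escapeAll(List<String> tokens) {
    List<String> out = new ArrayList<String>();
    for (String t : tokens) {
      out.add(escapeToken(t));
    }
    return out;
  }
}
```

    A list produced this way could then be handed to a parse method in place of the raw tokens.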

  7. Can I give the parser part-of-speech (POS) tagged input and force the parser to use those tags?

    Yes, you can. However, you will need to provide correctly tokenized input if you want to provide POS-annotated input. (That is, the input must be tokenized and normalized exactly as the material in the treebank underlying the grammar is.)

    Read the Javadocs for the main method of the LexicalizedParser class. The relevant options are -sentences (see above), -tokenized, and -tagSeparator. If, for example, you want to denote a POS tag by the suffix /POS on a word, you would include the options -tokenized -tagSeparator / in your invocation of LexicalizedParser. You could then give the parser input such as

    The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN ./.

    Partially-tagged input (only indicating the POS of some words) is also OK.

    If you wish to work with POS-tagged text programmatically, then things are different. You pass to the parse method a List (Sentence). If the items in this list implement HasTag, such as being of type TaggedWord, then the parser will use the tags that you provide. You can use the DocumentPreprocessor class, as the main method does, to produce these lists, or you could use WhitespaceTokenizer followed by WordToTaggedWordProcessor, or you can do this with code that you write.
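
    If you write your own code for this, the first step is splitting each word/TAG token into its two parts. The sketch below shows one way to do that with plain Java strings. It splits on the last occurrence of the separator, which is our choice for the example rather than a statement of how the parser's own reader behaves, and TaggedTokenSketch is a hypothetical name. You would still need to wrap the results in objects that implement HasTag, such as TaggedWord.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the parser.
class TaggedTokenSketch {
  /**
   * Splits "word/TAG" into {word, tag}. The tag is null when the token
   * has no usable separator, which covers partially tagged input.
   */
  static String[] split(String token, char separator) {
    int idx = token.lastIndexOf(separator);
    if (idx <= 0 || idx == token.length() - 1) {
      return new String[] { token, null };
    }
    return new String[] { token.substring(0, idx), token.substring(idx + 1) };
  }

  /** Splits a whitespace-tokenized line into {word, tag} pairs. */
  static List<String[]> splitLine(String line, char separator) {
    List<String[]> pairs = new ArrayList<String[]>();
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return pairs;
    }
    for (String token : trimmed.split("\\s+")) {
      pairs.add(split(token, separator));
    }
    return pairs;
  }
}
```

    Splitting on the last separator means that a token such as ./. comes out as the word . with the tag ., and an escaped slash inside a word is left in the word part.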

  8. Is it possible to pre-annotate the corpus with phrasal boundaries and labels which the parser has to use?

    Not yet, but in the future, very possibly.

  9. Can I obtain multiple parse trees for a single input sentence?

    Yes, for the PCFG parser (only). With a PCFG parser, you can give the option -printPCFGkBest n and it will print the n highest-scoring parses for a sentence. They can be printed either as phrase structure trees or as typed dependencies in the usual way via the -outputFormat option, and each receives a score (log probability). The k best parses are extracted efficiently by using the algorithm of Huang and Chiang (2005).

  10. I don't [understand/like/agree with] the parse tree that is assigned to my sentence. Can you [explain/fix] it?

    This may be because the parser chose an incorrect structure for your sentence, or because the phrase structure annotation conventions used for training the parser don't match your expectations. To make sure you understand the annotation conventions, please read the bracketing guidelines for the parser model that you're using, which are referenced above. While our goal is to improve the parser when we can, we can't fix individual examples. The parser is just choosing the highest probability analysis according to its grammar.

  11. Why does the parser accept incorrect/ungrammatical sentences?

    This parser is in the space of modern statistical parsers whose goal is to give the most likely sentence analysis to a list of words. It does not attempt to determine grammaticality, though it will normally prefer a "grammatical" parse for a sentence if one exists. This is appropriate in many circumstances, such as when wanting to interpret user input, or dealing with conversational speech, web pages, non-native speakers, etc.

    For other applications, such as grammar checking, this is less appropriate. One could attempt to assess grammaticality by looking at the probabilities that the parser returns for sentences, but it is difficult to normalize this number to give a useful "grammaticality" score, since the probability strongly depends on other factors like the length of the sentence, the rarity of the words in the sentence, and whether word dependencies in the sentence being tested were seen in the training data or not.

  12. How much memory do I need to parse long sentences?

    The parser uses considerable amounts of memory. If you see a java.lang.OutOfMemoryError, you either need to give the parser more memory or to take steps to reduce the memory needed. (You give java more memory at the command line by using the -mx flag, for example -mx500m.)
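
    If you are unsure whether your memory setting actually reached the JVM (for instance when the parser is started from a wrapper script or from inside another program), a quick check is to print the maximum heap size. The sketch below uses only the standard Runtime.maxMemory() call; the class name HeapCheck is made up for this example.

```java
// Hypothetical helper for illustration; not part of the parser.
class HeapCheck {
  /** Maximum heap this JVM will attempt to use, in megabytes. */
  static long maxHeapMegabytes() {
    return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
  }

  public static void main(String[] args) {
    System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
  }
}
```

    Run it with the same memory flag you give the parser, for example java -mx500m HeapCheck. The number reported can differ somewhat from the value you asked for, depending on the JVM.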

    Memory usage by the parser depends on a number of factors:

    Below are some statistics for 32-bit operation with the supplied englishPCFG and englishFactored grammars. We have parsed sentences as long as 234 words, but you need lots of RAM and patience.

    Length (words)   PCFG     Factored
    20               50 MB    250 MB
    50               125 MB   600 MB
    100              350 MB   2100 MB

  13. What does an UnsupportedClassVersionError mean?

    If you see the error:

    Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/parser/lexparser/LexicalizedParser (Unsupported major.minor version 49.0)

    This means that you don't have JDK 1.5 installed. You should upgrade at java.sun.com.
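
    The number in the message is the class file format version: 49.0 is the format written by Java 5 (JDK 1.5) compilers. If you want to see which version a given .class file needs, the sketch below reads it from the file header using only standard java.io classes. The class name ClassVersionSketch is hypothetical; it expects the path to a .class file that has been extracted from the jar.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of the parser.
class ClassVersionSketch {
  /** Reads the major version from a class file header (49 means Java 5). */
  static int majorVersion(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    if (data.readInt() != 0xCAFEBABE) {
      throw new IOException("not a Java class file");
    }
    data.readUnsignedShort();         // minor version, ignored here
    return data.readUnsignedShort();  // major version
  }

  /** Convenience overload for bytes already in memory. */
  static int majorVersion(byte[] header) {
    try {
      return majorVersion(new ByteArrayInputStream(header));
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new FileInputStream(args[0]);
    try {
      System.out.println("class file major version: " + majorVersion(in));
    } finally {
      in.close();
    }
    System.out.println("running on Java " + System.getProperty("java.version"));
  }
}
```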

  14. How can I obtain just the results of the POS tagger for each word in a sentence?

    You can use the -outputFormat wordsAndTags option. Note: if you want to tag a lot of text, it'd be much faster to use a dedicated POS tagger (such as ours or someone else's), since this option has the parser parse the sentences and just not print the other information. There isn't a separate included tagger; the parser does POS tagging as part of parsing.

  15. Can I just get your typed dependencies (grammatical relations) output from the trees produced by another parser?

    Yes, you can. You can use the main method of EnglishGrammaticalStructure (for English, or the corresponding class for Chinese). You can give it options like -treeFile to read in trees, and, say, -collapsed to output typedDependenciesCollapsed. For example, this command (with appropriate paths) will convert a Penn Treebank file to uncollapsed typed dependencies:

    java -cp stanford-parser-2007-08-19/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile wsj/02/wsj_0201.mrg -basic

    Also, here is a sample Java class that you can download that converts from an input file of trees to typed dependencies.

    Fine print: There is one subtlety. The conversion code generally expects Penn Treebank style trees which have been stripped of functional tags and empty elements. This generally corresponds to the output of the Stanford, Charniak or Collins/Bikel parsers. The exception is that it gets value from the -TMP annotation on bare temporal NPs in order to recognize them as having temporal function (tmod). (It also allows a -ADV annotation on NPs.) Without the temporal annotation, some simple temporals like today will still be recognized, but a bare temporal like last week in I left last week will be tagged as an object (dobj). With the Stanford parser, you can get marking of temporal NPs in the tree output by giving the option -retainTmpSubcategories, either on the command line or by passing it to the setOptionFlags(String[]) method of the parser.

  16. How can something be the subject of another thing when neither is a verb? I tried the sentence Jill is a teacher and the parser created an nsubj dependency between teacher and Jill. Is that a mistake or have I not understood what nsubj is?

    This is an element of the dependency analysis we adopted. It's not uncontroversial, and it could have been done differently, but we'll try to explain briefly why we did things the way we did. The general philosophy of the grammatical relations design is that main predicates should be heads and auxiliaries should not. So, for the sentence Jill is singing, you will see nsubj(singing, Jill). We feel that this is more useful for most semantic interpretation applications, because it directly connects the main predicate with its arguments, while the auxiliary is rendered as modifying the verb (aux(singing, is)). Most people seem to agree.

    What then when the main predicate is an adjective or a noun? That is, sentences like Jill is busy or Jill is a teacher. We continue to regard the adjective or noun as the predicate of which the subject is the argument, rather than changing and now regarding the copular verb is as the head and busy/teacher as a complement. That is, we produce nsubj(busy, Jill) and nsubj(teacher, Jill). This frequently seems to confuse people, because the main predicate of the clause is now not a verb. But we believe that this is the best thing to do for several reasons:

    1. Consistency of treatment of auxiliary/copula between English periphrastic verb forms and adjectival/nominal predications.
    2. Crosslinguistic generalization of the grammatical relations system: many other languages sometimes or always do not use a copular verb when using an adjective or noun predicate. That is, they will just say Jill busy.
    3. Connection to logical representations: If you were to translate these sentences into a simple predicate logic form, you would presumably use busy(jill) and teacher(jill). The treatment of the adjective or noun as the predicate in a predicate logic form parallels what we do in our grammatical relations representation.
    4. Similarity of linking across constructions. While the representation differs, both the attributive (the white daisy) and the predicative (the daisy is white) use of adjectives yields a direct link between the adjective (white) and the noun (daisy).

  17. Can I just use your tokenizers for other purposes?

    Yes, you can. Various tokenizers are included. The one used for English is called PTBTokenizer. It is a hand-written rule-based (FSM) tokenizer, but is quite accurate over newswire-style text. Because it is rule-based it is quite fast (about 100,000 tokens per second on an Intel box in 2007). You can use it as follows:

    java edu.stanford.nlp.process.PTBTokenizer inputFile > outputFile

    There are several options, including one for batch-processing lots of files; see the Javadoc documentation of the main method of PTBTokenizer.

  18. How can I parse my gigabytes of text more quickly?

    There's not much in the way of secret sauce (partly by the design of the parsers as guaranteed to find model-optimal solutions). If you're not using englishPCFG.ser.gz for English, then you should be - it's much faster than the Factored parser. If you can exclude extremely long sentences (especially ones over 60 words or so), then that helps since they take disproportionately long times to parse. If POS-tagging sentences prior to parsing is an option, that speeds things up (fewer possibilities to search).
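
    Excluding long sentences can be done before the text ever reaches the parser. The sketch below assumes one already-tokenized sentence per line (the format used with -sentences newline and -tokenized) and simply counts whitespace-separated tokens; the class name LengthFilterSketch and the cutoff you pass in are your own choices, not parser settings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-filter for illustration; not part of the parser.
class LengthFilterSketch {
  /** Number of whitespace-separated tokens on a line. */
  static int tokenCount(String line) {
    String trimmed = line.trim();
    return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
  }

  /** Keeps only the sentences with at most maxTokens tokens. */
  static List<String> keepShort(List<String> sentences, int maxTokens) {
    List<String> kept = new ArrayList<String>();
    for (String s : sentences) {
      if (tokenCount(s) <= maxTokens) {
        kept.add(s);
      }
    }
    return kept;
  }
}
```

    For example, keepShort(lines, 60) would drop the sentences that the paragraph above singles out as disproportionately slow.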

    The parser doesn't support multithreading (don't expect correct results if you try it!). The main tool remaining is to run multiple parsing processes in parallel. This can be on multiple machines, but you can usefully run multiple parsing processes on one machine if you have dual CPU/dual core machines and enough memory. We've parsed at a rate of about 1,000,000 sentences a day by distributing the work over 6 dual processor machines.
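
    One simple way to set up such parallel runs is to divide your input files into as many groups as you have processes and start one parser process per group. The sketch below shows only the partitioning step, using a round-robin split; the class name ShardSketch is hypothetical, and how you launch each process (shell script, job scheduler, and so on) is left to you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the parser.
class ShardSketch {
  /** Deals the files out to the given number of shards in turn. */
  static List<List<String>> roundRobin(List<String> files, int shards) {
    if (shards < 1) {
      throw new IllegalArgumentException("need at least one shard");
    }
    List<List<String>> result = new ArrayList<List<String>>();
    for (int i = 0; i < shards; i++) {
      result.add(new ArrayList<String>());
    }
    for (int i = 0; i < files.size(); i++) {
      result.get(i % shards).add(files.get(i));
    }
    return result;
  }
}
```

    Each resulting group can then be given to its own separate java process, so that no parser instance is shared between threads.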

  19. Can you give me some help in getting started parsing Chinese?

    Sure!! These instructions concentrate on parsing from the command line, since you need to use that to be able to set most options. But you can also use the parser on Chinese from within the GUI.

    The parser is supplied with 5 Chinese grammars (and, with access to suitable training data, you could train other versions). All of these are trained on data from the Penn Chinese Treebank, and you should consult their site for details of the syntactic representation of Chinese which they use. They are:

                                   PCFG                 Factored                 Factored, segmenting
    Xinhua (mainland, newswire)    xinhuaPCFG.ser.gz    xinhuaFactored.ser.gz    xinhuaFactoredSegmenting.ser.gz
    Mixed Chinese                  chinesePCFG.ser.gz   chineseFactored.ser.gz

    The PCFG parsers are smaller and faster. But the Factored parser is significantly better for Chinese, and we would generally recommend its use. The xinhua grammars are trained solely on Xinhua newspaper text from mainland China. We would recommend their use for parsing material from mainland China. The chinese grammars also include some training material from Hong Kong SAR and Taiwan. We'd recommend their use if parsing material from these areas or a mixture of text types. Four of the parsers assume input that has already been word segmented, while the fifth does word segmentation internal to the parser. This is discussed further below. The parser also comes with 3 Chinese example sentences, in files whose names all begin with chinese.

    Character encoding: The first thing to get straight is the character encoding of the text you wish to parse. By default, our Chinese parser uses GB18030 (the native character encoding of the Penn Chinese Treebank and the national encoding of China) for input and output. However, it is very easy to parse text in another character encoding: you simply give the flag -encoding encoding to the parser, where encoding is a character set encoding name recognized within Java, such as: UTF-8, Big5-HKSCS, or GB18030. This changes the input and output encoding. If you want to display the output in a command window, you separately also need to work out what character set your computer supports for display. If that is different to the encoding of the file, you will need to convert the encoding for display. If any of this encoding stuff is wrong, then you are likely to see gibberish. Here are example commands for parsing two of the test files, one in UTF-8 and one in GB18030. The (Linux) computer that this is being run on is set up to work with UTF-8 (and this webpage is also in UTF-8), so for the case of GB18030, the output is piped through the Unix iconv utility for display.

    $ java -server -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 /u/nlp/data/lexparser/chineseFactored.ser.gz chinese-onesent-utf8.txt
    Loading parser from serialized file /u/nlp/data/lexparser/chineseFactored.ser.gz ... done [20.7 sec].
    Parsing file: chinese-onesent-utf8.txt with 2 sentences.
    Parsing [sent. 1 len. 8]: 俄国 希望 伊朗 没有 制造 核武器 计划 。
    (ROOT
      (IP
        (NP (NR 俄国))
        (VP (VV 希望)
          (IP
            (NP (NR 伊朗))
            (VP (VE 没有)
              (NP (NN 制造) (NN 核武器) (NN 计划)))))
        (PU 。)))
    
    Parsing [sent. 2 len. 6]: 他 在 学校 里 学习 。
    (ROOT
      (IP
        (NP (PN 他))
        (VP
          (PP (P 在)
            (LCP
              (NP (NN 学校))
              (LC 里)))
          (VP (VV 学习)))
        (PU 。)))
    
    Parsed file: chinese-onesent-utf8.txt [2 sentences].
    Parsed 14 words in 2 sentences (6.55 wds/sec; 0.94 sents/sec).
    
    $ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser chineseFactored.ser.gz chinese-onesent |& iconv -f gb18030 -t utf-8
    Loading parser from serialized file chineseFactored.ser.gz ... done [13.3 sec].
    Parsing file: chinese-onesent with 1 sentences.
    Parsing [sent. 1 len. 10]: 他 和 我 在 学校 里 常 打 桌球 。
    (ROOT
      (IP
        (NP (PN 他)
          (CC 和)
          (PN 我))
        (VP
          (PP (P 在)
            (LCP
              (NP (NN 学校))
              (LC 里)))
          (ADVP (AD 常))
          (VP (VV 打)
            (NP (NN 桌球))))
        (PU 。)))
    
    Parsed file: chinese-onesent [1 sentences].
    Parsed 10 words in 1 sentences (10.78 wds/sec; 1.08 sents/sec).
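
    If iconv is not available on your system, the same kind of conversion can be done with Java's standard charset support. The sketch below is a minimal illustration with a hypothetical class name (TranscodeSketch); it assumes your Java installation includes the GB18030 charset, and it reads the whole file into memory, so it is only suitable for files of modest size.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the parser.
class TranscodeSketch {
  /** Decodes the bytes using one charset and re-encodes them in another. */
  static byte[] transcode(byte[] input, String from, String to) {
    String text = new String(input, Charset.forName(from));
    return text.getBytes(Charset.forName(to));
  }

  public static void main(String[] args) throws IOException {
    // Usage: java TranscodeSketch inputFile outputFile fromCharset toCharset
    byte[] in = Files.readAllBytes(Paths.get(args[0]));
    Files.write(Paths.get(args[1]), transcode(in, args[2], args[3]));
  }
}
```

    For example, java TranscodeSketch chinese-onesent out-utf8.txt GB18030 UTF-8 would write a UTF-8 copy of the GB18030 test file (the output file name here is just an example).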
    

    Normalization: As well as the character set, there are also issues of "normalization" for characters: for instance, basic Latin letters can appear in either their "regular ASCII" forms or as "full width" forms, equivalent in size to Chinese characters. Character normalization is something we may revisit in the future, but at present, the parser was trained on text which mainly has fullwidth Latin letters and punctuation and does no normalization, and so you will get far better results if you also represent such characters as fullwidth letters. The parser does provide an escaper that will do this mapping for you on input. You can invoke it with the -escaper flag, by using a command like the following (which also shows output being sent to a file):

    $ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper -sentences newline chineseFactored.ser.gz chinese-onesent > chinese-onesent.stp
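
    To show what "fullwidth" means in practice, the sketch below maps printable ASCII characters to their fullwidth Unicode counterparts, which sit at a fixed offset of 0xFEE0 from the ASCII code points. This is our own illustration of the idea with a hypothetical class name (FullwidthSketch); it is not the code of ChineseEscaper, which may map a different set of characters. Spaces are left untouched here because they separate the tokens.

```java
// Hypothetical helper for illustration; not the parser's ChineseEscaper.
class FullwidthSketch {
  /** Replaces printable ASCII characters ('!' to '~') by their fullwidth forms. */
  static String toFullwidth(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= '!' && c <= '~') {
        // U+FF01..U+FF5E mirror U+0021..U+007E at a constant offset.
        sb.append((char) (c + 0xFEE0));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```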
    

    Word segmentation: Chinese is not normally written with spaces between words. But the examples shown above were all parsing text that had already been segmented into words according to the conventions of the Penn Chinese Treebank. For best results, we recommend that you first segment input text with a high quality word segmentation system which provides word segmentation according to Penn Chinese Treebank conventions (note that there are many different conventions for Chinese word segmentation...). You can find out much more information about CTB word segmentation from the First, Second, or Third International Chinese Word Segmentation Bakeoff. In particular, you can now download a version of our CRF-based word segmenter (similar to the system we used in the Second Sighan Bakeoff) from our software page. However, for convenience, we also provide an ability for the parser to do word segmentation. Essentially, it misuses the parser as a first-order HMM Chinese word segmentation system. This gives a reasonable, but not excellent, Chinese word segmentation system. (Its performance isn't as good as the Stanford CRF word segmenter mentioned above.) To use it, you use the -segmentMarkov option or a grammar trained with this option. For example:

    $ iconv -f gb18030 -t utf8 < chinese-onesent-unseg.txt
    他在学校学习。
    $ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser xinhuaFactoredSegmenting.ser.gz chinese-onesent-unseg.txt |& iconv -f gb18030 -t utf-8
    Loading parser from serialized file xinhuaFactoredSegmenting.ser.gz ... done [6.8 sec].
    Parsing file: chinese-onesent-unseg.txt with 1 sentences.
    Parsing [sent. 1 len. 5]: 他 在 学校 学习 。
    Trying recovery parse...
    Sentence couldn't be parsed by grammar.... falling back to PCFG parse.
    (ROOT
      (IP
        (NP (PN 他))
        (VP
          (PP (P 在)
            (NP (NN 学校)))
          (VP (VV 学习)))
        (PU 。)))
    
    Parsed file: chinese-onesent-unseg.txt [1 sentences].
    Parsed 5 words in 1 sentences (6.08 wds/sec; 1.22 sents/sec).
      1 sentences were parsed by fallback to PCFG.
    

    Grammatical relations: The Chinese parser also supports grammatical relations (typed dependencies) output. For instance:

    $ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependencies xinhuaFactored.ser.gz chinese-onesent |& iconv -f gb18030 -t utf-8
    Loading parser from serialized file xinhuaFactored.ser.gz ... done [4.9 sec].
    Parsing file: chinese-onesent with 1 sentences.
    Parsing [sent. 1 len. 10]: 他 和 我 在 学校 里 常 打 桌球 。
    conj(我-3, 他-1)
    cc(我-3, 和-2)
    nsubj(打-8, 我-3)
    prep(打-8, 在-4)
    lobj(里-6, 学校-5)
    plmod(在-4, 里-6)
    advmod(打-8, 常-7)
    dobj(打-8, 桌球-9)
    
    Parsed file: chinese-onesent [1 sentences].
    Parsed 10 words in 1 sentences (7.10 wds/sec; 0.71 sents/sec).
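
    If you want to post-process this textual output rather than call the API, each dependency line has the shape relation(governor-index, dependent-index). The sketch below pulls those five fields out of one line with a regular expression. It is a convenience written for this FAQ under the assumption that lines look exactly like the ones printed above; the class name DependencyLineSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the parser.
class DependencyLineSketch {
  // relation ( governor - index , dependent - index )
  private static final Pattern LINE =
      Pattern.compile("^(\\S+?)\\((.+)-(\\d+), (.+)-(\\d+)\\)$");

  /**
   * Returns {relation, governor, governorIndex, dependent, dependentIndex},
   * or null if the line is not a dependency line (e.g. a log message).
   */
  static String[] parse(String line) {
    Matcher m = LINE.matcher(line.trim());
    if (!m.matches()) {
      return null;
    }
    return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
  }
}
```

    Lines for which parse returns null, such as the "Parsing file" and "Parsed ... words" messages, can simply be skipped.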
    
  20. Can you give me some help in getting started parsing Arabic?

    Sure! See the Stanford Arabic Parser IAQ.

  21. Can I just use the parser as a vanilla PCFG parser?

    There are many kinds of 'vanilla', but, providing your treebank is in Penn Treebank format, then, yes, this is easy to do. You can train and test the parser as follows, assuming that your training trees are in train.txt and your test trees are in test.txt:

    java -server -mx2g edu.stanford.nlp.parser.lexparser.LexicalizedParser -PCFG -vMarkov 1 -uwm 0 -headFinder edu.stanford.nlp.trees.LeftHeadFinder -train train.txt -test test.txt >& output.log

    Going through the options, we ask for just the PCFG model (-PCFG), for just conditioning rules based on their left-hand side (parent), whereas the default also conditions on grandparents (-vMarkov 1), to use no language-specific heuristics for unknown word processing (-uwm 0), and to always just choose the left-most category on a rule RHS as the head (-headFinder edu.stanford.nlp.trees.LeftHeadFinder). Using a plain PCFG (i.e., no markovization of rules), the headFinder does not affect results, but unless you use this head finder, you will see errors about the parser not finding head categories (if your categories differ from those of the Penn Treebank).
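
    As a reminder of what one common reading of "vanilla PCFG" amounts to, the sketch below estimates rule probabilities by relative frequency: the count of a rule divided by the count of its left-hand side. This is a textbook illustration only, under the assumption that rules have already been read off the training trees as strings of the form "LHS -> RHS"; it is not a description of what the parser computes internally (which may, for example, smooth its estimates), and the class name PcfgMleSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Textbook relative-frequency estimate; illustrative, not the parser's code.
class PcfgMleSketch {
  private static String lhsOf(String rule) {
    return rule.substring(0, rule.indexOf(" -> "));
  }

  /** Maps each distinct rule "LHS -> RHS" to count(rule) / count(LHS). */
  static Map<String, Double> estimate(List<String> rules) {
    Map<String, Integer> ruleCounts = new HashMap<>();
    Map<String, Integer> lhsCounts = new HashMap<>();
    for (String rule : rules) {
      ruleCounts.merge(rule, 1, Integer::sum);
      lhsCounts.merge(lhsOf(rule), 1, Integer::sum);
    }
    Map<String, Double> probs = new HashMap<>();
    for (Map.Entry<String, Integer> e : ruleCounts.entrySet()) {
      int lhsTotal = lhsCounts.get(lhsOf(e.getKey()));
      probs.put(e.getKey(), e.getValue() / (double) lhsTotal);
    }
    return probs;
  }
}
```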

  22. Can you give me complete documentation of command-line options/public APIs/included grammars/...?

    At present, we don't have any documentation beyond what you get in the download and what's on this page. If you would like to help by producing better documentation, feel free to write to parser-support@lists.stanford.edu.

    Some parser command-line options are documented. See the parser.lexparser package documentation, the LexicalizedParser.main method documentation, the TreePrint class, and the documentation of variables in the Train, Test, and Options classes, and appropriate language-particular TreebankLangParserParams. For the rest, you need to look at the source code. The public API is somewhat documented in the LexicalizedParser class JavaDoc. See especially the sample invocation in the parser.lexparser package documentation. The included file makeSerialized.csh effectively documents how the included grammars were made.

    The included file ParserDemo.java gives an example of how to call the parser programmatically, including getting Tree and typedDependencies output. It is reproduced below:

    import java.util.*;
    import edu.stanford.nlp.trees.*;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    
    class ParserDemo {
      public static void main(String[] args) {
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        parse.pennPrint();
        System.out.println();
    
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();
    
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
      }
    }
    
  23. What output formats can I get with the -outputFormat and -outputFormatOptions options?

    You can give the options -outputFormat typedDependencies or -outputFormat typedDependenciesCollapsed to get typed dependencies (or grammatical relations) output (for English and Chinese only, currently). You can print out lexicalized trees (head words and tags at each phrasal node) with the -outputFormatOptions lexicalize option. You can see all the other options by looking in the Javadoc of the TreePrint class.

  24. Can I have the parser run as a filter (that is, parse stuff typed in)?

    Yes, you use a filename of a single dash/minus character: -. E.g.,

    java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser englishPCFG.ser.gz -

    For interactive use, you may find it convenient to turn off the stderr output. For example, in bash you could use the command:

    java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser englishPCFG.ser.gz - 2> /dev/null