Package edu.stanford.nlp.pipeline

Linguistic Annotation Pipeline

Class Summary
Annotation An annotation representing a span of text in a document.
ChunkAnnotationUtils Utility functions for annotating chunks
CoreMapAttributeAggregator Functions for aggregating token attributes
CoreMapAttributeAggregator.ConcatAggregator  
CoreMapAttributeAggregator.ConcatCoreMapListAggregator<T extends CoreMap>  
CoreMapAttributeAggregator.ConcatListAggregator<T>  
CoreMapAttributeAggregator.MostFreqAggregator  
LabeledChunkIdentifier Identifies chunks based on labels that use IOB-like encoding. Assumes labels have the form tag-type, where the tag is a prefix indicating where in the chunk the token is.
LabeledChunkIdentifier.LabelTagType Class representing a label, tag, and type
 

Package edu.stanford.nlp.pipeline Description

Linguistic Annotation Pipeline

The point of this package is to enable people to quickly and painlessly get complete linguistic annotations of their text. It is designed to be highly flexible and extensible. I will first discuss the organization and functions of the classes, and then I will give some sample code and a run-down of the implemented Annotators.

Annotation

An Annotation is the data structure which holds the results of annotators. An Annotation is basically a map from keys to bits of annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotations are designed to operate at the sentence level; however, depending on the Annotators you use, this may not be how you choose to use the package.
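The typed-map design behind Annotation can be sketched in plain Java. The names below (MiniAnnotation, PartOfSpeechKey, TokenCountKey) are illustrative only, not the real CoreNLP classes; the real Annotation uses Class objects as keys in the same way, so that each key statically determines the type of the value stored under it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the typed-map idea behind Annotation (hypothetical names).
public class TypedMapSketch {
    // Marker interface: a key whose values have type V.
    public interface Key<V> {}

    public static class PartOfSpeechKey implements Key<String> {}
    public static class TokenCountKey implements Key<Integer> {}

    public static class MiniAnnotation {
        private final Map<Class<?>, Object> map = new HashMap<>();

        // The key's type parameter constrains the value's type at compile time.
        public <V> void set(Class<? extends Key<V>> key, V value) {
            map.put(key, value);
        }

        @SuppressWarnings("unchecked")
        public <V> V get(Class<? extends Key<V>> key) {
            return (V) map.get(key);
        }
    }

    public static void main(String[] args) {
        MiniAnnotation ann = new MiniAnnotation();
        ann.set(PartOfSpeechKey.class, "NNP");
        ann.set(TokenCountKey.class, 13);
        System.out.println(ann.get(PartOfSpeechKey.class)); // NNP
        System.out.println(ann.get(TokenCountKey.class));   // 13
    }
}
```

The payoff of this design is that `ann.get(PartOfSpeechKey.class)` is typed as a String with no cast at the call site, which is exactly how lookups like `token.get(PartOfSpeechAnnotation.class)` work in the sample code later in this document.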

Annotators

The backbone of this package is the Annotators. Annotators are a lot like functions, except that they operate over Annotations instead of Objects. They do things like tokenize, parse, or NER tag sentences. In the javadocs of your Annotator you should specify what the Annotator assumes already exists (for instance, the NERAnnotator assumes that the sentence has been tokenized) and where to find these annotations (in the NERAnnotator example, that would be WordsPLAnnotation.class). You should also specify what the Annotator adds to the annotation, and where.
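This requires/provides contract can be illustrated with a toy annotator over a plain map. Everything below (MiniAnnotator, TokenCountAnnotator, the string keys) is hypothetical and far simpler than the real API, which operates on the typed Annotation structure described above; the point is only the shape of the contract: check for required annotations, then add the provided ones.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the annotator contract (hypothetical names, not the CoreNLP API).
public class AnnotatorContractSketch {
    public interface Annotator {
        void annotate(Map<String, Object> annotation);
    }

    // Requires: "tokens" (a String[]).  Provides: "tokenCount" (an Integer).
    public static class TokenCountAnnotator implements Annotator {
        @Override
        public void annotate(Map<String, Object> annotation) {
            String[] tokens = (String[]) annotation.get("tokens");
            if (tokens == null) {
                // Fail loudly if a required upstream annotation is missing.
                throw new IllegalStateException("TokenCountAnnotator requires tokens");
            }
            annotation.put("tokenCount", tokens.length);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> annotation = new HashMap<>();
        annotation.put("tokens", new String[] {"Stanford", "NLP"});
        new TokenCountAnnotator().annotate(annotation);
        System.out.println(annotation.get("tokenCount")); // 2
    }
}
```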

AnnotationPipeline

An AnnotationPipeline is where many Annotators are strung together to form a linguistic annotation pipeline. It is, itself, an Annotator. AnnotationPipelines usually also keep track of how much time they spend annotating and loading, to help users find where the time sinks are. However, the class AnnotationPipeline is not meant to be used as is. It serves as an example of how to build your own pipeline. If you just want to use a typical NLP pipeline, take a look at StanfordCoreNLP (described later in this document).
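A stripped-down sketch of this composition, with hypothetical names rather than the real classes, shows the two ideas at once: the pipeline implements the same annotator interface as its members, and it records the time spent running them.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of pipeline-as-annotator composition (hypothetical names).
public class MiniPipelineSketch {
    public interface MiniAnnotator {
        void annotate(StringBuilder doc);
    }

    // The pipeline is itself an annotator: it runs its sub-annotators in
    // order and accumulates the time spent in each.
    public static class MiniPipeline implements MiniAnnotator {
        private final List<MiniAnnotator> annotators = new ArrayList<>();
        private long totalNanos = 0;

        public void addAnnotator(MiniAnnotator a) {
            annotators.add(a);
        }

        @Override
        public void annotate(StringBuilder doc) {
            for (MiniAnnotator a : annotators) {
                long start = System.nanoTime();
                a.annotate(doc);
                totalNanos += System.nanoTime() - start;
            }
        }

        public long totalNanos() {
            return totalNanos;
        }
    }

    public static void main(String[] args) {
        MiniPipeline pipeline = new MiniPipeline();
        pipeline.addAnnotator(doc -> doc.append(" [tokenized]"));
        pipeline.addAnnotator(doc -> doc.append(" [tagged]"));
        StringBuilder doc = new StringBuilder("Some text.");
        pipeline.annotate(doc);
        System.out.println(doc); // Some text. [tokenized] [tagged]
    }
}
```

Because MiniPipeline implements MiniAnnotator, pipelines can be nested inside other pipelines, which is also true of the real AnnotationPipeline.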

Sample Usage

Here is some sample code from PipelineTest which illustrates the intended usage of the package:
  public static void samplePipeline(String text) {
    AnnotationPipeline pipeline = new AnnotationPipeline();
    pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
    pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
    pipeline.addAnnotator(new POSTaggerAnnotator(false));
    pipeline.addAnnotator(new MorphaAnnotator(false));
    pipeline.addAnnotator(new OldNERAnnotator(false));
    pipeline.addAnnotator(new ParserAnnotator(false, false));

    // create annotation with text
    DocumentAnnotation document = new DocumentAnnotation(text);

    // annotate text with pipeline
    pipeline.annotate(document);

    // iterate through sentences, tokens, etc.
    for (SentenceAnnotation sentence: document.get(SentencesAnnotation.class)) {
      Tree tree = sentence.get(TreeAnnotation.class);
      for (TokenAnnotation token: sentence.get(TokensAnnotation.class)) {
        String tokenText = token.get(TextAnnotation.class);
        String tokenPOS = token.get(PartOfSpeechAnnotation.class);
        String tokenLemma = token.get(LemmaAnnotation.class);
        String tokenNE = token.get(NamedEntityTagAnnotation.class);
        ...
      }
    }
  }

Existing Annotators

There already exist Annotators for many common tasks, all of which include default model locations, so they can be used off the shelf.

How Do I Use This?

You do not have to construct your pipeline from scratch! For typical NLP processing, use StanfordCoreNLP. This pipeline implements the most common functionality needed: tokenization, lemmatization, POS tagging, NER, parsing and semantic role labeling. Read below to learn how to use this pipeline from the command line, or directly in your Java code.

Using StanfordCoreNLP from the Command Line

The command line for StanfordCoreNLP is:
./bin/stanfordcorenlp.sh
or
java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP <properties>
where the following properties are defined (if -props or annotators is not specified, default properties are loaded from the classpath):
        "annotators" - comma separated list of annotators
                The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, coref, dcoref, nfl

        If annotator "pos" is defined:
        "pos.model" - path to the POS tagger model

        If annotator "ner" is defined:
        "ner.model.3class" - path to the three-class NER model
        "ner.model.7class" - path to the seven-class NER model
        "ner.model.MISCclass" - path to the NER model with a MISC class

        If annotator "truecase" is defined:
        "truecase.model" - path to the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
        "truecase.bias" - class bias of the true case model; default: INIT_UPPER:-0.7,UPPER:-0.7,O:0
        "truecase.mixedcasefile" - path to the mixed case file; default: StanfordCoreNLPModels/truecase/MixDisambiguation.list

        If annotator "nfl" is defined:
        "nfl.gazetteer" - path to the gazetteer for the NFL domain
        "nfl.relation.model" - path to the NFL relation extraction model

        If annotator "parse" is defined:
        "parser.model" - path to the PCFG parser model

Command line properties:
        "file" - run the pipeline on the contents of this file, or on the contents of the files in this directory
                 XML output is generated for every input file as file.xml
        "extension" - if -file used with a directory, process only the files with this extension
        "filelist" - run the pipeline on the list of files given in this file
                     XML output is generated for every input file as file.outputExtension
        "outputDirectory" - where to put XML output (defaults to the current directory)
        "outputExtension" - extension to use for the output file (defaults to ".xml").  Don't forget the dot!
        "replaceExtension" - flag to chop off the last extension before adding outputExtension to file
        "noClobber" - don't automatically overwrite (clobber) output files that already exist
If none of the above are present, the pipeline runs in an interactive shell (default properties will be loaded from the classpath). The shell accepts input from stdin and displays the output on stdout. To avoid clutter on the command line, you can store some or all of these properties in a properties file and pass this file to StanfordCoreNLP using the -props option. For example, my pipe.properties file contains the following:
annotators=tokenize,ssplit,pos,lemma,ner,parse,coref
pos.model=models/left3words-wsj-0-18.tagger
ner.model.3class=models/ner-en-3class.crf.gz
ner.model.7class=models/muc.7class.crf.gz
ner.model.distsim=models/conll.distsim.crf.ser.gz
#nfl.gazetteer = models/NFLgazetteer.txt
#nfl.relation.model = models/nfl_relation_model.ser
parser.model=models/englishPCFG.ser.gz
coref.model=models/coref/corefClassifierAll.March2009.ser.gz
coref.name.dir=models/coref
wordnet.dir=models/wordnet-3.0-prolog
Using this properties file, I run the pipeline's interactive shell as follows:
java -cp classes/:lib/xom.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties
In the above setup, the system displays a shell-like prompt and waits for stdin input. You can input any English text. Processing starts after each new line, and the output is displayed on standard output in a format (somewhat) interpretable by humans. For example, for the input "Reagan announced he had Alzheimer's disease, an incurable brain affliction." the shell displays the following output:
[Text=Reagan PartOfSpeech=NNP Lemma=Reagan NamedEntityTag=PERSON] [Text=announced PartOfSpeech=VBD Lemma=announce NamedEntityTag=O] [Text=he PartOfSpeech=PRP Lemma=he NamedEntityTag=O] [Text=had PartOfSpeech=VBD Lemma=have NamedEntityTag=O] [Text=Alzheimer PartOfSpeech=NNP Lemma=Alzheimer NamedEntityTag=O] [Text='s PartOfSpeech=POS Lemma='s NamedEntityTag=O] [Text=disease PartOfSpeech=NN Lemma=disease NamedEntityTag=O] [Text=, PartOfSpeech=, Lemma=, NamedEntityTag=O] [Text=an PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=incurable PartOfSpeech=JJ Lemma=incurable NamedEntityTag=O] [Text=brain PartOfSpeech=NN Lemma=brain NamedEntityTag=O] [Text=affliction PartOfSpeech=NN Lemma=affliction NamedEntityTag=O] [Text=. PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
  (S
    (NP (NNP Reagan))
    (VP (VBD announced)
      (SBAR
        (S
          (NP (PRP he))
          (VP (VBD had)
            (NP
              (NP
                (NP (NNP Alzheimer) (POS 's))
                (NN disease))
              (, ,)
              (NP (DT an) (JJ incurable) (NN brain) (NN affliction)))))))
    (. .)))

nsubj(announced-2, Reagan-1)
nsubj(had-4, he-3)
ccomp(announced-2, had-4)
poss(disease-7, Alzheimer-5)
dobj(had-4, disease-7)
det(affliction-12, an-9)
amod(affliction-12, incurable-10)
nn(affliction-12, brain-11)
appos(disease-7, affliction-12)
where the first part of the output shows the individual words and their attributes, e.g., POS and NE tags, the second block shows the constituent parse tree, and the last block shows the syntactic dependencies extracted from the parse tree. Note that the coreference chains are stored in the individual words. For example, the referent for the "he" pronoun is stored as "CorefDest=1 1", which means that the referent is the first token in the first sentence in this text, i.e., "Reagan".

Alternatively, if you want to process all the .txt files in the directory data/, use this command line:

java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties -file data -extension .txt
Or, you can store all the files that you want processed one per line in a separate file, and pass the latter file to StanfordCoreNLP with the following options:
java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties -filelist list_of_files_to_process.txt
In both cases the pipeline generates a file.txt.xml output file for every file.txt it processes. For example, if file.txt contains the following text:
Federal Reserve Chairman Ben Bernanke declared Friday
that the U.S. economy is on the verge of a long-awaited recovery.
the pipeline generates the following XML output in file.txt.xml:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://nlp.stanford.edu">
  <sentence>
    <wordTable>
      <wordInfo id="1">
        <word>Federal</word>
        <lemma>Federal</lemma>
        <POS>NNP</POS>
        <NER>ORGANIZATION</NER>
      </wordInfo>
      <wordInfo id="2">
        <word>Reserve</word>
        <lemma>Reserve</lemma>
        <POS>NNP</POS>
        <NER>ORGANIZATION</NER>
      </wordInfo>
      <wordInfo id="3">
        <word>Chairman</word>
        <lemma>Chairman</lemma>
        <POS>NNP</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="4">
        <word>Ben</word>
        <lemma>Ben</lemma>
        <POS>NNP</POS>
        <NER>PERSON</NER>
      </wordInfo>
      <wordInfo id="5">
        <word>Bernanke</word>
        <lemma>Bernanke</lemma>
        <POS>NNP</POS>
        <NER>PERSON</NER>
      </wordInfo>
      <wordInfo id="6">
        <word>declared</word>
        <lemma>declare</lemma>
        <POS>VBD</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="7">
        <word>Friday</word>
        <lemma>Friday</lemma>
        <POS>NNP</POS>
        <NER>DATE</NER>
      </wordInfo>
      <wordInfo id="8">
        <word>that</word>
        <lemma>that</lemma>
        <POS>IN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="9">
        <word>the</word>
        <lemma>the</lemma>
        <POS>DT</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="10">
        <word>U.S.</word>
        <lemma>U.S.</lemma>
        <POS>NNP</POS>
        <NER>LOCATION</NER>
      </wordInfo>
      <wordInfo id="11">
        <word>economy</word>
        <lemma>economy</lemma>
        <POS>NN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="12">
        <word>is</word>
        <lemma>be</lemma>
        <POS>VBZ</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="13">
        <word>on</word>
        <lemma>on</lemma>
        <POS>IN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="14">
        <word>the</word>
        <lemma>the</lemma>
        <POS>DT</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="15">
        <word>verge</word>
        <lemma>verge</lemma>
        <POS>NN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="16">
        <word>of</word>
        <lemma>of</lemma>
        <POS>IN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="17">
        <word>a</word>
        <lemma>a</lemma>
        <POS>DT</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="18">
        <word>long-awaited</word>
        <lemma>long-awaited</lemma>
        <POS>JJ</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="19">
        <word>recovery</word>
        <lemma>recovery</lemma>
        <POS>NN</POS>
        <NER>O</NER>
      </wordInfo>
      <wordInfo id="20">
        <word>.</word>
        <lemma>.</lemma>
        <POS>.</POS>
        <NER>O</NER>
      </wordInfo>
    </wordTable>
    <parse>(ROOT
  (S
    (NP (NNP Federal) (NNP Reserve) (NNP Chairman) (NNP Ben) (NNP Bernanke))
    (VP (VBD declared)
      (NP-TMP (NNP Friday))
      (SBAR (IN that)
        (S
          (NP (DT the) (NNP U.S.) (NN economy))
          (VP (VBZ is)
            (PP (IN on)
              (NP
                (NP (DT the) (NN verge))
                (PP (IN of)
                  (NP (DT a) (JJ long-awaited) (NN recovery)))))))))
    (. .)))</parse>
    <dependencies>
      <dep type="nn">
        <governor idx="5">Bernanke</governor>
        <dependent idx="1">Federal</dependent>
      </dep>
      <dep type="nn">
        <governor idx="5">Bernanke</governor>
        <dependent idx="2">Reserve</dependent>
      </dep>
      <dep type="nn">
        <governor idx="5">Bernanke</governor>
        <dependent idx="3">Chairman</dependent>
      </dep>
      <dep type="nn">
        <governor idx="5">Bernanke</governor>
        <dependent idx="4">Ben</dependent>
      </dep>
      <dep type="nsubj">
        <governor idx="7">Friday</governor>
        <dependent idx="5">Bernanke</dependent>
      </dep>
      <dep type="dep">
        <governor idx="7">Friday</governor>
        <dependent idx="6">declared</dependent>
      </dep>
      <dep type="complm">
        <governor idx="12">is</governor>
        <dependent idx="8">that</dependent>
      </dep>
      <dep type="det">
        <governor idx="11">economy</governor>
        <dependent idx="9">the</dependent>
      </dep>
      <dep type="nn">
        <governor idx="11">economy</governor>
        <dependent idx="10">U.S.</dependent>
      </dep>
      <dep type="nsubj">
        <governor idx="12">is</governor>
        <dependent idx="11">economy</dependent>
      </dep>
      <dep type="ccomp">
        <governor idx="7">Friday</governor>
        <dependent idx="12">is</dependent>
      </dep>
      <dep type="prep">
        <governor idx="12">is</governor>
        <dependent idx="13">on</dependent>
      </dep>
      <dep type="det">
        <governor idx="15">verge</governor>
        <dependent idx="14">the</dependent>
      </dep>
      <dep type="pobj">
        <governor idx="13">on</governor>
        <dependent idx="15">verge</dependent>
      </dep>
      <dep type="prep">
        <governor idx="15">verge</governor>
        <dependent idx="16">of</dependent>
      </dep>
      <dep type="det">
        <governor idx="19">recovery</governor>
        <dependent idx="17">a</dependent>
      </dep>
      <dep type="amod">
        <governor idx="19">recovery</governor>
        <dependent idx="18">long-awaited</dependent>
      </dep>
      <dep type="pobj">
        <governor idx="16">of</governor>
        <dependent idx="19">recovery</dependent>
      </dep>
    </dependencies>
  </sentence>
</root>
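The generated XML can be read back with standard tools. As an illustration (not part of CoreNLP), the sketch below parses a trimmed-down stand-in for a real file.txt.xml with the JDK's DOM API; note that all elements live in the http://nlp.stanford.edu default namespace declared on the root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Reads <word> elements out of the pipeline's XML output with the JDK DOM API.
public class ReadPipelineXml {
    // Trimmed-down stand-in for a real file.txt.xml.
    public static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<root xmlns=\"http://nlp.stanford.edu\"><sentence><wordTable>"
        + "<wordInfo id=\"1\"><word>Federal</word><POS>NNP</POS><NER>ORGANIZATION</NER></wordInfo>"
        + "<wordInfo id=\"2\"><word>Reserve</word><POS>NNP</POS><NER>ORGANIZATION</NER></wordInfo>"
        + "</wordTable></sentence></root>";

    public static List<String> words(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // elements are in the default namespace
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS("http://nlp.stanford.edu", "word");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(words(SAMPLE)); // [Federal, Reserve]
    }
}
```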

If the NFL annotator is enabled, additional XML output is generated for the corresponding domain-specific entities and relations. For example, for the sentence "The 49ers beat Dallas 20-10 in the Sunday game." the NFL-specific output is:

    <MachineReading>
      <entities>
        <entity id="EntityMention1">
          <type>NFLTeam</type>
          <span start="1" end="2" />
        </entity>
        <entity id="EntityMention2">
          <type>NFLTeam</type>
          <span start="3" end="4" />
        </entity>
        <entity id="EntityMention3">
          <type>FinalScore</type>
          <span start="4" end="5" />
        </entity>
        <entity id="EntityMention4">
          <type>FinalScore</type>
          <span start="6" end="7" />
        </entity>
        <entity id="EntityMention5">
          <type>Date</type>
          <span start="9" end="10" />
        </entity>
        <entity id="EntityMention6">
          <type>NFLGame</type>
          <span start="10" end="11" />
        </entity>
      </entities>
      <relations>
        <relation id="RelationMention-11">
          <type>teamScoringAll</type>
          <arguments>
            <entity id="EntityMention3">
              <type>FinalScore</type>
              <span start="4" end="5" />
            </entity>
            <entity id="EntityMention1">
              <type>NFLTeam</type>
              <span start="1" end="2" />
            </entity>
          </arguments>
        </relation>
        <relation id="RelationMention-17">
          <type>teamScoringAll</type>
          <arguments>
            <entity id="EntityMention4">
              <type>FinalScore</type>
              <span start="6" end="7" />
            </entity>
            <entity id="EntityMention2">
              <type>NFLTeam</type>
              <span start="3" end="4" />
            </entity>
          </arguments>
        </relation>
        <relation id="RelationMention-20">
          <type>teamFinalScore</type>
          <arguments>
            <entity id="EntityMention4">
              <type>FinalScore</type>
              <span start="6" end="7" />
            </entity>
            <entity id="EntityMention6">
              <type>NFLGame</type>
              <span start="10" end="11" />
            </entity>
          </arguments>
        </relation>
        <relation id="RelationMention-25">
          <type>gameDate</type>
          <arguments>
            <entity id="EntityMention5">
              <type>Date</type>
              <span start="9" end="10" />
            </entity>
            <entity id="EntityMention6">
              <type>NFLGame</type>
              <span start="10" end="11" />
            </entity>
          </arguments>
        </relation>
      </relations>
    </MachineReading>

The StanfordCoreNLP API

To construct a pipeline object from a given set of properties, use StanfordCoreNLP(Properties props). This method creates the pipeline using the annotators given in the "annotators" property (see above for the complete list of properties and supported annotator names).
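For example, a pipeline might be set up as follows. The property values below are illustrative (the model path in particular is an example, not a guaranteed default); since building the pipeline requires the CoreNLP jar on the classpath, the construction calls are shown in comments and only the Properties setup runs on its own.

```java
import java.util.Properties;

// Illustrative property setup for StanfordCoreNLP(Properties props).
public class BuildPipelineProps {
    public static Properties corenlpProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
        props.setProperty("pos.model", "models/left3words-wsj-0-18.tagger"); // example path
        return props;
    }

    public static void main(String[] args) {
        Properties props = corenlpProps();
        System.out.println(props.getProperty("annotators"));
        // With the CoreNLP jar on the classpath, the pipeline is built and run as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   Annotation document = new Annotation(text);
        //   pipeline.annotate(document);
    }
}
```

The same Properties object can instead be loaded from a file such as the pipe.properties example earlier in this document, via Properties.load.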

To run the pipeline over some text, use StanfordCoreNLP.process(Reader reader). This method returns an Annotation object, which stores all the annotations generated for the given text; individual annotations are then retrieved from it with its typed get methods, keyed by annotation class.

Author:
Jenny Finkel, Mihai Surdeanu, Steven Bethard, David McClosky
Last modified: Oct 7, 2010



Stanford NLP Group