Class | Summary
---|---
Annotation | An annotation representing a span of text in a document.
ChunkAnnotationUtils | Utility functions for annotating chunks.
CoreMapAttributeAggregator | Functions for aggregating token attributes.
CoreMapAttributeAggregator.ConcatAggregator |
CoreMapAttributeAggregator.ConcatCoreMapListAggregator<T extends CoreMap> |
CoreMapAttributeAggregator.ConcatListAggregator<T> |
CoreMapAttributeAggregator.MostFreqAggregator |
LabeledChunkIdentifier | Identifies chunks based on labels that use an IOB-like encoding; assumes labels have the form …
LabeledChunkIdentifier.LabelTagType | Class representing a label, tag and type.
Annotators should document which annotations they expect to already be present in the Annotation (e.g., WordsPLAnnotation.class). They should also specify what they add to the annotation, and where. The following sample code constructs such a pipeline and uses its output:
public static void samplePipeline(String text) {
  // assemble the processing pipeline
  AnnotationPipeline pipeline = new AnnotationPipeline();
  pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
  pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
  pipeline.addAnnotator(new POSTaggerAnnotator(false));
  pipeline.addAnnotator(new MorphaAnnotator(false));
  pipeline.addAnnotator(new OldNERAnnotator(false));
  pipeline.addAnnotator(new ParserAnnotator(false, false));

  // create annotation with text
  DocumentAnnotation document = new DocumentAnnotation(text);
  // annotate text with pipeline
  pipeline.annotate(document);

  // iterate through sentences, tokens, etc.
  for (SentenceAnnotation sentence : document.get(SentencesAnnotation.class)) {
    Tree tree = sentence.get(TreeAnnotation.class);
    for (TokenAnnotation token : sentence.get(TokensAnnotation.class)) {
      String tokenText = token.get(TextAnnotation.class);
      String tokenPOS = token.get(PartOfSpeechAnnotation.class);
      String tokenLemma = token.get(LemmaAnnotation.class);
      String tokenNE = token.get(NamedEntityTagAnnotation.class);
      // ...
    }
  }
}
The pipeline can also be run from the command line, either with the included shell script:

./bin/stanfordcorenlp.sh

or directly with:

java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP <properties>

where the following properties are defined (if -props or annotators is not defined, default properties will be loaded via the classpath):
"annotators" - comma separated list of annotators The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, coref, dcoref, nfl If annotator "pos" is defined: "pos.model" - path towards the POS tagger model If annotator "ner" is defined: "ner.model.3class" - path towards the three-class NER model "ner.model.7class" - path towards the seven-class NER model "ner.model.MISCclass" - path towards the NER model with a MISC class If annotator "truecase" is defined: "truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz "truecase.bias" - class bias of the true case model; default: INIT_UPPER:-0.7,UPPER:-0.7,O:0 "truecase.mixedcasefile" - path towards the mixed case file; default: StanfordCoreNLPModels/truecase/MixDisambiguation.list If annotator "nfl" is defined: "nfl.gazetteer" - path towards the gazetteer for the NFL domain "nfl.relation.model" - path towards the NFL relation extraction model If annotator "parse" is defined: "parser.model" - path towards the PCFG parser model Command line properties: "file" - run the pipeline on the contents of this file, or on the contents of the files in this directory XML output is generated for every input file "file" as file.xml "extension" - if -file used with a directory, process only the files with this extension "filelist" - run the pipeline on the list of files given in this file XML output is generated for every input file as file.outputExtension "outputDirectory" - where to put XML output (defaults to the current directory) "outputExtension" - extension to use for the output file (defaults to ".xml"). Don't forget the dot! "replaceExtension" - flag to chop off the last extension before adding outputExtension to file "noClobber" - don't automatically override (clobber) output files that already existIf none of the above are present, run the pipeline in an interactive shell (default properties will be loaded from the classpath). The shell accepts input from stdin and displays the output at stdout. To avoid clutter in the command line you can store some or all of these properties in a properties file and pass this file to
StanfordCoreNLP
using the -props
option. For example,
my pipe.properties
file contains the following:
annotators=tokenize,ssplit,pos,lemma,ner,parse,coref
pos.model=models/left3words-wsj-0-18.tagger
ner.model.3class=models/ner-en-3class.crf.gz
ner.model.7class=models/muc.7class.crf.gz
ner.model.distsim=models/conll.distsim.crf.ser.gz
#nfl.gazetteer = models/NFLgazetteer.txt
#nfl.relation.model = models/nfl_relation_model.ser
parser.model=models/englishPCFG.ser.gz
coref.model=models/coref/corefClassifierAll.March2009.ser.gz
coref.name.dir=models/coref
wordnet.dir=models/wordnet-3.0-prolog

Using this properties file, I run the pipeline's interactive shell as follows:
java -cp classes/:lib/xom.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties

In the above setup, the system displays a shell-like prompt and waits for stdin input. You can input any English text. Processing starts after each new line and the output is displayed at the standard output in a format (somewhat) interpretable by humans. For example, for the input "Reagan announced he had Alzheimer's disease, an incurable brain affliction." the shell displays the following output:
[Text=Reagan PartOfSpeech=NNP Lemma=Reagan NamedEntityTag=PERSON]
[Text=announced PartOfSpeech=VBD Lemma=announce NamedEntityTag=O]
[Text=he PartOfSpeech=PRP Lemma=he NamedEntityTag=O]
[Text=had PartOfSpeech=VBD Lemma=have NamedEntityTag=O]
[Text=Alzheimer PartOfSpeech=NNP Lemma=Alzheimer NamedEntityTag=O]
[Text='s PartOfSpeech=POS Lemma='s NamedEntityTag=O]
[Text=disease PartOfSpeech=NN Lemma=disease NamedEntityTag=O]
[Text=, PartOfSpeech=, Lemma=, NamedEntityTag=O]
[Text=an PartOfSpeech=DT Lemma=a NamedEntityTag=O]
[Text=incurable PartOfSpeech=JJ Lemma=incurable NamedEntityTag=O]
[Text=brain PartOfSpeech=NN Lemma=brain NamedEntityTag=O]
[Text=affliction PartOfSpeech=NN Lemma=affliction NamedEntityTag=O]
[Text=. PartOfSpeech=. Lemma=. NamedEntityTag=O]

(ROOT (S (NP (NNP Reagan)) (VP (VBD announced) (SBAR (S (NP (PRP he)) (VP (VBD had) (NP (NP (NP (NNP Alzheimer) (POS 's)) (NN disease)) (, ,) (NP (DT an) (JJ incurable) (NN brain) (NN affliction))))))) (. .)))

nsubj(announced-2, Reagan-1)
nsubj(had-4, he-3)
ccomp(announced-2, had-4)
poss(disease-7, Alzheimer-5)
dobj(had-4, disease-7)
det(affliction-12, an-9)
amod(affliction-12, incurable-10)
nn(affliction-12, brain-11)
appos(disease-7, affliction-12)

Here, the first part of the output shows the individual words and their attributes, e.g., POS and NE tags; the second block shows the constituent parse tree; and the last block shows the syntactic dependencies extracted from the parse tree. Note that the coreference chains are stored in the individual words. For example, the referent for the "he" pronoun is stored as "CorefDest=1 1", which means that the referent is the first token in the first sentence in this text, i.e., "Reagan".
Alternatively, if you want to process all the .txt files in the directory data/, use this command line:
java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties -file data -extension .txt

Or, you can store all the files that you want processed, one per line, in a separate file, and pass the latter file to StanfordCoreNLP with the following options:
java -cp classes/:lib/xom.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props pipe.properties -filelist list_of_files_to_process.txt

In the latter cases the pipeline generates a file.txt.xml output file for every file.txt it processes. For example, if file.txt contains the following text:
Federal Reserve Chairman Ben Bernanke declared Friday that the U.S. economy is on the verge of a long-awaited recovery.

The pipeline generates the following XML output in file.txt.xml:
<?xml version="1.0" encoding="UTF-8"?> <root xmlns="http://nlp.stanford.edu"> <sentence> <wordTable> <wordInfo id="1"> <word>Federal</word> <lemma>Federal</lemma> <POS>NNP</POS> <NER>ORGANIZATION</NER> </wordInfo> <wordInfo id="2"> <word>Reserve</word> <lemma>Reserve</lemma> <POS>NNP</POS> <NER>ORGANIZATION</NER> </wordInfo> <wordInfo id="3"> <word>Chairman</word> <lemma>Chairman</lemma> <POS>NNP</POS> <NER>O</NER> </wordInfo> <wordInfo id="4"> <word>Ben</word> <lemma>Ben</lemma> <POS>NNP</POS> <NER>PERSON</NER> </wordInfo> <wordInfo id="5"> <word>Bernanke</word> <lemma>Bernanke</lemma> <POS>NNP</POS> <NER>PERSON</NER> </wordInfo> <wordInfo id="6"> <word>declared</word> <lemma>declare</lemma> <POS>VBD</POS> <NER>O</NER> </wordInfo> <wordInfo id="7"> <word>Friday</word> <lemma>Friday</lemma> <POS>NNP</POS> <NER>DATE</NER> </wordInfo> <wordInfo id="8"> <word>that</word> <lemma>that</lemma> <POS>IN</POS> <NER>O</NER> </wordInfo> <wordInfo id="9"> <word>the</word> <lemma>the</lemma> <POS>DT</POS> <NER>O</NER> </wordInfo> <wordInfo id="10"> <word>U.S.</word> <lemma>U.S.</lemma> <POS>NNP</POS> <NER>LOCATION</NER> </wordInfo> <wordInfo id="11"> <word>economy</word> <lemma>economy</lemma> <POS>NN</POS> <NER>O</NER> </wordInfo> <wordInfo id="12"> <word>is</word> <lemma>be</lemma> <POS>VBZ</POS> <NER>O</NER> </wordInfo> <wordInfo id="13"> <word>on</word> <lemma>on</lemma> <POS>IN</POS> <NER>O</NER> </wordInfo> <wordInfo id="14"> <word>the</word> <lemma>the</lemma> <POS>DT</POS> <NER>O</NER> </wordInfo> <wordInfo id="15"> <word>verge</word> <lemma>verge</lemma> <POS>NN</POS> <NER>O</NER> </wordInfo> <wordInfo id="16"> <word>of</word> <lemma>of</lemma> <POS>IN</POS> <NER>O</NER> </wordInfo> <wordInfo id="17"> <word>a</word> <lemma>a</lemma> <POS>DT</POS> <NER>O</NER> </wordInfo> <wordInfo id="18"> <word>long-awaited</word> <lemma>long-awaited</lemma> <POS>JJ</POS> <NER>O</NER> </wordInfo> <wordInfo id="19"> <word>recovery</word> <lemma>recovery</lemma> <POS>NN</POS> <NER>O</NER> </wordInfo> <wordInfo id="20"> <word>.</word> <lemma>.</lemma> <POS>.</POS> <NER>O</NER> </wordInfo> </wordTable> <parse>(ROOT (S (NP (NNP Federal) (NNP Reserve) (NNP Chairman) (NNP Ben) (NNP Bernanke)) (VP (VBD declared) (NP-TMP (NNP Friday)) (SBAR (IN that) (S (NP (DT the) (NNP U.S.) (NN economy)) (VP (VBZ is) (PP (IN on) (NP (NP (DT the) (NN verge)) (PP (IN of) (NP (DT a) (JJ long-awaited) (NN recovery))))))))) (. 
.)))</parse> <dependencies> <dep type="nn"> <governor idx="5">Bernanke</governor> <dependent idx="1">Federal</dependent> </dep> <dep type="nn"> <governor idx="5">Bernanke</governor> <dependent idx="2">Reserve</dependent> </dep> <dep type="nn"> <governor idx="5">Bernanke</governor> <dependent idx="3">Chairman</dependent> </dep> <dep type="nn"> <governor idx="5">Bernanke</governor> <dependent idx="4">Ben</dependent> </dep> <dep type="nsubj"> <governor idx="7">Friday</governor> <dependent idx="5">Bernanke</dependent> </dep> <dep type="dep"> <governor idx="7">Friday</governor> <dependent idx="6">declared</dependent> </dep> <dep type="complm"> <governor idx="12">is</governor> <dependent idx="8">that</dependent> </dep> <dep type="det"> <governor idx="11">economy</governor> <dependent idx="9">the</dependent> </dep> <dep type="nn"> <governor idx="11">economy</governor> <dependent idx="10">U.S.</dependent> </dep> <dep type="nsubj"> <governor idx="12">is</governor> <dependent idx="11">economy</dependent> </dep> <dep type="ccomp"> <governor idx="7">Friday</governor> <dependent idx="12">is</dependent> </dep> <dep type="prep"> <governor idx="12">is</governor> <dependent idx="13">on</dependent> </dep> <dep type="det"> <governor idx="15">verge</governor> <dependent idx="14">the</dependent> </dep> <dep type="pobj"> <governor idx="13">on</governor> <dependent idx="15">verge</dependent> </dep> <dep type="prep"> <governor idx="15">verge</governor> <dependent idx="16">of</dependent> </dep> <dep type="det"> <governor idx="19">recovery</governor> <dependent idx="17">a</dependent> </dep> <dep type="amod"> <governor idx="19">recovery</governor> <dependent idx="18">long-awaited</dependent> </dep> <dep type="pobj"> <governor idx="16">of</governor> <dependent idx="19">recovery</dependent> </dep> </dependencies> </sentence> </root>
If the NFL annotator is enabled, additional XML output is generated for the corresponding domain-specific entities and relations. For example, for the sentence "The 49ers beat Dallas 20-10 in the Sunday game." the NFL-specific output is:
<MachineReading>
  <entities>
    <entity id="EntityMention1"> <type>NFLTeam</type> <span start="1" end="2" /> </entity>
    <entity id="EntityMention2"> <type>NFLTeam</type> <span start="3" end="4" /> </entity>
    <entity id="EntityMention3"> <type>FinalScore</type> <span start="4" end="5" /> </entity>
    <entity id="EntityMention4"> <type>FinalScore</type> <span start="6" end="7" /> </entity>
    <entity id="EntityMention5"> <type>Date</type> <span start="9" end="10" /> </entity>
    <entity id="EntityMention6"> <type>NFLGame</type> <span start="10" end="11" /> </entity>
  </entities>
  <relations>
    <relation id="RelationMention-11">
      <type>teamScoringAll</type>
      <arguments>
        <entity id="EntityMention3"> <type>FinalScore</type> <span start="4" end="5" /> </entity>
        <entity id="EntityMention1"> <type>NFLTeam</type> <span start="1" end="2" /> </entity>
      </arguments>
    </relation>
    <relation id="RelationMention-17">
      <type>teamScoringAll</type>
      <arguments>
        <entity id="EntityMention4"> <type>FinalScore</type> <span start="6" end="7" /> </entity>
        <entity id="EntityMention2"> <type>NFLTeam</type> <span start="3" end="4" /> </entity>
      </arguments>
    </relation>
    <relation id="RelationMention-20">
      <type>teamFinalScore</type>
      <arguments>
        <entity id="EntityMention4"> <type>FinalScore</type> <span start="6" end="7" /> </entity>
        <entity id="EntityMention6"> <type>NFLGame</type> <span start="10" end="11" /> </entity>
      </arguments>
    </relation>
    <relation id="RelationMention-25">
      <type>gameDate</type>
      <arguments>
        <entity id="EntityMention5"> <type>Date</type> <span start="9" end="10" /> </entity>
        <entity id="EntityMention6"> <type>NFLGame</type> <span start="10" end="11" /> </entity>
      </arguments>
    </relation>
  </relations>
</MachineReading>
To construct a pipeline object from a given set of properties, use the StanfordCoreNLP(Properties props) constructor. It creates the pipeline using the annotators given in the "annotators" property (see above for the complete list of properties). The supported values for the "annotators" property are those listed above: tokenize, ssplit, pos, lemma, ner, truecase, parse, coref, dcoref, nfl.
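For example, here is a minimal sketch of building a pipeline this way; the annotator list and model paths are only illustrative (borrowed from the pipe.properties example above) and should be adapted to your installation:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineFactoryExample {
  public static StanfordCoreNLP buildPipeline() {
    // property names as documented above; the values are examples only
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
    props.setProperty("pos.model", "models/left3words-wsj-0-18.tagger");
    props.setProperty("parser.model", "models/englishPCFG.ser.gz");
    // the constructor instantiates and chains the annotators named in "annotators"
    return new StanfordCoreNLP(props);
  }
}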
To run the pipeline over some text, use StanfordCoreNLP.process(Reader reader). This method returns an Annotation object, which stores all the annotations generated for the given text. These annotations can then be accessed with the get(...) calls shown in the samplePipeline example above.
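As a sketch, assuming the process(Reader) signature described above and the same annotation keys and types used in samplePipeline (exact types may differ between releases):

public static void printTags(StanfordCoreNLP pipeline, String text) {
  // run all configured annotators over the text
  Annotation document = pipeline.process(new StringReader(text));
  // walk sentences and tokens, reading attributes with get(...), as in samplePipeline
  for (SentenceAnnotation sentence : document.get(SentencesAnnotation.class)) {
    for (TokenAnnotation token : sentence.get(TokensAnnotation.class)) {
      System.out.println(token.get(TextAnnotation.class) + "/" +
                         token.get(PartOfSpeechAnnotation.class));
    }
  }
}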