edu.stanford.nlp.ie (Stanford JavaNLP API)

Interface Summary
Interface	Description
KBPRelationExtractor	An interface for a KBP-style relation extractor
PriorModelFactory<IN extends CoreMap>

Class Summary
Class	Description
AbstractSequenceClassifier<IN extends CoreMap>	This class provides common functionality for (probabilistic) sequence models.
ChineseMorphFeatureSets	A class for holding Chinese morphological features used for word segmentation and POS tagging.
ChineseQuantifiableEntityNormalizer	A Chinese counterpart of the `QuantifiableEntityNormalizer` that normalizes NUMBER, DATE, TIME, MONEY, PERCENT and ORDINAL amounts expressed in Chinese.
ClassifierCombiner<IN extends CoreMap & HasWord>	Merges the outputs of two or more AbstractSequenceClassifiers according to a simple precedence scheme: any given base classifier contributes only classifications of labels that do not exist in the base classifiers specified before, and that do not have any token overlap with labels assigned by higher priority classifiers.
EmbeddingFeatureFactory	For features generated from word embeddings.
EmpiricalNERPrior<IN extends CoreMap>	This was the empirical NER prior used for long distance consistency in the Finkel et al.
EmpiricalNERPriorBIO<IN extends CoreMap>
EmpiricalNERPriorBIOFactory<IN extends CoreMap>
EmpiricalNERPriorFactory<IN extends CoreMap>	Used for creating an NER prior by reflection.
EntityCachingAbstractSequencePrior<IN extends CoreMap>	This class keeps track of all labeled entities and updates its list whenever the label at a point gets changed.
EntityCachingAbstractSequencePriorBIO<IN extends CoreMap>	This class keeps track of all labeled entities and updates its list whenever the label at a point gets changed.
KBPBasicSpanishCorefSystem	Perform basic coreference for Spanish
KBPEnsembleExtractor	An ensemble of other KBP relation extractors.
KBPRelationExtractor.Accuracy	A class to compute the accuracy of a relation extractor.
KBPRelationExtractor.KBPInput
KBPSemgrexExtractor	A Semgrex extractor for KBP.
KBPStatisticalExtractor	A relation extractor to work with Victor's new KBP data.
KBPTokensregexExtractor	A tokensregex extractor for KBP.
KBPTokensregexExtractor.Object	IMPORTANT: Don't rename this class without updating the rules defs file.
KBPTokensregexExtractor.Subject	IMPORTANT: Don't rename this class without updating the rules defs file.
NERClassifierCombiner	Subclass of ClassifierCombiner that behaves like a NER, by copying the AnswerAnnotation labels to NERAnnotation.
NERFeatureFactory<IN extends CoreLabel>	Features for Named Entity Recognition.
NERFeatureFactory.FeatureCollector	This class handles collecting features into a set, in a more memory efficient way.
NERGUI
NERServer	A named-entity recognizer server for Stanford's NER.
NERServer.NERClient	This example sends material to the NER server one line at a time.
NumberNormalizer	Provides functions for converting words to numbers.
PresetSequenceClassifier<IN extends CoreMap>	Created by jebolton on 7/14/17.
QuantifiableEntityNormalizer	Various methods for normalizing Money, Date, Percent, Time, Number, and Ordinal amounts.
UniformPrior<IN extends CoreMap>	Uniform prior to be used for generic Gibbs inference in the ie.crf.CRFClassifier.
UniformPriorFactory<IN extends CoreMap>
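
The precedence scheme described for ClassifierCombiner in the table above can be illustrated with a small standalone sketch. This is not the library's implementation: the class name `PrecedenceCombinerSketch`, the `combine` method, the use of plain string arrays for token labels, the `O` background label, and the treatment of a run of identical labels as one entity are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a precedence-based merge; not the ClassifierCombiner source.
class PrecedenceCombinerSketch {

    // Assumed background label for tokens that are not part of any entity.
    static final String BACKGROUND = "O";

    /**
     * labelSets.get(c) is the set of labels classifier c can assign and
     * outputs.get(c) is its per-token tagging. Index 0 has the highest priority.
     */
    static String[] combine(List<Set<String>> labelSets, List<String[]> outputs) {
        String[] merged = outputs.get(0).clone();
        Set<String> claimed = new HashSet<>(labelSets.get(0));
        for (int c = 1; c < outputs.size(); c++) {
            String[] tags = outputs.get(c);
            int start = 0;
            while (start < tags.length) {
                String tag = tags[start];
                int end = start + 1;
                // A run of identical labels is treated as one entity span.
                while (end < tags.length && tags[end].equals(tag)) {
                    end++;
                }
                // Keep the span only if its label is new to this classifier and
                // none of its tokens were labeled by a higher-priority classifier.
                if (!tag.equals(BACKGROUND) && !claimed.contains(tag) && isFree(merged, start, end)) {
                    Arrays.fill(merged, start, end, tag);
                }
                start = end;
            }
            claimed.addAll(labelSets.get(c));
        }
        return merged;
    }

    static boolean isFree(String[] merged, int start, int end) {
        for (int i = start; i < end; i++) {
            if (!merged[i].equals(BACKGROUND)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tokens: Smith joined Bank of America for $ 5
        Set<String> first = new HashSet<>(Arrays.asList("PERSON", "ORGANIZATION"));
        Set<String> second = new HashSet<>(Arrays.asList("PERSON", "LOCATION", "MONEY"));
        String[] out1 = {"PERSON", "O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "O", "O"};
        String[] out2 = {"O", "PERSON", "O", "O", "LOCATION", "O", "MONEY", "MONEY"};
        String[] merged = combine(Arrays.asList(first, second), Arrays.asList(out1, out2));
        // prints: PERSON O ORGANIZATION ORGANIZATION ORGANIZATION O MONEY MONEY
        System.out.println(String.join(" ", merged));
    }
}
```

In this example the second classifier's PERSON span is dropped because PERSON already belongs to the first classifier, its LOCATION span is dropped because it overlaps an ORGANIZATION span, and only its MONEY span is added.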

Enum Summary
Enum	Description
KBPRelationExtractor.NERTag	A list of valid KBP NER tags.
KBPRelationExtractor.RelationType	Known relation types (last updated for the 2013 shared task).
KBPRelationExtractor.RelationType.Cardinality
NERClassifierCombiner.Language

Package edu.stanford.nlp.ie Description

This package contains classes and subpackages for information extraction. Some examples of use appear later in this description. At the moment, the following types of information extraction are supported (some of which have internal variants):

Regular expression based matching: These extractors are hand-written and match whatever the regular expression matches.
Conditional Random Fields classifier: A sequence tagger based on a CRF model that can be used for NER tagging and other sequence labeling tasks.
Conditional Markov Model classifier: A classifier based on a CMM model that can be used for NER tagging and other labeling tasks.
Hidden Markov model based extractors: These can be either single field extractors or two level HMMs where the individual component models and the way they are glued together are trained separately. These models are trained automatically, but require tagged training data.
Description extractor: This does higher level NLP analysis of sentences (using a POS tagger and chunker) to find sentences that describe an object. This might be a biography of a person, or a description of an animal. This module is fixed: there is nothing to write or train (unless one wants to start to change its internal behavior).
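
As a rough illustration of the first item in the list above, the following sketch shows what a hand-written regular expression extractor amounts to, using only `java.util.regex`. It does not use any class from this package: the class name `RegexFieldExtractorSketch`, the `extract` method, and both patterns are assumptions chosen for the example, and the patterns are deliberately simple rather than complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the package's own regex extractors are not shown here.
class RegexFieldExtractorSketch {

    // Simplified hand-written patterns (assumptions, not exhaustive).
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");
    static final Pattern US_PHONE = Pattern.compile("\\(?\\d{3}\\)?[ .-]?\\d{3}[.-]\\d{4}");

    /** Returns every substring of text that the pattern matches, in order. */
    static List<String> extract(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String page = "Contact sales@example.com or call (650) 555-0100 for details.";
        System.out.println(extract(EMAIL, page));    // prints: [sales@example.com]
        System.out.println(extract(US_PHONE, page)); // prints: [(650) 555-0100]
    }
}
```

Such an extractor returns exactly what the expression matches, so its quality depends entirely on how carefully the pattern is written.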

Several demonstrations here can be run directly (and several other classes have main() methods which exhibit their functionality):

NERGUI is a simple GUI front-end to the NER tagging components.
crf/NERGUI is a simple GUI front-end to the CRF-based NER tagging components. This version only supports the CRF-based NER tagger.
demo/NERDemo is a simple class exemplifying the programmatic use of the CRF-based NER tagger.

Usage examples

0. Setup: For all of these examples except 3., you need to be connected to the Internet, and the application's web search module must be able to connect to search engines. The web search functionality is provided by the supplied edu.stanford.nlp.web package. How web search works is controlled by a websearch.init file in your current directory (if none is present, you will get search results from AltaVista). If you are registered to use the GoogleAPI, you should probably edit this file so that web queries can be sent to Google using their SOAP interface. Even if not, you can specify additional or different search engines to access in websearch.init. A copy of this file is supplied in the distribution. The DescExtractor in 4. also requires another init file so that it can use the included part-of-speech tagger.

1. Corporate Contact Information. This illustrates simple information extraction from a web page. Use the included ExtractDemo.bat, or run by hand: java edu.stanford.nlp.ie.ExtractDemo

Select as Extractor Directory the folder: serialized-extractors/companycontact
Select as an Ontology the one in serialized-extractors/companycontact/Corporation-Information.kaon
Enter Corporation as the Concept to extract.
You can then do various searches:
- You can enter a URL, click Extract, and look at the results:
  - http://www.ziatech.com/
  - http://www.cs.stanford.edu/
  - http://www.ananova.com/business/story/sm_635565.html
  The components will work reasonably well on clean-ish text pages like these. They work even better on text such as newswire or press releases, as one can demonstrate either over the web or using the command line extractor.
- You can do a search for a term and get extraction from the top search hits, by entering a term in the "Search for words" box and pressing "Extract":
  - Audiovox Corporation
  Extraction is done over a number of pages from a search engine, and the results from each are shown. Typically some of these pages will have suitable content to extract, and some just won't.

2. Corporate Contact Information merged. This illustrates the addition of information merger across web pages. Use the included MergeExtractDemo.bat, or equivalently run:

java edu.stanford.nlp.ie.ExtractDemo -m

The ExtractDemo screen is similar, but adds a button to Select a Merger.

Select an Extractor Directory and Ontology as above.
Click on "Select Merger" and then navigate to serialized-extractors/mergers and Select the file unscoredmerger.obj.
Enter the concept "Corporation" as before.
One can now search as above, by URL or by words, but a Merger is only appropriate for a word search with multiple results. Try Search for words:
- Audiovox Corporation
and press "Extract". Results gradually appear. After all results have been processed (this may take a few seconds), a Merged best extracted information result will be produced and displayed as the first of the results. "Merged Instance" will appear on the bottom line corresponding to it, rather than a URL.

3. Company names via direct use of an HMM information extractor. One can also train, load, and use HMM information extractors directly, without using any of the RDF-based KAON framework (http://kaon.semanticweb.org/) used by ExtractDemo.

The file edu.stanford.nlp.ie.hmm.Tester illustrates the use of a pretrained HMM on data via the command line interface:
- cd serialized-extractors/companycontact/
- java edu.stanford.nlp.ie.hmm.Tester cisco.txt company company-name.hmm
- java edu.stanford.nlp.ie.hmm.Tester EarningsReports.txt company company-name.hmm
- java edu.stanford.nlp.ie.hmm.Tester companytest.txt company company-name.hmm
The first shows the HMM running on a file without markup that contains a single document. The second shows a Corpus of several documents separated by ENDOFDOC, which is used as a document delimiter inside a Corpus. This second use of Tester normally expects an annotated corpus on which it can score its answers. Here, the corpus is unannotated, so some of the output is inappropriate, but it shows what is selected as the company name for each document (it's mostly correct...). The final example shows it running on a corpus that does have answers marked in it. It does the testing with the XML elements stripped, but then uses them to evaluate correctness.
To train one's own HMM, one needs data in which one or more fields are annotated in the style of an XML element, with all the documents in one file, separated by lines with ENDOFDOC on them. One can then train (and then test) as follows. Training an HMM (optimizing all its probabilities) takes a long time: it depends on the speed of the computer, but expect 10 minutes or so to adjust probabilities for a fixed structure, and often hours if one additionally attempts structure learning.
1. cd edu/stanford/nlp/ie/training/
2. java -server edu.stanford.nlp.ie.hmm.Trainer companydata.txt company mycompany.hmm
3. java edu.stanford.nlp.ie.hmm.HMMSingleFieldExtractor Company mycompany.hmm mycompany.obj
4. java edu.stanford.nlp.ie.hmm.Tester testdoc.txt company mycompany.hmm
The third step converts a serialized HMM into the serialized objects used in ExtractDemo. Note that company in the second step must match the element name in the marked-up data that you will train on, while Company in the third step must match the relation name in the ontology over which you will extract with mycompany.obj. These two names need not be the same. The last step then runs the trained HMM on a file.
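
The corpus layout described in this example (documents in one file separated by ENDOFDOC lines, with fields marked up as XML-style elements that are stripped for testing and then used for scoring) can be sketched in plain Java. This is only an illustration of the file format, not the code used by Trainer or Tester: the class name `AnnotatedCorpusSketch`, its three methods, and the sample sentences are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the corpus format; not the edu.stanford.nlp.ie.hmm code.
class AnnotatedCorpusSketch {

    static final String DELIMITER = "ENDOFDOC";

    /** Splits a corpus into documents at lines that consist of ENDOFDOC. */
    static List<String> splitDocuments(String corpus) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : corpus.split("\\R")) {
            if (line.trim().equals(DELIMITER)) {
                addIfNonEmpty(docs, current);
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        addIfNonEmpty(docs, current); // last document may lack a trailing delimiter
        return docs;
    }

    private static void addIfNonEmpty(List<String> docs, StringBuilder text) {
        String doc = text.toString().trim();
        if (!doc.isEmpty()) {
            docs.add(doc);
        }
    }

    /** Returns the text inside every <field>...</field> element of a document. */
    static List<String> goldAnswers(String doc, String field) {
        String name = Pattern.quote(field);
        Pattern element = Pattern.compile("<" + name + ">(.*?)</" + name + ">", Pattern.DOTALL);
        Matcher matcher = element.matcher(doc);
        List<String> answers = new ArrayList<>();
        while (matcher.find()) {
            answers.add(matcher.group(1).trim());
        }
        return answers;
    }

    /** Removes the field's opening and closing tags, leaving the plain text. */
    static String stripMarkup(String doc, String field) {
        return doc.replaceAll("</?" + Pattern.quote(field) + ">", "");
    }

    public static void main(String[] args) {
        // Made-up sample corpus with the element name "company".
        String corpus = "<company>Cisco Systems</company> reported earnings.\n"
                + "ENDOFDOC\n"
                + "Shares of <company>Audiovox Corporation</company> rose.\n"
                + "ENDOFDOC\n";
        for (String doc : splitDocuments(corpus)) {
            System.out.println(stripMarkup(doc, "company") + " -> " + goldAnswers(doc, "company"));
        }
    }
}
```

The element name passed as `field` plays the role of company in the training command: it has to match the tag used in the marked-up data.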

4. Extraction of descriptions (such as biographical information about a person or a description of an animal). This extracts such descriptions from a web page. This component uses a POS tagger, whose path it reads from the file descextractor.init in the current directory. So, you should be in the root directory of the current archive, which has such a file. Double click on the included MergeExtractDemo.bat in that directory, or equivalently run by hand: java edu.stanford.nlp.ie.ExtractDemo -m

Select as Extractor Directory the folder: serialized-extractors/description
Select as an Ontology the one in serialized-extractors/description/Entity-NameDescription.kaon
Click on "Select Merger" and then navigate to serialized-extractors/mergers and Select the file unscoredmerger.obj.
Enter Entity as the Concept to extract.
You can then do various searches for people or animals by entering words in the "Search for words" box and pressing Extract:
- Gareth Evans
- Tawny Frogmouth
- Christopher Manning
- Joshua Nkomo
The first search will be slower than subsequent searches, as it takes a while to load the part of speech tagger.