Package edu.stanford.nlp.ie

This package and its subpackages implement various approaches to information extraction.


Class Summary
AbstractSequenceClassifier<IN extends CoreMap> This class provides common functionality for (probabilistic) sequence models.
AcquisitionsPrior<IN extends CoreMap> Entity prior for Gibbs inference over the CMU corporate acquisitions data.
ClassifierCombiner<IN extends CoreMap & HasWord> Merges the outputs of two or more AbstractSequenceClassifiers according to a simple precedence scheme: each base classifier contributes only classifications for label types not produced by the classifiers listed before it, and only where its labels have no token overlap with labels assigned by higher-priority classifiers (see the example after this table).
EmpiricalNERPrior<IN extends CoreMap> Empirically derived entity prior for Gibbs inference in NER sequence models.
EntityCachingAbstractSequencePrior<IN extends CoreMap> This class keeps track of all labeled entities and updates its list whenever the label at a position changes.
NERFeatureFactory<IN extends CoreLabel> Features for Named Entity Recognition.
NERServer A named-entity recognizer server for Stanford's NER.
NERServer.NERClient This example sends material to the NER server one line at a time.
SeminarsPrior<IN extends CoreMap> Entity prior for Gibbs inference over the CMU seminar announcements data.
UniformPrior<IN extends CoreMap> Uniform prior to be used for generic Gibbs inference in the ie.crf.CRFClassifier
 
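For example, a ClassifierCombiner can be layered over two serialized CRF models. The following is a minimal sketch, not code from the distribution: the model paths are placeholders, and the varargs constructor and the classifyToString call are assumptions about the API.

  import edu.stanford.nlp.ie.AbstractSequenceClassifier;
  import edu.stanford.nlp.ie.ClassifierCombiner;
  import edu.stanford.nlp.ie.crf.CRFClassifier;
  import edu.stanford.nlp.ling.CoreLabel;

  public class CombinerSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder paths to serialized CRF models.
      AbstractSequenceClassifier<CoreLabel> first =
          CRFClassifier.getClassifier("ner.3class.ser.gz");
      AbstractSequenceClassifier<CoreLabel> second =
          CRFClassifier.getClassifier("ner.7class.ser.gz");

      // Assumed constructor: classifiers listed in decreasing priority,
      // so labels from 'first' win wherever the two overlap.
      ClassifierCombiner<CoreLabel> combined =
          new ClassifierCombiner<CoreLabel>(first, second);

      System.out.println(combined.classifyToString(
          "Stanford University was founded in 1891."));
    }
  }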

Package edu.stanford.nlp.ie Description

This package and its subpackages implement various approaches to information extraction. Some examples of use appear later in this description. At the moment, five types of information extraction are supported (where some of these have internal variants):

  1. Regular expression based matching: These extractors are hand-written and extract whatever the regular expression matches (a minimal illustration appears after this list).
  2. Conditional Random Field classifier: A sequence tagger based on a CRF model that can be used for NER tagging and other sequence labeling tasks.
  3. Conditional Markov Model classifier: A classifier based on a CMM model that can be used for NER tagging and other sequence labeling tasks.
  4. Hidden Markov Model based extractors: These can be either single-field extractors or two-level HMMs in which the individual component models, and how they are glued together, are trained separately. These models are trained automatically, but require tagged training data.
  5. Description extractor: This does higher-level NLP analysis of sentences (using a POS tagger and chunker) to find sentences that describe an object, such as a biography of a person or a description of an animal. This module is fixed: there is nothing to write or train (unless one wants to change its internal behavior).
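
As a minimal illustration of the first, regular-expression-based style, the fragment below uses only java.util.regex rather than any class from this package; the pattern and sample text are invented for the example.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class RegexExtractorSketch {
    // A hand-written pattern for US-style phone numbers.
    private static final Pattern PHONE =
        Pattern.compile("\\(?\\d{3}\\)?[-. ]\\d{3}[-. ]\\d{4}");

    public static void main(String[] args) {
      String text = "Call (650) 723-2300 or 650-725-1234 for details.";
      Matcher m = PHONE.matcher(text);
      while (m.find()) {
        // The extractor reports exactly what the expression matches.
        System.out.println("PHONE: " + m.group());
      }
    }
  }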

There are some demonstrations here which you can run (and several other classes have main() methods that exhibit their functionality):

  1. NERGUI is a simple GUI front-end to the NER tagging components.
  2. crf/NERGUI is a simple GUI front-end to the CRF-based NER tagging components. This version only supports the CRF-based NER tagger.
  3. demo/NERDemo is a simple class exemplifying programmatic use of the CRF-based NER tagger; a sketch along the same lines appears after this list.
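
In the spirit of demo/NERDemo, the CRF-based tagger can be driven programmatically roughly as below. This is a sketch: the serialized model path is a placeholder, and the CRFClassifier loading and classification methods shown are assumed to match your version.

  import java.util.List;

  import edu.stanford.nlp.ie.crf.CRFClassifier;
  import edu.stanford.nlp.ling.CoreAnnotations;
  import edu.stanford.nlp.ling.CoreLabel;

  public class NERDemoSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder path to a serialized CRF NER model.
      CRFClassifier<CoreLabel> classifier =
          CRFClassifier.getClassifier("english.all.3class.crf.ser.gz");

      String text = "Stanford University is located in California.";

      // Inline output, e.g. Stanford/ORGANIZATION ... California/LOCATION
      System.out.println(classifier.classifyToString(text));

      // Token-level access to the assigned labels.
      for (List<CoreLabel> sentence : classifier.classify(text)) {
        for (CoreLabel token : sentence) {
          System.out.println(token.word() + "\t"
              + token.get(CoreAnnotations.AnswerAnnotation.class));
        }
      }
    }
  }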

Usage examples

0. Setup: For all of these examples except 3., you need to be connected to the Internet, and the application's web search module must be able to connect to search engines. The web search functionality is provided by the supplied edu.stanford.nlp.web package. How web search works is controlled by a websearch.init file in your current directory (if none is present, you will get search results from AltaVista). If you are registered to use the GoogleAPI, you should probably edit this file so that web queries go to Google via their SOAP interface. Even if you are not, you can specify additional or different search engines to access in websearch.init. A copy of this file is supplied in the distribution. The DescExtractor in 4. also requires another init file so that it can use the included part-of-speech tagger.

1. Corporate Contact Information. This illustrates simple information extraction from a web page. Run the included ExtractDemo.bat, or by hand:

java edu.stanford.nlp.ie.ExtractDemo

2. Corporate Contact Information, merged. This illustrates the addition of information merging across web pages. Run the included MergeExtractDemo.bat, or by hand:

java edu.stanford.nlp.ie.ExtractDemo -m

The ExtractDemo screen is similar, but adds a button to Select a Merger.

3. Company names via direct use of an HMM information extractor. One can also train, load, and use HMM information extractors directly, without the RDF-based KAON framework (http://kaon.semanticweb.org/) that ExtractDemo relies on.
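
A hypothetical sketch of that workflow follows. The class and method names here (HMMFieldExtractor, train, store, load, extract) are invented stand-ins, not the actual API; only the train/serialize/load/extract cycle is meant literally.

  // Hypothetical names throughout; only the workflow is meant literally.
  HMMFieldExtractor extractor = new HMMFieldExtractor("company_name");
  extractor.train(labeledTrainingDocuments);     // requires tagged training data
  extractor.store(new File("company.hmm.ser"));  // serialize the trained model

  HMMFieldExtractor loaded = HMMFieldExtractor.load(new File("company.hmm.ser"));
  String company = loaded.extract("Acme Corp. today announced ...");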

4. Extraction of descriptions (such as biographical information about a person or a description of an animal). This does extraction of such descriptions from a web page. The component uses a POS tagger and looks up the path to it in the file descextractor.init in the current directory, so you should run it from the root directory of the distribution, which contains such a file. Double-click the included MergeExtractDemo.bat in that directory, or by hand, equivalently:

java edu.stanford.nlp.ie.ExtractDemo -m



Stanford NLP Group