RegexNERSequenceClassifier (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.ie.AbstractSequenceClassifier<CoreLabel>
- - edu.stanford.nlp.ie.regexp.RegexNERSequenceClassifier

All Implemented Interfaces:

java.util.function.Function<java.lang.String,java.lang.String>
```
public class RegexNERSequenceClassifier
extends AbstractSequenceClassifier<CoreLabel>
```
A sequence classifier that labels tokens with types based on a simple manual mapping from regular expressions to the types of the entities they are meant to describe. The user provides a file formatted as follows:
```
    regex1    TYPE    overwritableType1,Type2...    priority
    regex2    TYPE    overwritableType1,Type2...    priority
    ...
 
```
where the columns are tab-separated and the last two are optional. Several regexes can be associated with a single type. When multiple regexes match a phrase, the priority ranking is used to choose between the possible types.

This classifier is designed to be used as part of a full NER system to label entities that don't fall into the usual NER categories. It records a label only if the token has not already been NER-annotated, or if it has been annotated but its NER type has been designated overwritable (the third column). Note that this is evaluated token-wise in this classifier, so it may assign a label to a token sequence that is partly background and partly overwritable. (In contrast, RegexNERAnnotator doesn't allow this.) It assigns labels to AnswerAnnotation, while checking for existing labels in NamedEntityTagAnnotation.

The first-column regex may be a sequence of regexes, each separated by whitespace (matching "\\s+"). The rule matches if the successive regexes match a sequence of consecutive tokens in the input. Spaces can only be used to separate regular expression tokens; within a token, \s or a similar non-space representation needs to be used instead.

Notes: Following Java regex conventions, some characters in the file need to be escaped. Only a single backslash should be used though, as these are not String literals. The input to RegexNER will already have been tokenized. So, for example, with our usual English tokenization, things like genitives and commas at the end of words will be separated in the input and matched as separate tokens.

This class isn't implemented very efficiently, since every regex is evaluated at every token position, so it can and does get quite slow if you have a lot of patterns in your NER rules. TokensRegex is a more general framework that provides the functionality of this class, but at present this class is still used.
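To make the file format concrete, the following is a minimal, standard-library-only sketch of how one mapping line could be parsed and then tested against a token sequence. It is not the classifier's actual implementation: the class `MappingLineSketch`, its `Rule` holder, the default priority of 0, and the choice of whole-token matching are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and defaults are hypothetical. */
final class MappingLineSketch {

  /** One parsed rule from a mapping line. */
  static final class Rule {
    final List<Pattern> tokenPatterns = new ArrayList<>();
    String type;
    final Set<String> overwritableTypes = new HashSet<>();
    double priority = 0.0; // assumed default when the fourth column is absent
  }

  /** Parses "regex TAB type [TAB overwritableTypes [TAB priority]]". */
  static Rule parseLine(String line, boolean ignoreCase) {
    String[] cols = line.split("\t");
    if (cols.length < 2 || cols.length > 4) {
      throw new IllegalArgumentException("Expected 2 to 4 tab-separated columns: " + line);
    }
    Rule rule = new Rule();
    int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
    // The first column holds one regex per token, separated by whitespace.
    for (String tokenRegex : cols[0].trim().split("\\s+")) {
      rule.tokenPatterns.add(Pattern.compile(tokenRegex, flags));
    }
    rule.type = cols[1].trim();
    if (cols.length > 2) {
      for (String t : cols[2].trim().split(",")) {
        if (!t.isEmpty()) {
          rule.overwritableTypes.add(t);
        }
      }
    }
    if (cols.length > 3) {
      rule.priority = Double.parseDouble(cols[3].trim());
    }
    return rule;
  }

  /** True if each token pattern matches the whole token at start, start+1, ... */
  static boolean matchesAt(Rule rule, List<String> tokens, int start) {
    int n = rule.tokenPatterns.size();
    if (start < 0 || start + n > tokens.size()) {
      return false;
    }
    for (int i = 0; i < n; i++) {
      if (!rule.tokenPatterns.get(i).matcher(tokens.get(start + i)).matches()) {
        return false;
      }
    }
    return true;
  }
}
```

For instance, parsing the line `Stanford University<TAB>SCHOOL<TAB>ORGANIZATION,LOCATION<TAB>2` with `ignoreCase` set to true yields two token patterns, and `matchesAt` succeeds at position 1 of the tokens `at`, `stanford`, `university`, `,`. Trying every rule at every position in this way is also what makes the approach slow for large rule sets.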
Author:

jtibs, Mihai

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`DEFAULT_VALID_POS`

- Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
  classIndex, featureFactories, flags, knownLCWords, pad, windowSize

Constructor Summary

Constructors
Constructor and Description
`RegexNERSequenceClassifier(java.io.BufferedReader reader, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)` Make a new instance of this classifier.
`RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels)`
`RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)` Make a new instance of this classifier.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`java.util.List<CoreLabel>`	`classify(java.util.List<CoreLabel> document)` Classify a `List` of something that extends `CoreMap`.
`java.util.List<CoreLabel>`	`classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq, CoreMap doc, CoreMap sent)` Classify a `List` of something that extends `CoreMap` using as additional information whatever is stored in the document and sentence.
`java.util.Set<java.lang.String>`	`labels()` Most AbstractSequenceClassifiers have classIndex set.
`void`	`loadClassifier(java.io.ObjectInputStream in, java.util.Properties props)` Load a classifier from the specified input stream.
`void`	`serializeClassifier(java.io.ObjectOutputStream oos)` Serialize a sequence classifier to an object output stream.
`void`	`serializeClassifier(java.lang.String serializePath)` Serialize a sequence classifier to a file on the given path.
`void`	`train(java.util.Collection<java.util.List<CoreLabel>> docs, DocumentReaderAndWriter<CoreLabel> readerAndWriter)` Trains a classifier from a Collection of sequences.

Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyFilesAndWriteAnswers, classifyFilesAndWriteAnswers, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResultsSegmenter, defaultReaderAndWriter, dumpFeatures, finalizeClassification, getKnownLCWords, getSampler, getSequenceModel, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, makeObjectBankFromFile, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbs, printProbsDocument, printProbsDocuments, printResults, reinit, segmentString, segmentString, train, train, train, train, train, train, windowSize, writeAnswers

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.function.Function
andThen, compose, identity

- Field Detail
  - DEFAULT_VALID_POS
```
public static final java.lang.String DEFAULT_VALID_POS
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - RegexNERSequenceClassifier
```
public RegexNERSequenceClassifier(java.lang.String mapping,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels)
```
  - RegexNERSequenceClassifier
```
public RegexNERSequenceClassifier(java.lang.String mapping,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels,
                                  java.lang.String validPosRegex)
```
    Make a new instance of this classifier. The ignoreCase option allows case-insensitive regular expression matching, allowing the idea that the provided file might just be a manual list of the possible entities for each type.
    
    Parameters:
    
    mapping - A String describing a file/classpath/URI for the RegexNER patterns
    
    ignoreCase - The regex in the mapping file should be compiled ignoring case
    
    overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
    
    validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex which is matched with find() [not matches()] and which must be matched by the POS of at least one word in the sequence for it to be labeled via any matching rules. (Note that this is a postfilter; using this will not speed up matching.)
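The difference between `find()` and `matches()` in the `validPosRegex` description can be sketched with plain `java.util.regex`. This is an illustration of the documented rule, not the classifier's code; the class `PosFilterSketch`, the method `passesPosFilter`, and the example pattern `^NN` are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; names are hypothetical. */
final class PosFilterSketch {

  /**
   * Returns true if the POS filter lets a matched token sequence through:
   * a null or empty regex accepts everything; otherwise at least one POS tag
   * in the sequence must contain a match for the regex (find, not matches).
   */
  static boolean passesPosFilter(String validPosRegex, List<String> posTags) {
    if (validPosRegex == null || validPosRegex.isEmpty()) {
      return true;
    }
    Pattern validPos = Pattern.compile(validPosRegex);
    for (String pos : posTags) {
      if (pos != null && validPos.matcher(pos).find()) {
        return true;
      }
    }
    return false;
  }
}
```

With the pattern `^NN`, the tag sequence `DT`, `NNS` passes because `find()` locates `NN` at the start of `NNS`, whereas `matches()` would reject `NNS` since the pattern does not cover the whole tag. Because the check runs only after a rule has already matched, it narrows the output without reducing the matching work.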
  - RegexNERSequenceClassifier
```
public RegexNERSequenceClassifier(java.io.BufferedReader reader,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels,
                                  java.lang.String validPosRegex)
```
    Make a new instance of this classifier. The ignoreCase option allows case-insensitive regular expression matching, allowing the idea that the provided file might just be a manual list of the possible entities for each type.
    
    Parameters:
    
    reader - A Reader for the RegexNER patterns
    
    ignoreCase - The regex in the mapping file should be compiled ignoring case
    
    overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
    
    validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex, and only words with a POS that matches the regex will be labeled via any matching rules.
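The interaction between `overwriteMyLabels` and the overwritable types from the mapping file can be read as a per-token decision. The sketch below encodes one reading of the descriptions above and is not taken from the implementation; the class `OverwriteRuleSketch`, the method `mayLabel`, the `ownTypes` set, and the background symbol `"O"` are assumptions.

```java
import java.util.Set;

/** Illustrative sketch only; names and the background symbol are assumed. */
final class OverwriteRuleSketch {

  static final String BACKGROUND = "O"; // assumed background (non-entity) label

  /**
   * Decides whether a single token may receive a new label.
   *
   * @param existingNer       the token's current NER label, or null if none
   * @param overwritableTypes types the matching rule declares overwritable
   * @param ownTypes          types this classifier itself can assign
   * @param overwriteMyLabels whether labels from an earlier run may be replaced
   */
  static boolean mayLabel(String existingNer,
                          Set<String> overwritableTypes,
                          Set<String> ownTypes,
                          boolean overwriteMyLabels) {
    if (existingNer == null || existingNer.equals(BACKGROUND)) {
      return true; // not yet NER-annotated
    }
    if (overwritableTypes.contains(existingNer)) {
      return true; // rule explicitly allows replacing this type
    }
    // Otherwise only labels produced by this kind of classifier may be replaced.
    return overwriteMyLabels && ownTypes.contains(existingNer);
  }
}
```

Under this reading, a token tagged `LOCATION` is relabeled only by a rule that lists `LOCATION` as overwritable, while a token tagged with one of the classifier's own types is relabeled only when `overwriteMyLabels` is true.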
- Method Detail
  - labels
```
public java.util.Set<java.lang.String> labels()
```
    Most AbstractSequenceClassifiers have classIndex set. ClassifierCombiner calls labels() to get the values from the index.
    TODO: check that classIndex isn't used anywhere other than the call to labels()
    
    Overrides:
    
    labels in class AbstractSequenceClassifier<CoreLabel>
  - classify
```
public java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
```
    Description copied from class: AbstractSequenceClassifier
    
    Classify a List of something that extends CoreMap. The classifications are added in place to the items of the document, which is also returned by this method. Warning: In many circumstances, you should not call this method directly. In particular, if you call this method directly, your document will not be preprocessed to add things like word distributional similarity class or word shape features that your classifier may rely on to work correctly. In such cases, you should call classifySentence instead.
    
    Specified by:
    
    classify in class AbstractSequenceClassifier<CoreLabel>
    
    Parameters:
    
    document - A List of something that extends CoreMap.
    
    Returns:
    
    The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key). The answers will be the class labels defined by the CRF Classifier. They might be things like entity labels (in BIO notation or not) or something like "1" vs. "0" on whether to begin a new token here or not (in word segmentation).
  - classifyWithGlobalInformation
```
public java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq,
                                                               CoreMap doc,
                                                               CoreMap sent)
```
    Description copied from class: AbstractSequenceClassifier
    
    Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
    
    Specified by:
    
    classifyWithGlobalInformation in class AbstractSequenceClassifier<CoreLabel>
    
    Parameters:
    
    tokenSeq - A List of something that extends CoreMap
    
    Returns:
    
    Classified version of the input tokenSequence
  - train
```
public void train(java.util.Collection<java.util.List<CoreLabel>> docs,
                  DocumentReaderAndWriter<CoreLabel> readerAndWriter)
```
    Description copied from class: AbstractSequenceClassifier
    
    Trains a classifier from a Collection of sequences. Note that the Collection can be (and usually is) an ObjectBank.
    
    Specified by:
    
    train in class AbstractSequenceClassifier<CoreLabel>
    
    Parameters:
    
    docs - An ObjectBank or a collection of sequences of IN
    
    readerAndWriter - A DocumentReaderAndWriter to use when loading test files
  - serializeClassifier
```
public void serializeClassifier(java.lang.String serializePath)
```
    Description copied from class: AbstractSequenceClassifier
    
    Serialize a sequence classifier to a file on the given path.
    
    Specified by:
    
    serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
    
    Parameters:
    
    serializePath - The path/filename to write the classifier to.
  - serializeClassifier
```
public void serializeClassifier(java.io.ObjectOutputStream oos)
```
    Description copied from class: AbstractSequenceClassifier
    
    Serialize a sequence classifier to an object output stream.
    
    Specified by:
    
    serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
  - loadClassifier
```
public void loadClassifier(java.io.ObjectInputStream in,
                           java.util.Properties props)
                    throws java.io.IOException,
                           java.lang.ClassCastException,
                           java.lang.ClassNotFoundException
```
    Description copied from class: AbstractSequenceClassifier
    
    Load a classifier from the specified input stream. The classifier is reinitialized from the flags serialized in the classifier.
    
    Specified by:
    
    loadClassifier in class AbstractSequenceClassifier<CoreLabel>
    
    Parameters:
    
    in - The InputStream to load the serialized classifier from
    
    props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
    
    Throws:
    
    java.io.IOException - If there are problems accessing the input stream
    
    java.lang.ClassCastException - If there are problems interpreting the serialized data
    
    java.lang.ClassNotFoundException - If there are problems interpreting the serialized data
