TokensRegexNERAnnotator (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.pipeline.TokensRegexNERAnnotator

All Implemented Interfaces: Annotator

public class TokensRegexNERAnnotator
extends java.lang.Object
implements Annotator

TokensRegexNERAnnotator labels tokens with types based on a simple manual mapping from regular expressions to the types of the entities they are meant to describe. The user provides a file formatted as follows:

    regex1    TYPE    overwritableType1,Type2...    priority
    regex2    TYPE    overwritableType1,Type2...    priority
    ...

where each argument is tab-separated, and the last two arguments are optional. Several regexes can be associated with a single type. In the case where multiple regexes match a phrase, the priority ranking (higher priority is favored) is used to choose between the possible types. When the priority is the same, then longer matches are favored.
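
As a rough illustration of the file format and the tie-breaking rule described above, the following sketch parses tab-separated mapping lines and picks the preferred rule among two competing matches. It is not the annotator's own parser: the class `MappingSketch`, its `Rule` type, the `prefer` helper, and the default priority of 0 for a missing fourth column are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of CoreNLP: models one line of a mapping file.
final class MappingSketch {

  static final class Rule {
    final String pattern;
    final String type;
    final Set<String> overwritable;
    final double priority;

    Rule(String pattern, String type, Set<String> overwritable, double priority) {
      this.pattern = pattern;
      this.type = type;
      this.overwritable = overwritable;
      this.priority = priority;
    }
  }

  // Splits a line on tabs; the third and fourth columns are optional.
  static Rule parse(String line) {
    String[] cols = line.split("\t");
    if (cols.length < 2) {
      throw new IllegalArgumentException("Need at least regex and TYPE: " + line);
    }
    Set<String> overwritable = new HashSet<>();
    if (cols.length > 2 && !cols[2].trim().isEmpty()) {
      overwritable.addAll(Arrays.asList(cols[2].trim().split("\\s*,\\s*")));
    }
    // Assumption: a missing priority is treated as 0.
    double priority = cols.length > 3 ? Double.parseDouble(cols[3].trim()) : 0.0;
    return new Rule(cols[0], cols[1].trim(), overwritable, priority);
  }

  // Higher priority wins; on a tie, the match covering more tokens wins.
  static Rule prefer(Rule a, int aTokens, Rule b, int bTokens) {
    if (a.priority != b.priority) {
      return a.priority > b.priority ? a : b;
    }
    return aTokens >= bTokens ? a : b;
  }
}
```

For instance, `MappingSketch.parse("Stanford\tSCHOOL")` yields a rule with an empty overwritable set, and a two-token match would be preferred over a one-token match when both rules carry the same priority.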

This annotator is designed to be used as part of a full NER system to label entities that don't fall into the usual NER categories. It records a label only if the token has not already been NER-annotated, or if it has been annotated with an NER type that is designated overwritable (the third argument). It is also possible to use this annotator to annotate fields other than the NamedEntityTagAnnotation field by providing a header that names them (see `mapping.header` and `mapping.field.<fieldname>` under Configuration).

The first column regex may follow one of two formats:

- A TokensRegex expression, marked by starting with "( " and ending with " )". See TokenSequencePattern for TokensRegex syntax.
  Example: ( /University/ /of/ [ {ner:LOCATION} ] ) SCHOOL
- A sequence of regexes, each separated by whitespace (matching "\s+").
  Example: Stanford SCHOOL
  The pattern matches if the successive regexes match a sequence of tokens in the input. Spaces can only be used to separate regular expression tokens; within tokens, \s or a similar non-space representation must be used instead.

Notes: Following Java regex conventions, some characters in the file need to be escaped. Only a single backslash should be used, though, as these are not String literals. The input to RegexNER will already have been tokenized. So, for example, with our usual English tokenization, things like genitives and commas at the end of words will be separated in the input and matched as separate tokens.
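
To make the second format concrete, here is a minimal sketch of matching a whitespace-separated sequence of regexes against already tokenized input using plain `java.util.regex`. It only illustrates the idea; the annotator itself delegates matching to TokensRegex, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the "sequence of regexes" format; not CoreNLP code.
final class TokenRegexSequenceSketch {

  // Splits the first column on whitespace: one regex per token.
  static List<Pattern> compile(String column) {
    List<Pattern> patterns = new ArrayList<>();
    for (String part : column.trim().split("\\s+")) {
      patterns.add(Pattern.compile(part));
    }
    return patterns;
  }

  // Returns the index of the first token at which every regex fully matches
  // its successive token, or -1 if there is no such position.
  static int find(List<Pattern> patterns, List<String> tokens) {
    for (int start = 0; start + patterns.size() <= tokens.size(); start++) {
      boolean all = true;
      for (int i = 0; i < patterns.size() && all; i++) {
        all = patterns.get(i).matcher(tokens.get(start + i)).matches();
      }
      if (all) {
        return start;
      }
    }
    return -1;
  }
}
```

With the tokens `She`, `attends`, `Stanford`, `University`, `'s`, `law`, `school`, the pattern `Stanford University` is found at token index 2, whereas `Stanford's` is not found at all, because the genitive is a separate token.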

This annotator is similar to RegexNERAnnotator but uses TokensRegex as the underlying library for matching regular expressions. This allows for more flexibility in the types of expressions matched as well as utilizing any optimization that is included in the TokensRegex library.

Main differences from RegexNERAnnotator:

- Supports annotation of fields other than the NamedEntityTagAnnotation field
- Supports both TokensRegex patterns and patterns over the text of the tokens
- Whether an NER annotation can be overwritten depends on the original NER labels. The rules for when the new NER labels are used are given below:
  - If the found expression overlaps with a previous NER phrase, then the NER labels are not replaced.
    Example: Old NER phrase: The ABC Company, Found Phrase: ABC => Old NER labels are not replaced.
  - If the found expression has inconsistent NER tags among the tokens, then the NER labels are replaced.
    Example: Old NER phrase: The/O ABC/MISC Company/ORG => The/ORG ABC/ORG Company/ORG
- How validpospattern is handled for POS tags is specified by PosMatchType
- By default, there is no validpospattern
- By default, both O and MISC are always replaced
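
The overwrite rules can be read as a small decision procedure. The sketch below is one interpretation of the rules listed above, not the annotator's actual code. In particular, the order of the checks, the treatment of unlabeled (null) tokens as background, and the choice to apply the overlap test only to non-background labels are assumptions, as are all names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical decision helper illustrating the overwrite rules; not CoreNLP code.
final class OverwriteSketch {

  // Mirrors the documented default background symbols.
  static final Set<String> BACKGROUND = new HashSet<>(Arrays.asList("O", "MISC"));

  // ner: existing NER label per token (null if unlabeled);
  // [start, end): tokens covered by the new match;
  // overwritable: the third column of the matching rule.
  static boolean shouldOverwrite(String[] ner, int start, int end, Set<String> overwritable) {
    String first = ner[start];
    for (int i = start + 1; i < end; i++) {
      if (!Objects.equals(ner[i], first)) {
        return true; // inconsistent old labels inside the match: replace
      }
    }
    if (first == null || BACKGROUND.contains(first)) {
      return true; // unlabeled or background tokens are always replaced
    }
    boolean insideOldPhrase =
        (start > 0 && first.equals(ner[start - 1]))
            || (end < ner.length && first.equals(ner[end]));
    if (insideOldPhrase) {
      return false; // match covers only part of an existing NER phrase: keep it
    }
    return overwritable.contains(first);
  }
}
```

Under these assumptions, the two documented examples behave as described: a match over `The/O ABC/MISC Company/ORG` is relabeled, while a match on `ABC` alone inside an existing three-token phrase is left untouched.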

Configuration:

Annotator configuration
Field	Description	Default
`mapping`	Comma separated list of mapping files to use	`edu/stanford/nlp/models/kbp/regexner_caseless.tab`
`mapping.header`	Comma separated list of header fields (or `true` if header is specified in the file)	pattern,ner,overwrite,priority,group
`mapping.field.<fieldname>`	Class mapping for annotation fields other than ner
`commonWords`	Comma separated list of files for common words to not annotate (in case your mapping isn't very clean)
`backgroundSymbol`	Comma separated list of NER labels to always replace	`O,MISC`
`posmatchtype`	How `validpospattern` is used to match the POS tags of the tokens. `MATCH_ALL_TOKENS` - All tokens have to match. `MATCH_AT_LEAST_ONE_TOKEN` - At least one token has to match. `MATCH_ONE_TOKEN_PHRASE_ONLY` - Only has to match for one-token phrases.	`MATCH_AT_LEAST_ONE_TOKEN`
`validpospattern`	Regular expression pattern for matching POS tags (with `.matches()`).
`noDefaultOverwriteLabels`	Comma separated list of output types for which default NER labels are not overwritten. For these types, the NER type is overwritten only if the matched expression has an NER type matching the overwritableType specified for the regex.
`ignoreCase`	If true, case is ignored	`false`
`verbose`	If true, turns on extra debugging messages.	`false`
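
The three `posmatchtype` values in the table can be sketched as follows. This is a hypothetical re-creation of the documented semantics, not the `PosMatchType` enum itself. The reading that `MATCH_ONE_TOKEN_PHRASE_ONLY` skips the POS check for longer phrases is an assumption, and the class and method names are invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical re-creation of the three documented modes; not the CoreNLP enum.
final class PosMatchSketch {

  enum Mode { MATCH_ALL_TOKENS, MATCH_AT_LEAST_ONE_TOKEN, MATCH_ONE_TOKEN_PHRASE_ONLY }

  // posTags: POS tags of the tokens covered by a candidate match.
  static boolean accepts(Mode mode, Pattern validPos, List<String> posTags) {
    if (validPos == null) {
      return true; // no validpospattern configured (the documented default)
    }
    switch (mode) {
      case MATCH_ALL_TOKENS:
        for (String tag : posTags) {
          if (!validPos.matcher(tag).matches()) {
            return false;
          }
        }
        return true;
      case MATCH_AT_LEAST_ONE_TOKEN:
        for (String tag : posTags) {
          if (validPos.matcher(tag).matches()) {
            return true;
          }
        }
        return false;
      case MATCH_ONE_TOKEN_PHRASE_ONLY:
        // Assumption: phrases longer than one token are not POS-checked.
        return posTags.size() != 1 || validPos.matcher(posTags.get(0)).matches();
      default:
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }
}
```

Note that the tags are tested with `matches()`, so the pattern has to cover the whole POS tag, in line with the `validpospattern` row of the table.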

You can specify a different header for each mapping file. Here is an example of mapping files with header declaration:

 properties.setProperty("ner.fine.regexner.mapping", "ignorecase=true, header=pattern overwrite priority, file1.tab;" + "ignorecase=true, file2.tab");

The header items are whitespace-separated and can be specified for only some of the files. Files MUST be separated with a semicolon. In the same way, it is possible to read the header from the first line of a specific file by setting header=true for that file.

 properties.setProperty("ner.fine.regexner.mapping", "ignorecase=true, header=true, file1.tab;" + "ignorecase=true, header=pattern overwrite priority group, file2.tab");
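
The structure of these mapping strings (semicolon-separated files, each with comma-separated `key=value` options followed by the file name) can be illustrated with a small stand-alone parser. This is only a sketch of the syntax shown in the two examples, not the annotator's own parsing code. `MappingSpecSketch` and `FileSpec` are hypothetical names, and file names that themselves contain `=`, `,` or `;` are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the per-file option syntax shown above; not CoreNLP code.
final class MappingSpecSketch {

  static final class FileSpec {
    final String file;
    final Map<String, String> options;

    FileSpec(String file, Map<String, String> options) {
      this.file = file;
      this.options = options;
    }
  }

  static List<FileSpec> parse(String mapping) {
    List<FileSpec> specs = new ArrayList<>();
    for (String entry : mapping.split(";")) {        // files are separated by ';'
      Map<String, String> options = new LinkedHashMap<>();
      String file = null;
      for (String part : entry.split(",")) {         // options are separated by ','
        String item = part.trim();
        if (item.isEmpty()) {
          continue;
        }
        int eq = item.indexOf('=');
        if (eq > 0) {
          options.put(item.substring(0, eq).trim(), item.substring(eq + 1).trim());
        } else {
          file = item; // an item without '=' is taken to be the file name
        }
      }
      if (file != null) {
        specs.add(new FileSpec(file, options));
      }
    }
    return specs;
  }
}
```

Applied to the first example string, the sketch yields two entries: `file1.tab` with `ignorecase=true` and the header `pattern overwrite priority`, and `file2.tab` with only `ignorecase=true`.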

Author: Angel Chang, Alberto Soragna (@alsora) (added per-file headers)

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`DEFAULT_BACKGROUND_SYMBOL`
`static edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType`	`DEFAULT_POS_MATCH_TYPE`
`protected static java.lang.String`	`defaultHeader`
`protected static java.lang.String`	`GROUP_FIELD`
`protected static Redwood.RedwoodChannels`	`logger` A logger for this class
`protected static java.lang.String`	`OVERWRITE_FIELD`
`protected static java.lang.String`	`PATTERN_FIELD`
`protected static java.util.Set<java.lang.String>`	`predefinedHeaderFields`
`protected static java.lang.String`	`PRIORITY_FIELD`
`static PropertiesUtils.Property[]`	`SUPPORTED_PROPERTIES`
`protected static java.lang.String`	`WEIGHT_FIELD`

Constructor Summary

Constructors
Constructor and Description
`TokensRegexNERAnnotator(java.lang.String mapping)` Construct a new TokensRegexNERAnnotator.
`TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase)`
`TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase, java.lang.String validPosRegex)`
`TokensRegexNERAnnotator(java.lang.String name, java.util.Properties properties)`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`annotate(Annotation annotation)` Given an Annotation, perform a task on this Annotation.
`java.util.Set<java.lang.Class<? extends CoreAnnotation>>`	`requirementsSatisfied()` Returns a set of requirements for which tasks this annotator can provide.
`java.util.Set<java.lang.Class<? extends CoreAnnotation>>`	`requires()` Returns the set of tasks which this annotator requires in order to perform.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface edu.stanford.nlp.pipeline.Annotator
exactRequirements, unmount

Field Detail

logger

protected static final Redwood.RedwoodChannels logger

A logger for this class

PATTERN_FIELD

protected static final java.lang.String PATTERN_FIELD

See Also: Constant Field Values

OVERWRITE_FIELD

protected static final java.lang.String OVERWRITE_FIELD

See Also: Constant Field Values

PRIORITY_FIELD

protected static final java.lang.String PRIORITY_FIELD

See Also: Constant Field Values

WEIGHT_FIELD

protected static final java.lang.String WEIGHT_FIELD

See Also: Constant Field Values

GROUP_FIELD

protected static final java.lang.String GROUP_FIELD

See Also: Constant Field Values

predefinedHeaderFields

protected static final java.util.Set<java.lang.String> predefinedHeaderFields

defaultHeader

protected static final java.lang.String defaultHeader

See Also: Constant Field Values

DEFAULT_POS_MATCH_TYPE

public static final edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType DEFAULT_POS_MATCH_TYPE

DEFAULT_BACKGROUND_SYMBOL

public static final java.lang.String DEFAULT_BACKGROUND_SYMBOL

See Also: Constant Field Values

SUPPORTED_PROPERTIES

public static PropertiesUtils.Property[] SUPPORTED_PROPERTIES

Constructor Detail

TokensRegexNERAnnotator
```
public TokensRegexNERAnnotator(java.lang.String mapping)
```
Construct a new TokensRegexNERAnnotator.

Parameters:

mapping - A comma-separated list of files, URLs, or classpath resources to load mappings from

TokensRegexNERAnnotator

public TokensRegexNERAnnotator(java.lang.String mapping,
                               boolean ignoreCase)

TokensRegexNERAnnotator

public TokensRegexNERAnnotator(java.lang.String mapping,
                               boolean ignoreCase,
                               java.lang.String validPosRegex)

TokensRegexNERAnnotator

public TokensRegexNERAnnotator(java.lang.String name,
                               java.util.Properties properties)

Method Detail
- annotate
```
public void annotate(Annotation annotation)
```
  Description copied from interface: Annotator
  
  Given an Annotation, perform a task on this Annotation.
  
  Specified by:
  
  annotate in interface Annotator
- requires
```
public java.util.Set<java.lang.Class<? extends CoreAnnotation>> requires()
```
  Description copied from interface: Annotator
  
  Returns the set of tasks which this annotator requires in order to perform. For example, the POS annotator will return "tokenize", "ssplit".
  
  Specified by:
  
  requires in interface Annotator
- requirementsSatisfied
```
public java.util.Set<java.lang.Class<? extends CoreAnnotation>> requirementsSatisfied()
```
  Description copied from interface: Annotator
  
  Returns a set of requirements for which tasks this annotator can provide. For example, the POS annotator will return "pos".
  
  Specified by:
  
  requirementsSatisfied in interface Annotator
