public class TokensRegexNERAnnotator extends Object implements Annotator
TokensRegexNERAnnotator labels tokens with types based on a simple manual mapping from regular expressions to the types of the entities they are meant to describe. The user provides a file formatted as follows:
regex1    TYPE    overwritableType1,Type2...    priority
regex2    TYPE    overwritableType1,Type2...    priority
...
where each argument is tab-separated, and the last two arguments are optional. Several regexes can be associated with a single type. When multiple regexes match a phrase, the priority ranking is used to choose between the possible types. When the priorities are equal, longer matches are favored. This classifier is designed to be used as part of a full NER system to label entities that don't fall into the usual NER categories. It records a label only if the token has not already been NER-annotated, or if it has been annotated but its NER type has been designated overwritable (the third argument).
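As an illustration of the file format, the following sketch splits one mapping line into its four columns and compares two candidate matches by priority and then by length. `MappingEntry`, `parse`, and `preferFirst` are hypothetical names introduced here and are not part of CoreNLP. The default priority of 0 and the rule that a larger number means a higher priority are assumptions, since the text above does not state them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one mapping-file line; not a CoreNLP class. */
final class MappingEntry {
    final String regex;
    final String type;
    final List<String> overwritableTypes;
    final double priority;

    private MappingEntry(String regex, String type, List<String> overwritableTypes, double priority) {
        this.regex = regex;
        this.type = type;
        this.overwritableTypes = overwritableTypes;
        this.priority = priority;
    }

    /** Splits one tab-separated line; the third and fourth columns are optional. */
    static MappingEntry parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2 || cols.length > 4) {
            throw new IllegalArgumentException("Expected 2 to 4 tab-separated columns: " + line);
        }
        List<String> overwritable = cols.length > 2 && !cols[2].trim().isEmpty()
                ? Arrays.asList(cols[2].trim().split("\\s*,\\s*"))
                : Collections.<String>emptyList();
        // Assumption: a missing priority column is treated as 0.
        double priority = cols.length > 3 ? Double.parseDouble(cols[3].trim()) : 0.0;
        return new MappingEntry(cols[0].trim(), cols[1].trim(), overwritable, priority);
    }

    /** Returns true if the first candidate wins: priority first, then the longer match. */
    static boolean preferFirst(double priority1, int length1, double priority2, int length2) {
        if (priority1 != priority2) {
            return priority1 > priority2; // assumption: larger number = higher priority
        }
        return length1 >= length2;
    }

    public static void main(String[] args) {
        MappingEntry e = parse("Stanford\tSCHOOL\tO,MISC\t2.0");
        System.out.println(e.regex + " -> " + e.type + " " + e.overwritableTypes + " " + e.priority);
    }
}
```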
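The first column regex may follow one of two formats:
1. A TokensRegex expression. See TokenSequencePattern for TokensRegex syntax.
Example: ( /University/ /of/ [ {ner:LOCATION} ] ) SCHOOL
2. A sequence of regular expressions separated by whitespace, each matched against one token.
Example: Stanford SCHOOL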
This annotator is similar to RegexNERAnnotator but uses TokensRegex as the underlying library for matching regular expressions. This allows for more flexibility in the types of expressions matched and makes use of any optimizations included in the TokensRegex library.
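Main differences from RegexNERAnnotator:
- Whether an NER annotation is overwritten depends on the original NER labels. If the found expression overlaps a previous NER phrase, the NER labels are not replaced. Example: Old NER phrase: The ABC Company, Found Phrase: ABC => Old NER labels are not replaced.
- If the tokens of the found expression have inconsistent NER tags, the NER labels are replaced. Example: The/O ABC/MISC Company/ORG => The/ORG ABC/ORG Company/ORG
- How validpospattern is handled for POS tags is specified by PosMatchType.
- By default, there is no validPosPattern.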
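The two examples above can be read as a small decision procedure. The sketch below is one possible reading, not the CoreNLP implementation: `OverwriteRules`, `shouldOverwrite`, and the single `replaceable` set (labels that may always be replaced, such as O and MISC, plus any overwritable types for the rule) are hypothetical, and the order of the checks is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the overwrite rules; not the CoreNLP implementation. */
final class OverwriteRules {

    /**
     * @param ner         existing NER label of every token in the sentence
     * @param start       first token of the found expression (inclusive)
     * @param end         end of the found expression (exclusive)
     * @param replaceable labels that may be replaced (for example O and MISC)
     */
    static boolean shouldOverwrite(List<String> ner, int start, int end, Set<String> replaceable) {
        String first = ner.get(start);
        String last = ner.get(end - 1);
        // The found expression cuts through an existing NER phrase if a
        // non-replaceable label continues past either boundary.
        boolean cutsLeft = start > 0
                && !replaceable.contains(first) && first.equals(ner.get(start - 1));
        boolean cutsRight = end < ner.size()
                && !replaceable.contains(last) && last.equals(ner.get(end));
        if (cutsLeft || cutsRight) {
            return false; // old NER labels are not replaced
        }
        // Inconsistent tags inside the found expression: replace them all.
        for (int i = start + 1; i < end; i++) {
            if (!first.equals(ner.get(i))) {
                return true;
            }
        }
        // One consistent tag: replace it only if it is designated replaceable.
        return replaceable.contains(first);
    }

    public static void main(String[] args) {
        Set<String> replaceable = new HashSet<>(Arrays.asList("O", "MISC"));
        // "The ABC Company" is already ORG; the found phrase is "ABC".
        System.out.println(shouldOverwrite(Arrays.asList("ORG", "ORG", "ORG"), 1, 2, replaceable));
        // The/O ABC/MISC Company/ORG; the found phrase covers all three tokens.
        System.out.println(shouldOverwrite(Arrays.asList("O", "MISC", "ORG"), 0, 3, replaceable));
    }
}
```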
Configuration:
Field | Description | Default |
---|---|---|
mapping | Comma separated list of mapping files to use | edu/stanford/nlp/models/regexner/type_map_clean |
backgroundSymbol | Comma separated list of NER labels to always replace | O,MISC |
posmatchtype | How validpospattern should be used to match the POS of the tokens. MATCH_ALL_TOKENS: all tokens have to match. MATCH_AT_LEAST_ONE_TOKEN: at least one token has to match. MATCH_ONE_TOKEN_PHRASE_ONLY: only has to match for one-token phrases. | MATCH_AT_LEAST_ONE_TOKEN |
validpospattern | Regular expression pattern for matching POS tags. | |
noDefaultOverwriteLabels | Comma separated list of output types for which default NER labels are not overwritten. For these types, the NER type is overwritten only if the matched expression has an NER type matching the overwritableType specified for the regex. | |
ignoreCase | If true, case is ignored | false |
verbose | If true, turns on extra debugging messages. | false |
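To make the three posmatchtype values concrete, here is a minimal sketch of how each could be evaluated against the POS tags of a matched phrase. The enum below is a stand-in that only mirrors the documented option names, and `PosMatchSketch` and `posOk` are hypothetical. Using a full match (`matches()`) rather than a partial one, and skipping the check for multi-token phrases under MATCH_ONE_TOKEN_PHRASE_ONLY, are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of the posmatchtype options; not CoreNLP code. */
final class PosMatchSketch {

    /** Stand-in that mirrors the documented option names. */
    enum PosMatchType { MATCH_ALL_TOKENS, MATCH_AT_LEAST_ONE_TOKEN, MATCH_ONE_TOKEN_PHRASE_ONLY }

    static boolean posOk(List<String> posTags, Pattern validPos, PosMatchType type) {
        if (validPos == null) {
            return true; // no validpospattern configured, so nothing to check
        }
        switch (type) {
            case MATCH_ALL_TOKENS:
                for (String tag : posTags) {
                    if (!validPos.matcher(tag).matches()) {
                        return false;
                    }
                }
                return true;
            case MATCH_AT_LEAST_ONE_TOKEN:
                for (String tag : posTags) {
                    if (validPos.matcher(tag).matches()) {
                        return true;
                    }
                }
                return false;
            case MATCH_ONE_TOKEN_PHRASE_ONLY:
                // Assumption: phrases longer than one token skip the POS check.
                return posTags.size() != 1 || validPos.matcher(posTags.get(0)).matches();
            default:
                throw new AssertionError(type);
        }
    }

    public static void main(String[] args) {
        Pattern nouns = Pattern.compile("NN.*");
        List<String> tags = Arrays.asList("NNP", "IN", "NNP"); // e.g. University of Somewhere
        for (PosMatchType type : PosMatchType.values()) {
            System.out.println(type + ": " + posOk(tags, nouns, type));
        }
    }
}
```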
Nested classes/interfaces inherited from interface Annotator: Annotator.Requirement
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_BACKGROUND_SYMBOL |
static edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType | DEFAULT_POS_MATCH_TYPE |
protected static Redwood.RedwoodChannels | logger |
static PropertiesUtils.Property[] | SUPPORTED_PROPERTIES |
Fields inherited from interface Annotator:
BINARIZED_TREES_REQUIREMENT, CLEAN_XML_REQUIREMENT, COLUMN_DATA_CLASSIFIER, DETERMINISTIC_COREF_REQUIREMENT, GENDER_REQUIREMENT, GUTIME_REQUIREMENT, HEIDELTIME_REQUIREMENT, LEMMA_REQUIREMENT, NER_REQUIREMENT, NUMBER_REQUIREMENT, PARSE_AND_TAG, PARSE_REQUIREMENT, PARSE_TAG_BINARIZED_TREES, POS_REQUIREMENT, QUANTIFIABLE_ENTITY_NORMALIZATION_REQUIREMENT, RELATION_EXTRACTOR_REQUIREMENT, SSPLIT_REQUIREMENT, STANFORD_CLEAN_XML, STANFORD_COLUMN_DATA_CLASSIFIER, STANFORD_DEPENDENCIES, STANFORD_DETERMINISTIC_COREF, STANFORD_GENDER, STANFORD_LEMMA, STANFORD_NER, STANFORD_PARSE, STANFORD_POS, STANFORD_REGEXNER, STANFORD_RELATION, STANFORD_SENTIMENT, STANFORD_SSPLIT, STANFORD_TOKENIZE, STANFORD_TRUECASE, STEM_REQUIREMENT, SUTIME_REQUIREMENT, TIME_WORDS_REQUIREMENT, TOKENIZE_AND_SSPLIT, TOKENIZE_REQUIREMENT, TOKENIZE_SSPLIT_NER, TOKENIZE_SSPLIT_PARSE, TOKENIZE_SSPLIT_PARSE_NER, TOKENIZE_SSPLIT_POS, TOKENIZE_SSPLIT_POS_LEMMA, TRUECASE_REQUIREMENT
Constructor and Description |
---|
TokensRegexNERAnnotator(String mapping) Construct a new TokensRegexNERAnnotator. |
TokensRegexNERAnnotator(String mapping, boolean ignoreCase) |
TokensRegexNERAnnotator(String mapping, boolean ignoreCase, String validPosRegex) |
TokensRegexNERAnnotator(String name, Properties properties) |
Modifier and Type | Method and Description |
---|---|
void | annotate(Annotation annotation) Given an Annotation, perform a task on this Annotation. |
Set<Annotator.Requirement> | requirementsSatisfied() Returns the set of requirements that this annotator can provide. |
Set<Annotator.Requirement> | requires() Returns the set of tasks that this annotator requires in order to run. |
protected static final Redwood.RedwoodChannels logger
public static final edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType DEFAULT_POS_MATCH_TYPE
public static final String DEFAULT_BACKGROUND_SYMBOL
public static PropertiesUtils.Property[] SUPPORTED_PROPERTIES
public TokensRegexNERAnnotator(String mapping)
mapping - A comma-separated list of files, URLs, or classpath resources to load mappings from
public TokensRegexNERAnnotator(String mapping, boolean ignoreCase)
public TokensRegexNERAnnotator(String mapping, boolean ignoreCase, String validPosRegex)
public TokensRegexNERAnnotator(String name, Properties properties)
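For the (String name, Properties properties) constructor, a configuration could be assembled along the following lines. This is a sketch only: `RegexNerConfigSketch` and `build` are hypothetical, the file names in the usage are made up, and the convention that each field from the Configuration table is looked up under the key "name.field" is an assumption rather than something stated on this page. The block stops at building the Properties so that it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

/** Hypothetical helper that prepares properties for the annotator; not CoreNLP code. */
final class RegexNerConfigSketch {

    /** Builds properties for an annotator registered under the given name. */
    static Properties build(String name, String mappingFiles) {
        Properties props = new Properties();
        // Assumption: fields from the Configuration table are read as "<name>.<field>".
        props.setProperty(name + ".mapping", mappingFiles);
        props.setProperty(name + ".ignoreCase", "true");
        props.setProperty(name + ".validpospattern", "^(NN|JJ).*");
        props.setProperty(name + ".posmatchtype", "MATCH_ALL_TOKENS");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical mapping files, given as a comma-separated list.
        Properties props = build("myregexner", "schools.tab,companies.tab");
        props.list(System.out);
        // With CoreNLP available, the result would be handed to the documented
        // constructor: new TokensRegexNERAnnotator("myregexner", props)
    }
}
```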
public void annotate(Annotation annotation)
Specified by: annotate in interface Annotator
public Set<Annotator.Requirement> requires()
Specified by: requires in interface Annotator
public Set<Annotator.Requirement> requirementsSatisfied()
Specified by: requirementsSatisfied in interface Annotator