public class TokensRegexNERAnnotator extends java.lang.Object implements Annotator
The mapping file is formatted as follows:

    regex1    TYPE    overwritableType1,Type2...    priority
    regex2    TYPE    overwritableType1,Type2...    priority
    ...

where each argument is tab-separated, and the last two arguments are optional. Several regexes can be associated with a single type. When multiple regexes match a phrase, the priority ranking (higher priority is favored) is used to choose between the possible types. When the priority is the same, longer matches are favored.
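The row format and the tie-breaking rule above can be sketched in plain Java. This is only an illustration, not the CoreNLP implementation: the class and method names (MappingRowSketch, Row, Match, parse, choose) are hypothetical, and the default priority of 0 for rows that omit the column is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; none of these names exist in CoreNLP.
class MappingRowSketch {

    /** One parsed row of a mapping file. */
    static final class Row {
        final String regex;
        final String type;
        final List<String> overwritableTypes;
        final double priority;

        Row(String regex, String type, List<String> overwritableTypes, double priority) {
            this.regex = regex;
            this.type = type;
            this.overwritableTypes = overwritableTypes;
            this.priority = priority;
        }
    }

    /** A candidate match: the row that fired and the number of tokens it covered. */
    static final class Match {
        final Row row;
        final int tokenLength;

        Match(Row row, int tokenLength) {
            this.row = row;
            this.tokenLength = tokenLength;
        }
    }

    /** Splits a tab-separated row; the third and fourth columns are optional. */
    static Row parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least regex and TYPE: " + line);
        }
        List<String> overwritable = new ArrayList<>();
        if (cols.length > 2 && !cols[2].trim().isEmpty()) {
            overwritable.addAll(Arrays.asList(cols[2].trim().split(",")));
        }
        // Assumption: a missing priority column is treated as 0.
        double priority = cols.length > 3 ? Double.parseDouble(cols[3].trim()) : 0.0;
        return new Row(cols[0], cols[1].trim(), overwritable, priority);
    }

    /** Higher priority wins; when priorities are equal, the longer match wins. */
    static Match choose(List<Match> candidates) {
        return candidates.stream()
                .max(Comparator.<Match>comparingDouble(m -> m.row.priority)
                        .thenComparingInt(m -> m.tokenLength))
                .orElse(null);
    }
}
```

For instance, a one-token match from a row with priority 2.0 would be chosen over a three-token match from a row with no priority column, because priority is compared before length.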
This annotator is designed to be used as part of a full
NER system to label entities that don't fall into the usual NER categories. It only records the label
if the token has not already been NER-annotated, or it has been annotated but the NER-type has been
designated overwritable (the third argument).
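The labeling condition described above can be written as a small predicate. This is a minimal sketch, not the library's code: OverwriteRuleSketch and mayLabel are hypothetical names, and treating a null tag or the tag O as "not already NER-annotated" is an assumption.

```java
import java.util.Set;

// Illustrative sketch only; these names are not part of CoreNLP.
class OverwriteRuleSketch {

    /**
     * Returns true if a new label may be recorded on a token whose current
     * NER tag is currentNer, given the overwritable types listed in the
     * third column of the matching row.
     */
    static boolean mayLabel(String currentNer, Set<String> overwritableTypes) {
        // Assumption: null or "O" means the token has not been NER-annotated.
        if (currentNer == null || currentNer.equals("O")) {
            return true;
        }
        // Otherwise the existing type must have been designated overwritable.
        return overwritableTypes.contains(currentNer);
    }
}
```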
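It is also possible to use this annotator to annotate fields other than the NamedEntityTagAnnotation field by providing a class mapping for the field (mapping.field.<fieldname>) and providing the header (mapping.header); see Configuration.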
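The first column regex may follow one of two formats:
1. A TokensRegex expression, written in parentheses. See TokenSequencePattern for TokensRegex syntax.
   Example: ( /University/ /of/ [ {ner:LOCATION} ] ) SCHOOL
2. A sequence of regular expressions over the token text, separated by whitespace.
   Example: Stanford SCHOOL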
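To make the plain (non-TokensRegex) form concrete, here is a rough sketch that assumes an entry such as Stanford is a whitespace-separated sequence of regexes, each of which must fully match one token in a run of successive tokens. It uses only java.util.regex and is not how the TokensRegex library matches internally; TokenSequenceRegexSketch, compile, and find are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; not the TokensRegex matching engine.
class TokenSequenceRegexSketch {

    /** Compiles a whitespace-separated entry into one pattern per token. */
    static List<Pattern> compile(String entry) {
        List<Pattern> patterns = new ArrayList<>();
        for (String part : entry.trim().split("\\s+")) {
            patterns.add(Pattern.compile(part));
        }
        return patterns;
    }

    /**
     * Returns the start index of the first window of tokens in which every
     * token fully matches the pattern at the same offset, or -1 if none.
     */
    static int find(List<Pattern> patterns, List<String> tokens) {
        for (int start = 0; start + patterns.size() <= tokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < patterns.size(); i++) {
                if (!patterns.get(i).matcher(tokens.get(start + i)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return start;
            }
        }
        return -1;
    }
}
```

Because the input is already tokenized, a regex can never span a token boundary in this sketch; each one sees exactly one token's text.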
This annotator is similar to RegexNERAnnotator but uses TokensRegex as the underlying library for matching regular expressions. This allows more flexibility in the types of expressions matched and makes use of any optimization included in the TokensRegex library.
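Main differences from RegexNERAnnotator:
- Supports annotation of fields other than the NamedEntityTagAnnotation field.
- Supports both TokensRegex patterns and patterns over the text of the tokens.
- NER annotations can be overwritten based on the original NER labels:
  - If the found expression overlaps with a previous NER phrase, the NER labels are not replaced. Example: Old NER phrase: The ABC Company, Found Phrase: ABC => Old NER labels are not replaced.
  - If the found expression has inconsistent NER tags among its tokens, the NER labels are replaced. Example: The/O ABC/MISC Company/ORG => The/ORG ABC/ORG Company/ORG
- How validpospattern is handled for POS tags is specified by PosMatchType.
- By default, there is no validPosPattern.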
Configuration:
Field | Description | Default |
---|---|---|
mapping | Comma-separated list of mapping files to use | edu/stanford/nlp/models/kbp/regexner_caseless.tab |
mapping.header | Comma-separated list of header fields (or true if the header is specified in the file) | pattern,ner,overwrite,priority,group |
mapping.field.<fieldname> | Class mapping for annotation fields other than ner | |
commonWords | Comma-separated list of files of common words not to annotate (in case your mapping isn't very clean) | |
backgroundSymbol | Comma-separated list of NER labels to always replace | O,MISC |
posmatchtype | How validpospattern should be used to match the POS of the tokens. MATCH_ALL_TOKENS: all tokens have to match. MATCH_AT_LEAST_ONE_TOKEN: at least one token has to match. MATCH_ONE_TOKEN_PHRASE_ONLY: only has to match for one-token phrases. | MATCH_AT_LEAST_ONE_TOKEN |
validpospattern | Regular expression pattern for matching POS tags (with .matches()) | |
noDefaultOverwriteLabels | Comma-separated list of output types for which default NER labels are not overwritten. For these types, the NER type is overwritten only if the matched expression has an NER type matching the overwritableType specified for the regex. | |
ignoreCase | If true, case is ignored | false |
verbose | If true, turns on extra debugging messages. | false |
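The three posmatchtype values in the table can be read as the following predicate over the POS tags of a matched phrase. This is a sketch under stated assumptions rather than the annotator's actual code: the enum here is a local stand-in that only mirrors the documented constant names, posAccepts is a hypothetical helper, and letting multi-token phrases pass unchecked under MATCH_ONE_TOKEN_PHRASE_ONLY is an interpretation of "only has to match for one-token phrases".

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; the enum is a local stand-in, not the CoreNLP type.
class PosMatchSketch {

    enum PosMatchType { MATCH_ALL_TOKENS, MATCH_AT_LEAST_ONE_TOKEN, MATCH_ONE_TOKEN_PHRASE_ONLY }

    /** Decides whether a phrase with the given POS tags passes the POS filter. */
    static boolean posAccepts(Pattern validPos, PosMatchType type, List<String> posTags) {
        if (validPos == null) {
            return true; // no pattern configured, so nothing to check
        }
        switch (type) {
            case MATCH_ALL_TOKENS:
                return posTags.stream().allMatch(t -> validPos.matcher(t).matches());
            case MATCH_AT_LEAST_ONE_TOKEN:
                return posTags.stream().anyMatch(t -> validPos.matcher(t).matches());
            case MATCH_ONE_TOKEN_PHRASE_ONLY:
                // Assumption: phrases of more than one token are not checked.
                if (posTags.size() != 1) {
                    return true;
                }
                return validPos.matcher(posTags.get(0)).matches();
            default:
                throw new IllegalArgumentException("Unknown type: " + type);
        }
    }
}
```

Note that the tags are tested with matches(), as the table states, so the pattern has to cover the whole tag: NN.* accepts NNP, while NN alone does not.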
You can specify a different header for each mapping file. Here is an example of mapping files with a header declaration:

    properties.setProperty("ner.fine.regexner.mapping", "ignorecase=true, header=pattern overwrite priority, file1.tab;" + "ignorecase=true, file2.tab");

The header items are whitespace-separated, and a header can also be specified for only one of the files. Files MUST be separated with a semicolon. In the same way, the header can be read from the first line of a specific file:

    properties.setProperty("ner.fine.regexner.mapping", "ignorecase=true, header=true, file1.tab;" + "ignorecase=true, header=pattern overwrite priority group, file2.tab");
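The structure of the mapping value shown above (entries separated by semicolons, each holding comma-separated key=value options plus a file name) can be illustrated with a small parser. This is a sketch of the format as described here, not the parser CoreNLP uses; MappingSpecSketch and FileSpec are hypothetical names, and the sketch assumes file names contain no comma, semicolon, or equals sign.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the CoreNLP parser.
class MappingSpecSketch {

    /** One mapping file together with the options declared for it. */
    static final class FileSpec {
        final String file;
        final Map<String, String> options;

        FileSpec(String file, Map<String, String> options) {
            this.file = file;
            this.options = options;
        }
    }

    /** Splits the property value into per-file entries and their options. */
    static List<FileSpec> parse(String spec) {
        List<FileSpec> result = new ArrayList<>();
        for (String entry : spec.split(";")) {
            if (entry.trim().isEmpty()) {
                continue;
            }
            Map<String, String> options = new LinkedHashMap<>();
            String file = null;
            for (String part : entry.split(",")) {
                String p = part.trim();
                int eq = p.indexOf('=');
                if (eq > 0) {
                    // key=value option such as ignorecase=true or header=...
                    options.put(p.substring(0, eq).trim(), p.substring(eq + 1).trim());
                } else if (!p.isEmpty()) {
                    // Anything without '=' is taken to be the file name.
                    file = p;
                }
            }
            result.add(new FileSpec(file, options));
        }
        return result;
    }
}
```

Applied to the first example value, this would yield two entries: file1.tab with a header option of "pattern overwrite priority", and file2.tab with no header option.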
Modifier and Type | Field and Description |
---|---|
static java.lang.String | DEFAULT_BACKGROUND_SYMBOL |
static edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType | DEFAULT_POS_MATCH_TYPE |
protected static java.lang.String | defaultHeader |
protected static java.lang.String | GROUP_FIELD |
protected static Redwood.RedwoodChannels | logger: A logger for this class |
protected static java.lang.String | OVERWRITE_FIELD |
protected static java.lang.String | PATTERN_FIELD |
protected static java.util.Set<java.lang.String> | predefinedHeaderFields |
protected static java.lang.String | PRIORITY_FIELD |
static PropertiesUtils.Property[] | SUPPORTED_PROPERTIES |
protected static java.lang.String | WEIGHT_FIELD |
Fields inherited from interface Annotator: DEFAULT_REQUIREMENTS, STANFORD_CDC_TOKENIZE, STANFORD_CLEAN_XML, STANFORD_COLUMN_DATA_CLASSIFIER, STANFORD_COREF, STANFORD_COREF_MENTION, STANFORD_DEPENDENCIES, STANFORD_DETERMINISTIC_COREF, STANFORD_DOCDATE, STANFORD_ENTITY_MENTIONS, STANFORD_GENDER, STANFORD_KBP, STANFORD_LEMMA, STANFORD_LINK, STANFORD_MWT, STANFORD_NATLOG, STANFORD_NER, STANFORD_OPENIE, STANFORD_PARSE, STANFORD_POS, STANFORD_QUOTE, STANFORD_QUOTE_ATTRIBUTION, STANFORD_REGEXNER, STANFORD_RELATION, STANFORD_SENTIMENT, STANFORD_SSPLIT, STANFORD_TOKENIZE, STANFORD_TOKENSREGEX, STANFORD_TRUECASE, STANFORD_UD_FEATURES
Constructor and Description |
---|
TokensRegexNERAnnotator(java.lang.String mapping): Construct a new TokensRegexNERAnnotator. |
TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase) |
TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase, java.lang.String validPosRegex) |
TokensRegexNERAnnotator(java.lang.String name, java.util.Properties properties) |
Modifier and Type | Method and Description |
---|---|
void | annotate(Annotation annotation): Given an Annotation, perform a task on this Annotation. |
java.util.Set<java.lang.Class<? extends CoreAnnotation>> | requirementsSatisfied(): Returns a set of requirements for which tasks this annotator can provide. |
java.util.Set<java.lang.Class<? extends CoreAnnotation>> | requires(): Returns the set of tasks which this annotator requires in order to perform. |
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Annotator: exactRequirements, unmount
protected static final Redwood.RedwoodChannels logger
protected static final java.lang.String PATTERN_FIELD
protected static final java.lang.String OVERWRITE_FIELD
protected static final java.lang.String PRIORITY_FIELD
protected static final java.lang.String WEIGHT_FIELD
protected static final java.lang.String GROUP_FIELD
protected static final java.util.Set<java.lang.String> predefinedHeaderFields
protected static final java.lang.String defaultHeader
public static final edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.PosMatchType DEFAULT_POS_MATCH_TYPE
public static final java.lang.String DEFAULT_BACKGROUND_SYMBOL
public static PropertiesUtils.Property[] SUPPORTED_PROPERTIES
public TokensRegexNERAnnotator(java.lang.String mapping)
mapping - A comma-separated list of files, URLs, or classpath resources to load mappings from
public TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase)
public TokensRegexNERAnnotator(java.lang.String mapping, boolean ignoreCase, java.lang.String validPosRegex)
public TokensRegexNERAnnotator(java.lang.String name, java.util.Properties properties)
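For the (name, properties) constructor, the following sketch shows one way the Properties object might be assembled with the standard java.util.Properties API. The prefix ner.fine.regexner and the mapping key come from the mapping examples in the Configuration section; forming the other keys by appending a Configuration field name to the same prefix is an assumption, and RegexNerPropertiesSketch and build are hypothetical names.

```java
import java.util.Properties;

// Illustrative sketch only; key naming beyond "<prefix>.mapping" is assumed.
class RegexNerPropertiesSketch {

    /** Builds properties for an annotator registered under the given prefix. */
    static Properties build(String prefix, String mappingFiles) {
        Properties props = new Properties();
        // Key shown in the Configuration examples: <prefix>.mapping
        props.setProperty(prefix + ".mapping", mappingFiles);
        // Assumption: other fields from the Configuration table use the same prefix.
        props.setProperty(prefix + ".backgroundSymbol", "O,MISC");
        props.setProperty(prefix + ".verbose", "false");
        return props;
    }
}
```

The resulting object would then be passed as the properties argument, with the same prefix presumably given as name.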
public void annotate(Annotation annotation)
Specified by: annotate in interface Annotator
public java.util.Set<java.lang.Class<? extends CoreAnnotation>> requires()
Specified by: requires in interface Annotator
public java.util.Set<java.lang.Class<? extends CoreAnnotation>> requirementsSatisfied()
Specified by: requirementsSatisfied in interface Annotator