


TokensRegexAnnotator is a customizable annotator for the StanfordCoreNLP pipeline. It is part of TokensRegex, a framework for defining patterns over text and mapping them to semantic objects represented as Java objects.

By using the TokensRegexAnnotator, you can customize annotations based on regular expressions over sequences of tokens. It uses TokensRegex rules to define what patterns to match and what to annotate.
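
To illustrate what "regular expressions over sequences of tokens" means, here is a toy Java sketch that uses only java.util.regex. It is not the TokensRegex implementation or API: TokenSequenceSketch and findMatches are hypothetical names, and the sketch only checks each token against one pattern per position, whereas TokensRegex rules support far richer patterns and annotation actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy illustration only (hypothetical class, not part of TokensRegex).
final class TokenSequenceSketch {

    // Returns every start index at which the k-th pattern fully matches
    // the k-th token of the window, for all patterns in the sequence.
    static List<Integer> findMatches(List<String> tokens, List<Pattern> sequence) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + sequence.size() <= tokens.size(); i++) {
            boolean allMatch = true;
            for (int j = 0; j < sequence.size(); j++) {
                if (!sequence.get(j).matcher(tokens.get(i + j)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "The", "dark", "blue", "car", "passed", "a", "light", "green", "van");
        // One regular expression per token position: a shade word, then a color word.
        List<Pattern> sequence = List.of(
                Pattern.compile("light|dark", Pattern.CASE_INSENSITIVE),
                Pattern.compile("red|green|blue", Pattern.CASE_INSENSITIVE));
        System.out.println(findMatches(tokens, sequence)); // prints [1, 6]
    }
}
```

The point of the sketch is the unit of matching: each regular expression applies to a whole token, and the pattern as a whole applies to a run of consecutive tokens rather than to a run of characters.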

If you only want to use TokensRegex to recognize named entities using regular expressions, use the TokensRegexNERAnnotator instead.


To add a TokensRegexAnnotator to the pipeline:
  1. Create a rules file (see SequenceMatchRules for the format of the rules file).
    Example: color.rules.txt
    Rules that tag color words with ner="COLOR" and normalized set to the hex RGB string
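  2. Configure the TokensRegexAnnotator in a properties file
    customAnnotatorClass.[name] = edu.stanford.nlp.pipeline.TokensRegexAnnotator
    [name].rules = [path to rules file]
    Example:
    customAnnotatorClass.color = edu.stanford.nlp.pipeline.TokensRegexAnnotator
    color.rules = color.rules.txt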
  3. Add the annotator to the pipeline
    Input: color.input.txt
    java -cp stanford-corenlp-2012-05-22.jar:stanford-corenlp-2012-05-22-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,color -properties [path to properties file] -file color.input.txt

See TokensRegex Extraction Pipeline and TokensRegex Rules for more information on how rules are specified and expressions are matched using TokensRegex.
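
For programmatic use, the settings from steps 2 and 3 can also be assembled as a java.util.Properties object. The following is a minimal sketch under stated assumptions, not an official setup: ColorAnnotatorConfig and buildProperties are hypothetical names, and the customAnnotatorClass.[name] key reflects CoreNLP's mechanism for registering a custom annotator class under a name, which you should verify against your CoreNLP version. The block uses only the Java standard library, so it runs without the CoreNLP jars.

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper class for illustration; not part of CoreNLP.
final class ColorAnnotatorConfig {

    // Builds pipeline properties that register a TokensRegexAnnotator under `name`.
    static Properties buildProperties(String name, String rulesPath) {
        Properties props = new Properties();
        // Assumption: CoreNLP resolves custom annotator names through this key.
        props.setProperty("customAnnotatorClass." + name,
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Step 2: point the annotator at its rules file.
        props.setProperty(name + ".rules", rulesPath);
        // Step 3: run the annotator after tokenization and sentence splitting.
        props.setProperty("annotators", "tokenize,ssplit," + name);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties("color", "color.rules.txt");
        // Print the entries in sorted key order so they can be copied into a properties file.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

With the CoreNLP jars on the classpath, the resulting Properties object could be passed to the StanfordCoreNLP constructor (new StanfordCoreNLP(props)) instead of using the -properties command-line option.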