Stanford TokensRegexAnnotator

About

TokensRegexAnnotator is a customizable annotator for the StanfordCoreNLP pipeline. It is part of TokensRegex, a framework for defining patterns over text and mapping them to semantic objects represented as Java objects.

By using the TokensRegexAnnotator, you can customize annotations based on regular expressions over sequences of tokens. It uses TokensRegex rules to define what patterns to match and what to annotate.

If you only want to use TokensRegex to recognize named entities using regular expressions, then you should use the TokensRegexNERAnnotator instead.

Usage

To add a TokensRegexAnnotator to the pipeline:
  1. Create a rules file (see SequenceMatchRules for the format of the rules file).
    Example: color.rules.txt
    Rules that tag color words with ner="COLOR" and normalized set to the hex RGB string
  2. Configure the TokensRegexAnnotator in a properties file:
    customAnnotatorClass.[name]=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    [name].rules = [path to rules file]
    Example:
    Configuration: color.properties
    customAnnotatorClass.color=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    color.rules = color.rules.txt 
  3. Add the annotator to the pipeline by including its name in the -annotators list
    Example:
    Input: color.input.txt
    Command:
    java -cp stanford-corenlp-2012-05-22.jar:stanford-corenlp-2012-05-22-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,color -properties color.properties -file color.input.txt
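The same configuration can also be assembled in Java rather than in a properties file. The following is a minimal sketch, not code from the CoreNLP distribution: the class name ColorPipelineConfig and its build method are hypothetical, and the sketch uses only java.util.Properties so that it runs without the CoreNLP jars. It sets the two keys from color.properties and adds the annotators key, mirroring the -annotators argument of the command above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper class; the name is illustrative and not part of CoreNLP.
class ColorPipelineConfig {

    // Builds the same settings as color.properties, plus the annotator list
    // that the example command passes with -annotators.
    static Properties build(String rulesPath) {
        Properties props = new Properties();
        // Register the custom annotator under the name "color".
        props.setProperty("customAnnotatorClass.color",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the "color" annotator at its rules file.
        props.setProperty("color.rules", rulesPath);
        // Run tokenization and sentence splitting before the custom annotator.
        props.setProperty("annotators", "tokenize,ssplit,color");
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = build("color.rules.txt");
        // Print the settings in .properties format for inspection.
        StringWriter out = new StringWriter();
        props.store(out, "color.properties (sketch)");
        System.out.print(out);
    }
}
```

Assuming the CoreNLP jars from the command above are on the classpath, the resulting Properties object could be passed to the StanfordCoreNLP(Properties) constructor in place of the -properties and -annotators command-line arguments.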
See TokensRegex Extraction Pipeline and TokensRegex Rules for more information on how rules are specified and expressions are matched using TokensRegex.