Stanford TokensRegex


About

TokensRegex is a generic framework, included in Stanford CoreNLP, for defining patterns over text and mapping the matched text to semantic objects represented as Java objects.

TokensRegex was used to develop SUTime, a rule-based temporal tagger for recognizing and normalizing temporal expressions. An included set of slides and the javadoc for TokenSequencePattern provide an overview of this package.

TokensRegex was written by Angel Chang. These programs also rely on classes developed by others as part of the Stanford JavaNLP project.

Usage

TokensRegex Annotator

TokensRegex can be used as an annotator in the StanfordCoreNLP pipeline. The TokensRegexAnnotator lets you customize annotations based on regular expressions over sequences of tokens. To add a TokensRegexAnnotator to the pipeline:
  1. Create a rules file (see SequenceMatchRules for the format of the rules file).
    Example: color.rules.txt
    Rules that tag color words with ner="COLOR" and normalized=hex RGB string
  2. Configure the TokensRegexAnnotator
    customAnnotatorClass.[name]=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    [name].rules = [path to rules file] 
    Example:
    Configuration: color.properties
    customAnnotatorClass.color=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    color.rules = color.rules.txt 
  3. Add the annotator to the pipeline
    Example:
    Input: color.input.txt
    Command:
    java -cp stanford-corenlp-2012-05-22.jar:stanford-corenlp-2012-05-22-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,color -properties color.properties -file color.input.txt
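
The configuration in step 2 can also be assembled in code rather than in a properties file. The following is a minimal sketch using only java.util.Properties; the class name ColorPipelineConfig and the helper colorProperties are hypothetical names chosen for illustration, and the property keys are the ones shown in the steps above. It only builds and prints the properties; it assumes that, with the CoreNLP jars on the classpath, the resulting object would be handed to the StanfordCoreNLP constructor that accepts a Properties argument.

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper class (not part of CoreNLP): builds the same settings
// as color.properties plus the -annotators list from the command above.
final class ColorPipelineConfig {

    // name is the custom annotator name ("color" in the example);
    // rulesPath is the path to the TokensRegex rules file.
    static Properties colorProperties(String name, String rulesPath) {
        Properties props = new Properties();
        // Run the tokenizer and sentence splitter first, then the custom annotator.
        props.setProperty("annotators", "tokenize,ssplit," + name);
        // Register the custom annotator under the chosen name.
        props.setProperty("customAnnotatorClass." + name,
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the annotator at its rules file.
        props.setProperty(name + ".rules", rulesPath);
        return props;
    }

    public static void main(String[] args) {
        Properties props = colorProperties("color", "color.rules.txt");
        // Print the settings in sorted key order for inspection.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + " = " + props.getProperty(key));
        }
        // Assumption: with CoreNLP on the classpath, the next step would be
        // new StanfordCoreNLP(props), which is omitted here so that the
        // sketch compiles without the CoreNLP jars.
    }
}
```

Keeping the annotator name as a parameter mirrors the [name] placeholder in step 2, so the same helper could register a differently named TokensRegexAnnotator with its own rules file.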

Download

TokensRegex is integrated in the Stanford suite of NLP tools, StanfordCoreNLP. Please download the entire suite from this page.

Questions

Questions, feedback, and bug reports/fixes can be sent to our mailing lists.

Mailing Lists

We have three mailing lists for TokensRegex, all of which are shared with other JavaNLP tools (with the exception of the parser). Each address is at @lists.stanford.edu:

  1. java-nlp-user: This is the best list for asking questions, making announcements, or holding discussions among JavaNLP users. You have to subscribe to be able to use it. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
  2. java-nlp-announce: This list is used only to announce new versions of Stanford JavaNLP tools, so it is very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
  3. java-nlp-support: This list goes only to the software maintainers. It is a good address for licensing questions and similar matters. For general use and support questions, you are better off joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.

Release History

Version 1.3.2 2012-05-22 Added TokensRegexAnnotator, which can be configured using a rules file
Version 1.2.0 2011-09-14 Initial version of TokensRegex library