Stanford TokensRegex

About | Download | Usage | Questions | Mailing lists | Release history

About

TokensRegex is a generic framework included in Stanford CoreNLP for defining patterns over text (sequences of tokens) and mapping it to semantic objects represented as Java objects. TokensRegex emphasizes describing text as a sequence of tokens (words, punctuation marks, etc.), which may have additional attributes, and writing patterns over those tokens, rather than working at the character level, as with standard regular expression packages. For example, you might match names of people who are painters with a TokensRegex pattern like this:

 ([ner: PERSON]+) /was|is/ /an?/ []{0,3} /painter|artist/ 

As part of the framework, TokensRegex provides the following:

TokensRegex was used to develop SUTime, a rule-based temporal tagger for recognizing and normalizing temporal expressions. An included set of slides and the javadoc for TokenSequencePattern provide an overview of this package. Some additional information is available in some older slides.

If you use TokensRegex, please cite:

Angel X. Chang and Christopher D. Manning. 2014. TokensRegex: Defining cascaded regular expressions over tokens. Stanford University Technical Report, 2014. [bib]

TokensRegex was written by Angel Chang. These programs also rely on classes developed by others as part of the Stanford JavaNLP project.

Usage

Annotators

The Stanford CoreNLP pipeline provides two annotators that use TokensRegex to provide annotations based on regular expressions over tokens. They can be configured and added to the CoreNLP pipeline as custom annotators.
ClassGenerated AnnotationsDescription
TokensRegexNERAnnotator NamedEntityTagAnnotation Implements a simple, rule-based NER over token sequences using Java regular expressions. The goal of this Annotator is to provide a simple framework to incorporate NE labels that are not annotated in traditional NL corpora, but are easily recognized by rule-based techniques. For example, the default list of regular expressions that we distribute in the models file recognizes ideologies (IDEOLOGY), nationalities (NATIONALITY), religions (RELIGION), and titles (TITLE). This annotator is similar to RegexNERAnnotator but supports TokensRegex expressions as well as NER annotations using regular expressions specified in a file.
TokensRegexAnnotator Custom A more generic annotator that uses TokensRegex rules to define what patterns to match and what to annotate. This annotator is much more flexible than the TokensRegexNERAnnotator, but is also more complicated to use. It takes as input files of TokensRegex rules and extracts matched expression using the Extraction Pipeline.

Usage in Java code

TokensRegex can also be used programmatically with Java code.

TokensRegex Extraction Pipeline

The TokensRegex pipeline reads extraction rules from a file and applies the rules in stages. Each rule consists of a pattern to match, a possible action and result (see TokensRegex Rules for details on the format of the rules).

There are four types of extraction rules: text, tokens, composite and filter. In each stage, the extraction rules are applied as follows:

Within each sub-stage, overlapping matches are resolved based on the priority of the rules, length of the match, and finally the order in which the rules are specified.

TokensRegex Rules

The TokensRegex rules format is described in SequenceMatchRules. There are two types of rules: assignment rules which can be used to define variables for later use, and extraction rules used in the pipeline for matching expressions.

Extraction rules are specified in a JSON-like language.

Example:

{
  // ruleType is "text", "tokens", "composite", or "filter"
  ruleType: "tokens",
  
  // pattern to be matched  
  pattern: ( ( [ { ner:PERSON } ]) /was/ /born/ /on/ ([ { ner:DATE } ]) ),

  // value associated with the expression for which the pattern was matched
  // matched expressions are returned with "DATE_OF_BIRTH" as the value
  // (as part of the MatchedExpression class)
  result: "DATE_OF_BIRTH"
}

See SequenceMatchRules for more details on the fields that can go in an extraction rule definition.

As a short hand, extraction rules only has the pattern and result fields can be expressed as:

{ pattern => result }

For instance, the above rule can be rewritten as:
{  ( ( [ { ner:PERSON } ]) /was/ /born/ /on/ ([ { ner:DATE } ]) ) => "DATE_OF_BIRTH" }

There are four types of extraction rules: text, tokens, composite and filter.

Assignment rules are used to define variables for later use.

For example, to bind annotation keys:

tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

These variables can be used for matching against those annotation keys, or generating new annotations using those keys.

Assignment rules can also be used to bind TokensRegex patterns.

  $DAYOFWEEK = "/monday|tuesday|wednesday|thursday|friday|saturday|sunday/"
  $TIMEOFDAY = "/morning|afternoon|evening|night|noon|midnight/"

  // Match expressions like "monday afternoon"
  {
    ruleType: "tokens",
    pattern: ( $DAYOFWEEK $TIMEOFDAY ), 
    result: "TIME"
  }

See SequenceMatchRules for other possible uses of assignment rules.

Pattern Language

The TokensRegex pattern language is designed to be similar to the standard Java regular expressions. Many of the concepts from standard regular expressions for strings, such as wildcards and capturing groups, are supported by TokensRegex and use a similar syntax. The specifics of the language are described below. The main difference is in the syntax for matching individual tokens.

Defining regular expression for matching a single token

In Stanford CoreNLP, tokens are represented as a CoreMap (essentially a mapping from an attribute key (Class) to an attribute value (Object)). TokensRegex supports matching attributes by specifying the key and the value to be matched. Each token is indicated by [ <expression> ] where <expression> specifies how the attributes should be matched.

Basic expression - Basic expressions are enclosed with {} and have the form { <attr1>; <attr2>...}, where each <attr> specifies the <name> <matchfunc> <value>. See Attribute Match Expression for a summary of the constructs for matching attributes, and Annotation Keys for the attribute names.

As a shorthand, the token text can be matched directly by using "" (for exact string match) or // (for regular expression match).

Compound Expressions - Compound expressions are formed using !, &, and |.

Annotation Keys

Standard names for annotation keys:

NameAnnotation Class
wordCoreAnnotations.TextAnnotation
tagCoreAnnotations.PartOfSpeechTagAnnotation
lemmaCoreAnnotations.LemmaAnnotation
nerCoreAnnotations.NamedEntityTagAnnotation
normalizedCoreAnnotations.NormalizedNamedEntityTagAnnotation

Attribute Match Expressions

SymbolMeaning
All
[]Any token
Strings
"abc"The text of the token matches the string abc exactly.
/abc/The text of the token matches the regular expression specified by abc.
{ key:"abc" }The token annotation corresponding to key matches the string abc exactly.
{ key:/abc/ }The token annotation corresponding to key matches the regular expression specified by abc.
Numerics
{ key==number }The token annotation corresponding to key is equal to number.
{ key!=number }The token annotation corresponding to key is not equal to number.
{ key>number }The token annotation corresponding to key is greater than number.
{ key<number }The token annotation corresponding to key is less than number.
{ key>=number }The token annotation corresponding to key is greater than or equal to number.
{ key<=number }The token annotation corresponding to key is less than or equal to number.
Boolean checks
{ key::IS_NUM } The token annotation corresponding to key is a number.
{ key::IS_NIL } or { key::NOT_EXISTS } The token annotation corresponding to key does not exist.
{ key::NOT_NIL } or { key::EXISTS } The token annotation corresponding to key exist.

Defining regular expression for matching multiple tokens

Sometimes it is useful to be able to match string sequences across tokens irrespective of the tokenization.

SymbolMeaning
(?m){n,m} /abc/The text of the a sequence of tokens (between n to m tokens long) should matches the regular expression specified by abc. The tokens are concantenated for every match so avoid using with long sequences.

Defining regular expressions over token sequences

TokensRegex supports a similar set of constructs for matching sequences of tokens as regular expressions over strings.

SymbolMeaning
X YX followed by Y
X | YX or Y
X & YX and Y
Groups
(X)X as a capturing group
(?$name X)X as a capturing group with name name
(?: X)X as a non-capturing group
Greedy quantifiers
X?X, once or not at all
X*X, zero or more times
X+X, one or more times
X{n}X, exactly n times
X{n,}X, at least n times
X{n,m}X, at least n times but no more than m times
Reluctant quantifiers
X??X, once or not at all
X*?X, zero or more times
X+?X, one or more times
X{n}?X, exactly n times
X{n,}?X, at least n times
X{n,m}?X, at least n times but no more than m times

Summary of the different bracketing symbols used in TokenRegex

Download

TokensRegex is integrated in the Stanford suite of NLP tools, StanfordCoreNLP. Please download the entire suite from this page.

Questions

Questions, feedback, and bug reports/fixes can be sent to our mailing lists.

Mailing Lists

We have 3 mailing lists for TokensRegex, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:

  1. java-nlp-user This is the best list to post to in order to ask questions, make announcements, or for discussion among JavaNLP users. You have to subscribe to be able to use it. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
  2. java-nlp-announce This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
  3. java-nlp-support This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, you're better off joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.

Release History

Version 1.3.2 2012-05-22 Added TokensRegexAnnotator that can be configured using rules file
Version 1.2.0 2011-09-14 Initial version of TokensRegex library