
Package edu.stanford.nlp.ling.tokensregex

This package contains a library, TokensRegex, for matching regular expressions over tokens.


Package edu.stanford.nlp.ling.tokensregex Description

This package contains a library, TokensRegex, for matching regular expressions over tokens. TokensRegex is incorporated into the TokensRegexAnnotator, the TokensRegexNERAnnotator, and the SUTime functionality in NERCombinerAnnotator.

Rules for extracting expressions using TokensRegex

TokensRegex provides a language for specifying rules to extract expressions over a token sequence.

CoreMapExpressionExtractor and SequenceMatchRules describe the language and how the extraction rules are created.

Core classes for token sequence matching using TokensRegex

At the core of TokensRegex are the TokenSequencePattern and TokenSequenceMatcher classes, which can be used to match patterns over sequences of tokens. Their usage follows the paradigm of the Java regular expression library java.util.regex, except that matching is done over a List<CoreMap> instead of a String.

Example:

  List<CoreLabel> tokens = ...;
  TokenSequencePattern pattern = TokenSequencePattern.compile(...);
  TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
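
For comparison, the java.util.regex idiom that this design follows can be sketched with the standard library alone: compile a pattern, bind it to an input, then loop over the matches. The class name RegexParadigmDemo, the helper findAll, and the sample text are illustrative assumptions and are not part of TokensRegex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows the java.util.regex compile/matcher/find idiom.
final class RegexParadigmDemo {

    // Returns every substring of text matched by regex, in order of appearance.
    static List<String> findAll(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);  // compile the pattern once
        Matcher matcher = pattern.matcher(text);   // bind it to one input
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {                   // advance to the next match
            matches.add(matcher.group());          // text of the current match
        }
        return matches;
    }

    public static void main(String[] args) {
        // Capitalized words in a sample sentence.
        System.out.println(findAll("[A-Z][a-z]+", "New York and San Francisco"));
    }
}
```

The token-level example above has the same shape: TokenSequencePattern.compile plays the role of Pattern.compile, and getMatcher plays the role of Pattern.matcher, with a token list in place of the String.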
 

The classes SequenceMatcher and SequencePattern can be used to build classes for recognizing regular expressions over sequences of arbitrary types.
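
To make the idea of matching over sequences of arbitrary types concrete, here is a minimal standalone sketch. It does not use SequencePattern or SequenceMatcher and says nothing about how they are implemented; the class SimpleSequenceFinder and the method indexOfMatch are hypothetical names. The "pattern" is assumed to be a list of per-element predicates that must hold for consecutive elements.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: contiguous matching of a predicate list over a List<T>.
final class SimpleSequenceFinder {

    // Returns the start index of the first window in which every element
    // satisfies the predicate at the same position, or -1 if there is none.
    static <T> int indexOfMatch(List<T> items, List<Predicate<T>> pattern) {
        for (int start = 0; start + pattern.size() <= items.size(); start++) {
            boolean matched = true;
            for (int i = 0; i < pattern.size() && matched; i++) {
                matched = pattern.get(i).test(items.get(start + i));
            }
            if (matched) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(4, 8, 5, 2, 7);
        Predicate<Integer> even = n -> n % 2 == 0;
        Predicate<Integer> odd = n -> n % 2 != 0;
        // Looks for an odd number directly followed by an even one.
        System.out.println(indexOfMatch(numbers, Arrays.asList(odd, even)));
    }
}
```

Because the element type is a type parameter, the same method works for integers, strings, or token objects; only the predicates change.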

Utility classes

TokensRegex also offers a group of utility classes.

MultiPatternMatcher provides utility functions for finding expressions with multiple patterns. For instance, MultiPatternMatcher.findNonOverlapping(java.util.List<? extends T>) finds all non-overlapping subsequences matched by a given set of patterns.
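
The notion of keeping only non-overlapping matches can be sketched independently of the library. The code below is not MultiPatternMatcher and does not claim to reproduce its ordering rules; it assumes a simple preference (longer spans first, then earlier start) purely for illustration. The class NonOverlappingSpans and the method select are hypothetical names, and each span is an int pair {start, end} with end exclusive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: greedy selection of non-overlapping {start, end} spans.
final class NonOverlappingSpans {

    static List<int[]> select(List<int[]> candidates) {
        // Assumed preference: longer spans first, ties broken by earlier start.
        Comparator<int[]> byLengthDesc =
                (a, b) -> Integer.compare(b[1] - b[0], a[1] - a[0]);
        Comparator<int[]> byStart = (a, b) -> Integer.compare(a[0], b[0]);

        List<int[]> sorted = new ArrayList<>(candidates);
        sorted.sort(byLengthDesc.thenComparing(byStart));

        List<int[]> chosen = new ArrayList<>();
        for (int[] span : sorted) {
            boolean overlaps = false;
            for (int[] kept : chosen) {
                // Half-open intervals overlap when each starts before the other ends.
                if (span[0] < kept[1] && kept[0] < span[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                chosen.add(span);
            }
        }
        chosen.sort(byStart);  // report the kept spans in sequence order
        return chosen;
    }

    public static void main(String[] args) {
        List<int[]> candidates = new ArrayList<>();
        candidates.add(new int[] {0, 2});
        candidates.add(new int[] {1, 4});
        candidates.add(new int[] {4, 5});
        for (int[] span : select(candidates)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```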

To find the character offsets of multi-word expressions in a String, you can also use MultiWordStringMatcher.findTargetStringOffsets(java.lang.String, java.lang.String).
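
As a rough picture of what character offsets for a multi-word expression look like, the following standalone sketch collects {begin, end} pairs for exact substring occurrences using String.indexOf. It is not MultiWordStringMatcher, which may treat whitespace, case, or word boundaries differently; the class PhraseOffsets and the method findOffsets are hypothetical names, and end offsets are assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: exact-substring offsets of a phrase within a text.
final class PhraseOffsets {

    // Returns {begin, end} character offsets (end exclusive) of each
    // non-overlapping exact occurrence of target in text, left to right.
    static List<int[]> findOffsets(String text, String target) {
        List<int[]> offsets = new ArrayList<>();
        if (target.isEmpty()) {
            return offsets;  // an empty target has no meaningful occurrences
        }
        int from = 0;
        while (true) {
            int begin = text.indexOf(target, from);
            if (begin < 0) {
                break;
            }
            int end = begin + target.length();
            offsets.add(new int[] {begin, end});
            from = end;  // continue searching after this occurrence
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "New York is not York, but New York is.";
        for (int[] span : findOffsets(text, "New York")) {
            System.out.println(span[0] + "-" + span[1] + ": "
                    + text.substring(span[0], span[1]));
        }
    }
}
```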

Author:
Angel Chang (angelx@stanford.edu)

Stanford NLP Group