RegexNER

Here's a little example of what you can do with RegexNER. Let's start with a small file with information about Julia Gillard from Wikipedia. If it is run through Stanford CoreNLP with this command:

java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner' -file JuliaGillard.txt

then the output is this file (rendered in your browser with XSLT), or at least it was in July 2013.

The output isn't bad, but we might want to improve it. For example, we might want to label university degrees as entities. We can do that in a simple rule-based manner with RegexNER.

The simplest rule file has two tab-separated fields on each line. The first field contains a sequence of one or more space-separated regular expressions, one per token. If the regular expressions match a sequence of tokens, those tokens are relabeled with the category in the second field. So our first RegexNER file is:

Bachelor of (Arts|Laws|Science|Engineering)	DEGREE

and we now run with this command, adding RegexNER:

java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -file JuliaGillard.txt -regexner.mapping jg-regexner.txt
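To make the matching behavior of the Bachelor rule concrete, here is a minimal Java sketch of how a two-field rule could be applied to an already tokenized sentence. It is not the RegexNER implementation: the class name `RuleSketch` and the method `apply` are hypothetical, and the sketch assumes that each space-separated regular expression must match one whole token and that `O` marks tokens outside any entity.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical illustration of the two-field rule format; not the RegexNER source. */
public class RuleSketch {

    /** Applies one "regexes<TAB>LABEL" rule and returns one label per token. */
    static String[] apply(String ruleLine, String[] tokens) {
        String[] fields = ruleLine.split("\t");
        String[] regexes = fields[0].split(" ");   // one regex per token
        String label = fields[1];

        Pattern[] patterns = new Pattern[regexes.length];
        for (int i = 0; i < regexes.length; i++) {
            patterns[i] = Pattern.compile(regexes[i]);
        }

        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "O");                  // assumed background label

        // Try the regex sequence at every possible start position.
        for (int start = 0; start + patterns.length <= tokens.length; start++) {
            boolean matched = true;
            for (int j = 0; j < patterns.length; j++) {
                // matches() requires the regex to cover the whole token
                if (!patterns[j].matcher(tokens[start + j]).matches()) {
                    matched = false;
                    break;
                }
            }
            if (matched) {
                Arrays.fill(labels, start, start + patterns.length, label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String rule = "Bachelor of (Arts|Laws|Science|Engineering)\tDEGREE";
        String[] tokens = {"She", "holds", "a", "Bachelor", "of", "Laws", "."};
        // prints: O O O DEGREE DEGREE DEGREE O
        System.out.println(String.join(" ", apply(rule, tokens)));
    }
}
```

Because each regular expression is tested against a whole token, `Bachelor of Laws` is labeled as three tokens, while a token such as `Lawsuit` would not match the third expression.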

Also, if you look at the original output, you will see a couple of mistakes. It misrecognizes Lalor as a PERSON when it is a LOCATION (an electoral seat), and it sometimes fails to tag Labor as an ORGANIZATION when it is not followed by Party. To fix the first error, you need one more concept: RegexNER will not overwrite an existing entity assignment unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. The second error needs no such permission, because the missed Labor tokens carry no entity label. This gives us the following RegexNER file, which you can also download.

Bachelor of (Arts|Laws|Science|Engineering)	DEGREE
Lalor	LOCATION	PERSON
Labor	ORGANIZATION

The updated CoreNLP output can be found here.
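The role of the third column can be sketched in a few lines of Java. This is an illustration of the behavior described above, not the RegexNER source: the class `OverwriteSketch` and its methods are hypothetical, and the sketch assumes that `O` is the label of a token with no entity assignment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the optional third column; not the RegexNER source. */
public class OverwriteSketch {

    /** Returns the entity types listed in the third column, or an empty set if it is absent. */
    static Set<String> overwritable(String ruleLine) {
        String[] fields = ruleLine.split("\t");
        Set<String> types = new HashSet<>();
        if (fields.length >= 3) {
            types.addAll(Arrays.asList(fields[2].split(",")));
        }
        return types;
    }

    /** A token may be relabeled if it is unlabeled ("O", assumed) or its label is listed. */
    static boolean mayRelabel(String ruleLine, String existingLabel) {
        return existingLabel.equals("O") || overwritable(ruleLine).contains(existingLabel);
    }

    public static void main(String[] args) {
        System.out.println(mayRelabel("Lalor\tLOCATION\tPERSON", "PERSON")); // true
        System.out.println(mayRelabel("Labor\tORGANIZATION", "O"));          // true
        System.out.println(mayRelabel("Labor\tORGANIZATION", "LOCATION"));   // false
    }
}
```

Under these assumptions the Lalor rule may replace PERSON because that type is listed, while the Labor rule only fills in tokens that were left unlabeled.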

Of course, these last two rules for relabeling tokens are rather dangerous. If you ran this RegexNER file on the Wikipedia page for Kieran Lalor, it would mess up the output badly. Similarly, the rule for Labor would be a bad idea in an article that discusses labor unions.

If you want more control over checking words in the context, or checking parts of speech, then you should look into TokensRegex, which is a more powerful (but more complex) framework for writing rules for token labeling. Among other things, it lets you have a whole library of rule files. In general, though, writing rules that cover all cases is a difficult enterprise! This is one reason why statistical classifiers have become dominant: they are good at integrating various sources of evidence. Nevertheless, a tool like RegexNER can be very useful as an overlay that corrects or augments the output of a statistical NLP system like Stanford NER.

This example ran everything from the command line, but you can also easily use RegexNER in code. The RegexNER rules can be in a regular file, in a resource that is on your CLASSPATH, or even specified by a URL. You tell CoreNLP to load RegexNER, and where to find the rules file, by providing an appropriate Properties object when creating Stanford CoreNLP:

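import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Run regexner after ner, and point it at the rules file (here a CLASSPATH resource).
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.setProperty("regexner.mapping", "org/foo/resources/jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);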