On Unix/Linux (or the Mac OS X command line), use GNU tar. (If you're using
Linux, you're almost certainly using GNU tar already.) For some reason we don't
understand, the archive doesn't seem to unpack with classic Unix tar. Make sure you
specify the -z option if you are not gunzipping it in
advance: tar -xzf stanford-ner-2009-01-16.tgz.
On Windows, it unpacks fine with most common tools, such as WinZip or 7-Zip. The latter is open source. (As of Sep 2007, WinRAR doesn't work: it apparently does not handle tar files correctly.)
On the Mac, just double-click it to unpack. The default unarchiver (BOMArchiveHelper) works fine.
If it won't unpack, you normally have a corrupted download (try downloading it again), a defective tool for unpacking tar files (normally evidenced by Unix file protection bits like 0664 appearing in file names), or some configuration error on your system, which we can't help with.
The documentation for training your own classifier is certainly
somewhere between bad and non-existent. But nevertheless, everything
you need is in the box, and you should
look through the Javadoc for at least the classes
CRFClassifier and
NERFeatureFactory.
Basically, the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and hold the NER class; existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions.
Here's a sample NER properties file:
trainFile = training-data.col
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
Oh, okay. Here's an example. Suppose we want to build an NER system for Jane Austen novels. We might train it on chapter 1 of Emma. Download that file. You can convert it to one token per line with our tokenizer (included in the box) with the following command:
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer jane-austen-emma-ch1.txt > jane-austen-emma-ch1.tok
We then need to make training data where we label the entities.
There are various annotation tools available, or you could do this by
hand in a text editor. One way is to first label every token as
"other" (the background class, for which the default label is "O" in our
software, though you can specify it via the backgroundSymbol property) and
then to hand-label the real entities in a text editor. The first step can be
done with Perl using this command:
perl -ne 'chomp; print "$_\tO\n"' jane-austen-emma-ch1.tok > jane-austen-emma-ch1.tsv
If you don't want to do the second step, you can skip to downloading our input file. We have marked only one entity type, PERS for person names, but you could easily add a second entity type, such as LOC for location, to this data.
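If Perl isn't available (for example on a stock Windows machine), the default-labeling step can be sketched in Java instead. This is only an illustration of what the Perl one-liner does; the class name DefaultLabeler and its method are hypothetical and are not part of the Stanford NER distribution.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not in stanford-ner.jar): labels every token as background. */
final class DefaultLabeler {

    /** Returns one "token<TAB>label" row per input line, like the Perl one-liner. */
    static List<String> label(List<String> tokens, String backgroundSymbol) {
        List<String> rows = new ArrayList<>();
        for (String token : tokens) {
            rows.add(token + "\t" + backgroundSymbol);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: tokenized file with one token per line; args[1]: output .tsv file
        List<String> tokens = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), label(tokens, "O"), StandardCharsets.UTF_8);
    }
}
```

Run it with the .tok file and the desired .tsv file as arguments, then hand-label the real entities as described above.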
You will then also want some test data to see how well the system is doing. You can download the text of chapter 2 of Emma and the gold standard annotated version of chapter 2.
Stanford NER CRF allows all properties to be specified on the command line, but it is easier to use a properties file. Here is a simple properties file (pretty much like the one above!), with explanations for each line given in comments, which start with "#":
# location of the training file
trainFile = jane-austen-emma-ch1.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it faster and smaller
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Here is that properties file as a downloadable link: austen.prop.
Once you have such a properties file, you can train a classifier with the command:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
An NER model will then be serialized to the location specified in the
properties file (ner-model.ser.gz)
once the program has completed. To check how well it works, you can run the test command:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile jane-austen-emma-ch2.tsv
In the output, the first column contains the input tokens, the second column the correct (gold) answers, and the third column the answers guessed by the classifier. Looking at the output, you can see that the classifier finds most of the person named entities but not all, due to the limited training data and limited features. (This code does not itself evaluate accuracy, precision, recall, etc.; you'll need to write or borrow your own script for that. A commonly used one is the Perl script conlleval.)
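As a rough starting point for such a script, here is a minimal Java sketch that computes token-level precision, recall, and F1 from the three-column output. It makes several assumptions: the output was redirected to a file, columns are separated by whitespace, and "O" is the background label. The class TokenScorer is hypothetical, and it scores individual tokens, whereas conlleval scores whole entities, so the numbers will not match conlleval's.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical token-level scorer for "token gold guess" lines; not part of Stanford NER. */
final class TokenScorer {

    /** Returns {precision, recall, F1}, counting only non-background labels. */
    static double[] score(List<String> lines, String backgroundSymbol) {
        int truePos = 0, falsePos = 0, falseNeg = 0;
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue; // blank or malformed line
            }
            String gold = cols[1];
            String guess = cols[2];
            boolean goldIsEntity = !gold.equals(backgroundSymbol);
            boolean guessIsEntity = !guess.equals(backgroundSymbol);
            if (goldIsEntity && gold.equals(guess)) {
                truePos++;
            } else {
                if (guessIsEntity) falsePos++; // guessed an entity label that is wrong
                if (goldIsEntity) falseNeg++;  // missed (or mislabeled) a gold entity token
            }
        }
        double p = truePos + falsePos == 0 ? 0.0 : (double) truePos / (truePos + falsePos);
        double r = truePos + falseNeg == 0 ? 0.0 : (double) truePos / (truePos + falseNeg);
        double f1 = p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }

    public static void main(String[] args) throws IOException {
        // args[0]: file holding the classifier's three-column test output
        double[] s = score(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8), "O");
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```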
So how do you apply this to make your own non-example NER model? You
need 1) a training data source, 2) a properties file specifying the
features you want to use, and (optional, but nice) 3) a test file
to see how you're doing. For the training data source, you need each word
to be on a separate line and annotated with the correct answer; all
columns must be tab-separated. If you want to explicitly specify more
features for a word, you can add them to the file in a new column and
then describe the new structure of your file in the map line in the
properties file. For example, if you added a third column to your data
with a new feature, you might write
"map = word=0,answer=1,mySpecialFeature=2". Right now, most arbitrarily
named features (like mySpecialFeature) will not work without making
modifications to the source code, but we are working on adding this
feature. In the meantime, there are known names that do work, like tag,
lemma, chunk, and web.
Once you've
annotated your data, you make a properties file with the features you
want. You can start from the example properties file and refer to the
NERFeatureFactory Javadoc for more possible features. Finally, you can test on
your annotated test data as shown above, or annotate more text using the
-textFile option rather than -testFile.
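Before training, it can save time to check that every line of the training file really has the columns the map line promises. The following is a minimal sketch of such a check; ColumnChecker is a hypothetical helper, not something shipped with Stanford NER, and it assumes that blank lines are acceptable in the file and can be skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check for tab-separated training data; not part of Stanford NER. */
final class ColumnChecker {

    /** Parses a map value such as "word=0,answer=1" into column name -> column index. */
    static Map<String, Integer> parseMap(String mapValue) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String entry : mapValue.split(",")) {
            String[] nameAndIndex = entry.trim().split("=");
            columns.put(nameAndIndex[0].trim(), Integer.parseInt(nameAndIndex[1].trim()));
        }
        return columns;
    }

    /** Returns the 1-based numbers of non-blank lines with too few tab-separated columns. */
    static List<Integer> badLines(List<String> lines, Map<String, Integer> columns) {
        int needed = 0;
        for (int index : columns.values()) {
            needed = Math.max(needed, index + 1);
        }
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are allowed and carry no token
            }
            if (line.split("\t", -1).length < needed) {
                bad.add(i + 1);
            }
        }
        return bad;
    }
}
```

A line that uses spaces instead of tabs shows up as having a single column, which is a common cause of confusing training errors.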
Here are some tips on memory usage for CRFClassifier:
Give Java more memory, for example java -mx2g.
Use a smaller qnSize. The default is 25. Using 10 is perfectly adequate. If you're short of memory, things will still work with much smaller values, even just a value of 2.
Set saveFeatureIndexToDisk = true. The feature names aren't actually needed while the core model estimation (optimization) code is run. This option saves them to a file before the optimizer runs, enabling the memory they use to be freed, and then loads the feature index from disk after optimization is finished.
Use a first-order CRF (maxLeft=1 and no features that refer to the answer class more than one away; it's okay to refer to word features any distance away). While the code supports arbitrary order CRFs, building second, third, or fourth order CRFs will greatly increase memory usage and normally isn't necessary. Remember: maxLeft refers to the size of the class contexts that your features use (that is, it is one smaller than the clique size). A first order CRF can still look arbitrarily far to the left or right to get information about the observed data context.
Generate fewer features. To see which features are generated, set printFeatures to true. CRFClassifier will then write (potentially huge) files in the current directory listing the features generated for each token position. Options that generate huge numbers of features include useWordPairs and useNGrams when maxNGramLeng is a large number.
Use the flag useObservedSequencesOnly=true. This makes it so that you can only label adjacent words with label sequences that were seen next to each other in the training data. For some kinds of data this actually gives better accuracy, for other kinds it is worse. But unless the label sequence patterns are dense, it will reduce your memory usage.
Set featureDiffThresh, for example featureDiffThresh=0.05. In training, CRFClassifier will train one model, drop all the features with weight (absolute value) beneath the given threshold, and then train a second model. Training thus takes longer, but the resulting model is smaller and faster at runtime, and usually has very similar performance for a reasonable threshold such as 0.05.
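To try these settings without editing a working properties file by hand, the tips above can be collected into a set of overrides. The sketch below uses only java.util.Properties and the property names and values mentioned in the tips; the class LowMemoryProps is hypothetical, and whether each override suits your data (especially useObservedSequencesOnly and featureDiffThresh) is something you would have to check yourself.

```java
import java.util.Properties;

/** Hypothetical helper that layers the memory-saving settings over existing properties. */
final class LowMemoryProps {

    /** Returns a copy of base with the memory-related properties overridden. */
    static Properties withLowMemoryOverrides(Properties base) {
        Properties merged = new Properties();
        merged.putAll(base);
        merged.setProperty("qnSize", "10");                    // default is 25
        merged.setProperty("saveFeatureIndexToDisk", "true");  // free feature names while optimizing
        merged.setProperty("maxLeft", "1");                    // first-order CRF
        merged.setProperty("useObservedSequencesOnly", "true"); // may raise or lower accuracy
        merged.setProperty("featureDiffThresh", "0.05");       // smaller model, longer training
        return merged;
    }
}
```

The result can be written out with Properties.store and passed to CRFClassifier via -prop as usual.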
Typically you would load a classifier from disk with the
CRFClassifier.getClassifier() method and then use it to
classify some text. See the example NERDemo file. The
two most flexible classification methods are both called
classify(). One takes a List<CoreLabel> and returns a
List<CoreLabel>; the other takes a String and returns a list of
List<CoreLabel>s. A CoreLabel has
everything you could need: the original token, its (Americanized, Penn
Treebank) normalized form used in the system, its begin and end character offsets, a
record of the whitespace around it, and the class assigned to the
token. Print some out and have a look. There are also a number of
other classification methods that take a String of text as input and
provide various forms of user-friendly output. The method
classifyToCharacterOffsets returns a list of triples of an
entity name and its begin and end character offsets. The method
classifyToString(String, String, boolean) will return you a
String with NER-classified text in one of several formats (plain text or
XML), with or without token normalization, and either preserving the
original spacing or using tokenized spacing. One of the versions of it
may well do what you would like to see.
Again, see NERDemo
for examples of the use of several (but not all) of these methods.
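If you work with the per-token results yourself, you will often want to merge adjacent tokens that share a label into a single entity span. The sketch below illustrates that grouping on a plain stand-in type. Token and Span here are hypothetical classes, not the library's CoreLabel or its triple type, and this is not how classifyToCharacterOffsets is implemented. One known limitation: two distinct entities of the same type that are directly adjacent get merged into one span.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical post-processing sketch: merges same-label neighbors into entity spans. */
final class SpanGrouper {

    /** Stand-in for a classified token (label plus character offsets); NOT CoreLabel. */
    static final class Token {
        final String label;
        final int begin;
        final int end;

        Token(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** An entity label with begin (inclusive) and end (exclusive) character offsets. */
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return label + "[" + begin + "," + end + ")";
        }
    }

    static List<Span> group(List<Token> tokens, String backgroundSymbol) {
        List<Span> spans = new ArrayList<>();
        String current = null; // label of the span being built, or null if none is open
        int begin = 0;
        int end = 0;
        for (Token token : tokens) {
            if (token.label.equals(current)) {
                end = token.end; // same label as the open span: extend it
                continue;
            }
            if (current != null) {
                spans.add(new Span(current, begin, end)); // label changed: close the span
            }
            if (token.label.equals(backgroundSymbol)) {
                current = null;
            } else {
                current = token.label;
                begin = token.begin;
                end = token.end;
            }
        }
        if (current != null) {
            spans.add(new Span(current, begin, end));
        }
        return spans;
    }
}
```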
Yes! You can load a model from inside a jar file, but you'll need to make
your own custom jar file. If you insert
into the jar file an NER model with name myModel and you put it
inside the jar file under the /classifiers/ path as
/classifiers/myModel, then you can load it when running
from a jar file with a command like:
java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadJarClassifier myModel -textFile sample.txt
You might also be interested in looking at
edu.stanford.nlp.ie.NERServer as an example of having the
CRFClassifier run on a socket, wait for text to annotate, and
then return the results. Here's a complete Unix/Linux example, run from inside
the folder of the distribution:
cp stanford-ner.jar stanford-ner-with-classifier.jar
jar -uf stanford-ner-with-classifier.jar classifiers/ner-eng-ie.crf-3-all2008.ser.gz
java -mx500m -cp stanford-ner-with-classifier.jar edu.stanford.nlp.ie.NERServer -loadJarClassifier ner-eng-ie.crf-3-all2008.ser.gz 9191
With a bit of work, we're sure you can adapt that example to work in a REST, SOAP, AJAX, or whatever system. If not, pay us a lot of money, and we'll work it out for you.
In recent versions of our NER code, we use the typesafe heterogeneous container pattern that Josh Bloch has talked about in various places such as this talk. It's neat but somewhat stresses the implementation of generic types in Java. The code is correct and should compile okay. It does compile okay with current versions of Sun javac v1.5 or v1.6 and with the current version of the Eclipse compiler. If it doesn't compile for you, you should upgrade your Java compiler or complain to the person who makes it.
NERDemo.java
included in the 2008-05-07 version of the software?
In this release (only) we made a booboo, and didn't update the
NERDemo.java file to correspond to changes in the main
code, and the supplied code doesn't compile.
Here is a downloadable version of NERDemo.java,
which will work with that version. The commands you need to use it are:
javac -cp "stanford-ner.jar:." NERDemo.java
java -cp "stanford-ner.jar:." NERDemo
java -mx400m -cp "stanford-ner.jar:." NERDemo classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz myFile.txt
(These are the commands for Linux or the Mac OS X command-line; for
Windows, replace the colons above with semicolons and the slash with a backslash.)
Several options are available from the command-line for determining the output format of the classifier.
You can choose an outputFormat of xml, inlineXML,
or slashTags (the default). See the example of each below (these are bash
shell command lines, the last bit of which suppresses message printing, so you can see just the output).
Even more power is available if you are using the API.
The classifier.classifyToString(String text, String outputFormat, boolean preserveSpaces)
method supports 6 output styles (of which 3 are available with the outputFormat property:
the XML output options preserve spaces, but the slash tags one doesn't). Even more flexibility can be
obtained by using the other classify* methods in the API. These return
classified versions of the input, which you can print out however your heart desires!
There are also methods like
classifyToCharacterOffsets(String), which return just the
entity spans.
See the examples in NERDemo.java.
$ cat PatrickYe.txt
I complained to Microsoft about Bill Gates. They told me to see the mayor of New York.
$
$ java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz -textFile PatrickYe.txt -outputFormat slashTags 2> /dev/null
I/O complained/O to/O Microsoft/ORGANIZATION about/O Bill/PERSON Gates/PERSON ./O They/O told/O me/O to/O see/O the/O mayor/O of/O New/LOCATION York/LOCATION ./O
$
$ java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz -textFile PatrickYe.txt -outputFormat inlineXML
I complained to <ORGANIZATION>Microsoft</ORGANIZATION> about <PERSON>Bill Gates</PERSON>. They told me to see the mayor of <LOCATION>New York</LOCATION>.
$
$ java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz -textFile PatrickYe.txt -outputFormat xml
<wi num="0" entity="O">I</wi> <wi num="1" entity="O">complained</wi> <wi num="2" entity="O">to</wi> <wi num="3" entity="ORGANIZATION">Microsoft</wi> <wi num="4" entity="O">about</wi> <wi num="5" entity="PERSON">Bill</wi> <wi num="6" entity="PERSON">Gates</wi><wi num="7" entity="O">.</wi> <wi num="0" entity="O">They</wi> <wi num="1" entity="O">told</wi> <wi num="2" entity="O">me</wi> <wi num="3" entity="O">to</wi> <wi num="4" entity="O">see</wi> <wi num="5" entity="O">the</wi> <wi num="6" entity="O">mayor</wi> <wi num="7" entity="O">of</wi> <wi num="8" entity="LOCATION">New</wi> <wi num="9" entity="LOCATION">York</wi><wi num="10" entity="O">.</wi>
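If you run the classifier from the command line and only want the entities, the inlineXML output shown above is easy to post-process. Here is a minimal sketch that pulls out each tagged entity with a regular expression. It assumes entity labels consist only of uppercase letters, as in the sample output, and it does not unescape any XML entities in the text; InlineXmlEntities is a hypothetical class, not part of the distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lists the entities found in one string of inlineXML output. */
final class InlineXmlEntities {

    // <TYPE>text</TYPE>; the backreference makes the closing tag repeat the opening tag name.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z]+)>(.*?)</\\1>");

    /** Returns one "TYPE<TAB>text" entry per tagged entity, in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ENTITY.matcher(inlineXml);
        while (matcher.find()) {
            found.add(matcher.group(1) + "\t" + matcher.group(2));
        }
        return found;
    }
}
```

Applied to the inlineXML sample line, this yields three entries: ORGANIZATION with Microsoft, PERSON with Bill Gates, and LOCATION with New York.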
You can discuss other topics with Stanford NER developers and users by
joining
the java-nlp-user mailing list
(via a webpage). Or you can send other questions and feedback to
java-nlp-support@lists.stanford.edu.
Site design by Bill MacCartney