The documentation for training your own classifier is certainly somewhere between bad and non-existent. But nevertheless, you should look through the Javadoc for at least the classes CRFClassifier and NERFeatureFactory.
Basically, the training data should be in tab-separated columns, and you define the meaning of those columns via a map. You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions....
Here's a sample NER properties file:
trainFile = training-data.col
serializeTo = ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
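If you want to inspect such a file outside of Java, the following Python sketch shows how the key = value lines and the map setting fit together. The names parse_properties and parse_map are made up for this illustration and are not part of Stanford NER; the real loader is Java's Properties class, which handles more syntax (escapes, line continuations) than this sketch does.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so 'map = word=0,answer=1' keeps its value intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_map(spec):
    """Turn 'word=0,answer=1' into {'word': 0, 'answer': 1}."""
    columns = {}
    for pair in spec.split(","):
        name, _, index = pair.partition("=")
        columns[name.strip()] = int(index)
    return columns


sample = """
trainFile = training-data.col
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useWord=true
"""

props = parse_properties(sample)
columns = parse_map(props["map"])
print(props["trainFile"])  # training-data.col
print(columns)             # {'word': 0, 'answer': 1}
```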
Oh, okay. Here's an example. Suppose we want to build an NER system for Jane Austen novels. We might train it on chapter 1 of Emma. Download that file. You can convert it to one token per line using our tokenizer with the following command:
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer jane-austen-emma-ch1.txt > jane-austen-emma-ch1.tok
We then need to make training data where we label the entities. There are various annotation tools available, or you could do this by hand in a text editor. One way is to start by labeling every token as "other" (for which the default label is "O" in our software) and then to hand-label the real entities in a text editor. The first step can be done with Perl using this command:
perl -ne 'chomp; print "$_\tO\n"' jane-austen-emma-ch1.tok > jane-austen-emma-ch1.tsv
and if you don't want to do the second step by hand, you can skip to downloading our input file. We have marked only one entity type, PERS for person name, but you could easily add a second entity type, such as LOC for location, to this data.
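The two labeling steps can also be sketched in Python. This is only an illustration: the PERSON_TOKENS set is a hypothetical hand-made list, and matching single tokens against a list is a crude stand-in for careful hand annotation (it will mislabel any word that is only sometimes part of a name).

```python
# Hypothetical hand-made list of tokens to mark as person names.
PERSON_TOKENS = {"Emma", "Woodhouse", "Taylor", "Weston"}


def label_tokens(tokens, person_tokens=PERSON_TOKENS):
    """Return 'token<TAB>label' lines: PERS for listed tokens, O otherwise."""
    lines = []
    for token in tokens:
        label = "PERS" if token in person_tokens else "O"
        lines.append(f"{token}\t{label}")
    return lines


tokens = ["Emma", "Woodhouse", ",", "handsome", ",", "clever"]
for line in label_tokens(tokens):
    print(line)
```

You would still want to read through the result in a text editor and correct it before training on it.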
You will then also want some test data to see how well the system is doing. You can download the text of chapter 2 of Emma or the gold standard annotated version of chapter 2.
Stanford NER CRF allows all properties to be specified on the command line, but it is easier to use a properties file. Here is a simple properties file (pretty much like the one above!), with explanations for each line in comments, which start with "#":
#location of the training file
trainFile = jane-austen-emma-ch1.tsv
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
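To make the noMidNGrams and maxNGramLeng comments concrete, here is a small Python illustration of which character n-grams remain when only those touching the beginning or end of the word are kept. It is a sketch of the idea described in the comments, not the actual NERFeatureFactory code, which may differ in details such as boundary markers or how the whole word is treated; edge_ngrams is a made-up name.

```python
def edge_ngrams(word, max_len=6):
    """Character n-grams of length <= max_len that start at the beginning
    of the word or finish at its end (so no 'middle' n-grams)."""
    longest = min(max_len, len(word))
    prefixes = [word[:n] for n in range(1, longest + 1)]
    suffixes = [word[-n:] for n in range(1, longest + 1)]
    # Keep the order but drop duplicates (a short word is both prefix and suffix).
    seen = []
    for gram in prefixes + suffixes:
        if gram not in seen:
            seen.append(gram)
    return seen


print(edge_ngrams("Emma"))  # ['E', 'Em', 'Emm', 'Emma', 'a', 'ma', 'mma']
```

Note that "mm" is absent: it touches neither edge of the word, which is the kind of n-gram the noMidNGrams comment says is left out.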
Once you make such a properties file, you can train a classifier with the command:
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop propertiesFile.txt
(where "propertiesFile.txt" should be the location of the properties file you just created)
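Before starting a training run, it can save time to check that the training file really has the tab-separated layout that the map line promises. The check below is a sketch in Python; check_columns is a hypothetical helper, and skipping blank lines is our assumption about how the file is laid out.

```python
def check_columns(lines, expected_columns=2):
    """Return (line_number, text) pairs for rows whose tab-separated
    column count differs from expected_columns. Blank lines are skipped."""
    problems = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row.strip():
            continue
        if len(row.split("\t")) != expected_columns:
            problems.append((number, row))
    return problems


# The last row uses a space instead of a tab, so it is reported.
sample = ["Emma\tPERS\n", "was\tO\n", "\n", "clever O\n"]
print(check_columns(sample))  # [(4, 'clever O')]
```

To run it on a real file, pass an open file object in place of the sample list, since iterating over a file yields its lines.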
Once the program has completed, an NER model will have been serialized to the location you specified. To check how well it works, you can run the test command:
java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile jane-austen-emma-ch2.tsv
By looking at the output, you can see that (at least with the person annotation) the classifier finds most of the named entities but not all, because of the small training data and limited features.
So how do you apply this to make your own non-example NER model? You need 1) a training data source, 2) a properties file specifying the features you want to use, and (optional, but often nice) 3) a test file to see how you're doing.
For the training data source, you need each word to be on a separate line and annotated with the correct answer; all columns must be tab-separated. If you want to explicitly specify more features for the word, you can add these to the file in a new column and then describe the new structure of your file in the map line of the properties file. For example, if you added a third column to your data with a new feature, you might write "map = word=0,answer=1,mySpecialFeature=2". Right now, most arbitrarily named features (like mySpecialFeature) will not work without making modifications to the source code, but we are working on adding this feature.
Once you've annotated your data, you make a properties file with the features you want. You can use the example properties file, and refer to NERFeatureFactory for more possible features. Finally, you can test on your annotated test data as shown above, or annotate more text by using the -textFile option rather than -testFile.
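To put numbers on how well the classifier does on the test file, you can score the -testFile output yourself. The Python sketch below assumes each output line holds three tab-separated columns (token, gold label, predicted label); that is our reading of the output format and is worth checking against what your version prints. It scores individual tokens rather than whole entities, which is simpler than the entity-level scores usually reported for NER. The function names and sample rows are made up for illustration.

```python
def read_rows(lines):
    """Split non-blank 'token<TAB>gold<TAB>predicted' lines into tuples."""
    return [tuple(line.rstrip("\n").split("\t")) for line in lines if line.strip()]


def token_scores(rows, label="PERS"):
    """Token-level precision, recall and F1 for one label."""
    tp = fp = fn = 0
    for _token, gold, predicted in rows:
        if predicted == label and gold == label:
            tp += 1  # correctly predicted label
        elif predicted == label:
            fp += 1  # predicted the label where gold disagrees
        elif gold == label:
            fn += 1  # missed a gold label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up rows: one hit, one false alarm, one miss.
output = ["Mr.\tO\tO\n", "Weston\tPERS\tPERS\n",
          "Highbury\tO\tPERS\n", "Emma\tPERS\tO\n"]
print(token_scores(read_rows(output)))  # (0.5, 0.5, 0.5)
```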
Please send any other questions or feedback, or extensions and bugfixes, to java-nlp-support@lists.stanford.edu.