Stanford Segmenter FAQ

Questions

  1. How can I retrain the Chinese Segmenter?

Please send any other questions or feedback, or extensions and bugfixes to java-nlp-user@lists.stanford.edu.


Questions with answers

  1. How can I retrain the Chinese Segmenter?

    In general you need four things in order to retrain the Chinese Segmenter. You will need a data set with segmented text, a dictionary with words that the segmenter should know about, and various small data files for other feature generators.

    The most important thing you need is a data file with text segmented according to the standard you want to use. For example, for the CTB model we distribution, which follows the Penn Chinese Treebank segmentation standard, we use the Chinese Treebank 7.0 data set.

    You will need to convert your data set to text in the following format:

    中国 进出口 银行 与 中国 银行 加强 合作
    新华社 北京 十二月 二十六日 电 ( 记者 周根良 )
    ...
    
    Each individual sentence is on its own line, and spaces are used to denote word breaks.

    Some data sets will come in the format of Penn trees. There are various ways to convert this to segmented text; one way which uses our tool suite is to use the Treebanks tool:

    java edu.stanford.nlp.trees.Treebanks -words ctb7.mrg
    
    The Treebanks tool is not included in the segmenter download, but it is available in the corenlp download.

    Another useful tool is a dictionary of known words. This should include named entities such as people, places, companies, etc. which the model might segment as a single word. This is not actually required, but it will help identity named entities which the segmenter has not seen before. For example, our file of named entities includes names such as

    吳毅成
    吳浩康
    吳淑珍
    ...
    

    To build a dictionary usable by our model, you want to collect lists of words and then use the ChineseDictionary tool to combine them into one serialized dictionary.

    java edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts names.txt,places.txt,words.txt,... -output dict.ser.gz
    
    If you want to use our existing dictionary as a starting point, you can include it as one of the filenames. Words have a maximum lexicon length (probably 6, see the ChineseDictionary source) and words longer than that will be truncated. There is also handling of words with a "mid dot" character; this occasionally shows up in machine translation, and if a word with a mid dot shows up in the dictionary, we accept the word either with or without the dot.

    You will also need a properties file which tells the classifier which features to use. An example properties file is included in the data directory of the segmenter download.

    Finally, some of the features used by the existing models require additional data files, which are included in the data/dict directory of the segmenter download. To figure out which files correspond to which features, please search in the source code for the appropriate filename. You can probably just reuse the existing data files.

    Once you have all of these components, you can then train a new model with a command line such as

      java -mx15g edu.stanford.nlp.ie.crf.CRFClassifier -prop ctb.prop -serDictionary dict-chris6.ser.gz -sighanCorporaDict data -trainFile train.txt -serializeTo newmodel.ser.gz > newmodel.log 2> newmodel.err
    
    It really does take a lot of memory to train a new classifier. The NER FAQ presents some tips for using less memory, if needed.