ChineseLexiconAndWordSegmenter (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.ChineseLexiconAndWordSegmenter

All Implemented Interfaces:

Lexicon, WordSegmenter, java.io.Serializable
```
public class ChineseLexiconAndWordSegmenter
extends java.lang.Object
implements Lexicon, WordSegmenter
```
This class lets you train a lexicon and segmenter at the same time.

Author:

Galen Andrew, Pi-Chuan Chang

See Also:

Serialized Form

Field Summary
- Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
  BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD

Constructor Summary

Constructors
Constructor and Description
`ChineseLexiconAndWordSegmenter(ChineseLexicon lex, WordSegmenter seg)`
`ChineseLexiconAndWordSegmenter(java.lang.String segmenterFileOrUrl, Options op)` Construct a new ChineseLexiconAndWordSegmenter.

Method Summary

Modifier and Type	Method and Description
`void`	`finishTraining()` Done collecting statistics for the lexicon.
`static ChineseLexiconAndWordSegmenter`	`getSegmenterDataFromFile(java.lang.String parserFileOrUrl, Options op)`
`protected static ChineseLexiconAndWordSegmenter`	`getSegmenterDataFromSerializedFile(java.lang.String serializedFileOrUrl)`
`UnknownWordModel`	`getUnknownWordModel()`
`void`	`incrementTreesRead(double weight)` If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
`void`	`initializeTraining(double numTrees)` Start training this lexicon on the expected number of trees.
`boolean`	`isKnown(int word)` Checks whether a word is in the lexicon.
`boolean`	`isKnown(java.lang.String word)` Checks whether a word is in the lexicon.
`void`	`loadSegmenter(java.lang.String filename)`
`static void`	`main(java.lang.String[] args)` This method lets you train and test a segmenter relative to a Treebank.
`int`	`numRules()` Returns the number of rules (tag rewrites as word) in the Lexicon.
`void`	`readData(java.io.BufferedReader in)` Read the lexicon from the BufferedReader in the format written by writeData.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)` Get an iterator over all rules (pairs of (word, POS)) for this word.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)` Same as `ruleIteratorByWord(int, int, String)`, but takes the word as a String, which the lexicon translates through its word index.
`float`	`score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)` Get the score of this word with this tag (as an IntTaggedWord) at this loc.
`java.util.List<HasWord>`	`segment(java.lang.String s)`
`void`	`setUnknownWordModel(UnknownWordModel uwm)`
`java.util.Set<java.lang.String>`	`tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)` Return the Set of tags used by this tagger (available after training the tagger).
`void`	`train(java.util.Collection<Tree> trees)` Trains this lexicon on the Collection of trees.
`void`	`train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees)`
`void`	`train(java.util.Collection<Tree> trees, double weight)`
`void`	`train(java.util.List<TaggedWord> sentence)`
`void`	`train(java.util.List<TaggedWord> sentence, double weight)` Not all subclasses support this particular method.
`void`	`train(TaggedWord tw, int loc, double weight)` Not all subclasses support this particular method.
`void`	`train(Tree tree)`
`void`	`train(Tree tree, double weight)`
`void`	`trainUnannotated(java.util.List<TaggedWord> sentence, double weight)` Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
`void`	`writeData(java.io.Writer w)` Write the lexicon in human-readable format to the Writer.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ChineseLexiconAndWordSegmenter
```
public ChineseLexiconAndWordSegmenter(ChineseLexicon lex,
                                      WordSegmenter seg)
```
  - ChineseLexiconAndWordSegmenter
```
public ChineseLexiconAndWordSegmenter(java.lang.String segmenterFileOrUrl,
                                      Options op)
```
    Construct a new ChineseLexiconAndWordSegmenter. This loads a segmenter file that was previously assembled and stored.
    
    Throws:
    
    java.lang.IllegalArgumentException - If segmenter data cannot be loaded
- Method Detail
  - segment
```
public java.util.List<HasWord> segment(java.lang.String s)
```
    Specified by:
    
    segment in interface WordSegmenter
  - isKnown
```
public boolean isKnown(int word)
```
    Description copied from interface: Lexicon
    
    Checks whether a word is in the lexicon.
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as an int
    
    Returns:
    
    Whether the word is in the lexicon
  - isKnown
```
public boolean isKnown(java.lang.String word)
```
    Description copied from interface: Lexicon
    
    Checks whether a word is in the lexicon.
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as a String
    
    Returns:
    
    Whether the word is in the lexicon
  - tagSet
```
public java.util.Set<java.lang.String> tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)
```
    Return the Set of tags used by this tagger (available after training the tagger).
    
    Specified by:
    
    tagSet in interface Lexicon
    
    Returns:
    
    The Set of tags used by this tagger
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Get an iterator over all rules (pairs of (word, POS)) for this word.
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
    
    Parameters:
    
    word - The word, represented as an integer in Index
    
    loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
    
    featureSpec - Additional word features like morphosyntactic information.
    
    Returns:
    
    An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which the lexicon translates through its word index.
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
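
    The relationship between the two overloads can be sketched as follows. This is a minimal, self-contained illustration, not the Stanford implementation: `RuleIteratorSketch`, its maps, and the use of `int[]{word, tag}` in place of IntTaggedWord are all hypothetical, and the `loc` and `featureSpec` parameters are left out. Returning an empty iterator for an unindexed word is also an assumption; the real class may handle unknown words differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: shows a String overload delegating to an int overload
// by way of a word index. Rules are stored as {wordId, tagId} pairs.
final class RuleIteratorSketch {
  private final Map<String, Integer> wordIndex = new HashMap<>();
  private final Map<String, Integer> tagIndex = new HashMap<>();
  private final Map<Integer, List<int[]>> rulesByWord = new HashMap<>();

  // Returns the id for key, assigning the next free id on first sight.
  private static int idFor(Map<String, Integer> index, String key) {
    Integer id = index.get(key);
    if (id == null) {
      id = index.size();
      index.put(key, id);
    }
    return id.intValue();
  }

  void addRule(String word, String tag) {
    int w = idFor(wordIndex, word);
    int t = idFor(tagIndex, tag);
    rulesByWord.computeIfAbsent(w, k -> new ArrayList<>()).add(new int[] {w, t});
  }

  // int overload: look the rules up directly by word id.
  Iterator<int[]> ruleIteratorByWord(int word) {
    List<int[]> rules = rulesByWord.get(word);
    if (rules == null) {
      return Collections.emptyIterator();
    }
    return rules.iterator();
  }

  // String overload: translate through the word index, then delegate.
  Iterator<int[]> ruleIteratorByWord(String word) {
    Integer id = wordIndex.get(word);
    if (id == null) {
      return Collections.emptyIterator(); // assumption: unindexed word has no rules
    }
    return ruleIteratorByWord(id.intValue());
  }

  static int count(Iterator<int[]> it) {
    int n = 0;
    while (it.hasNext()) {
      it.next();
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    RuleIteratorSketch lex = new RuleIteratorSketch();
    lex.addRule("ai", "VV"); // pinyin placeholders for Chinese words
    lex.addRule("ai", "NN");
    lex.addRule("wo", "PN");
    System.out.println("rules for ai: " + count(lex.ruleIteratorByWord("ai")));
  }
}
```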
  - numRules
```
public int numRules()
```
    Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.
    
    Specified by:
    
    numRules in interface Lexicon
    
    Returns:
    
    The number of rules (tag rewrites as word) in the Lexicon.
  - initializeTraining
```
public void initializeTraining(double numTrees)
```
    Description copied from interface: Lexicon
    
    Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)
    
    Specified by:
    
    initializeTraining in interface Lexicon
    
    Specified by:
    
    initializeTraining in interface WordSegmenter
  - train
```
public void train(java.util.Collection<Tree> trees)
```
    Description copied from interface: Lexicon
    
    Trains this lexicon on the Collection of trees. Can be called more than once with different collections of trees.
    
    Specified by:
    
    train in interface Lexicon
    
    Specified by:
    
    train in interface WordSegmenter
    
    Parameters:
    
    trees - Trees to train on
  - train
```
public void train(java.util.Collection<Tree> trees,
                  double weight)
```
    Specified by:
    
    train in interface Lexicon
  - train
```
public void train(Tree tree)
```
    Specified by:
    
    train in interface WordSegmenter
  - train
```
public void train(Tree tree,
                  double weight)
```
    Specified by:
    
    train in interface Lexicon
  - train
```
public void train(java.util.List<TaggedWord> sentence)
```
    Specified by:
    
    train in interface WordSegmenter
  - train
```
public void train(java.util.List<TaggedWord> sentence,
                  double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail at runtime when it is called.
    
    Specified by:
    
    train in interface Lexicon
  - trainUnannotated
```
public void trainUnannotated(java.util.List<TaggedWord> sentence,
                             double weight)
```
    Description copied from interface: Lexicon
    
    Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
    
    Specified by:
    
    trainUnannotated in interface Lexicon
  - incrementTreesRead
```
public void incrementTreesRead(double weight)
```
    Description copied from interface: Lexicon
    
    If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
    
    Specified by:
    
    incrementTreesRead in interface Lexicon
  - train
```
public void train(TaggedWord tw,
                  int loc,
                  double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail at runtime when it is called.
    
    Specified by:
    
    train in interface Lexicon
  - finishTraining
```
public void finishTraining()
```
    Description copied from interface: Lexicon
    
    Done collecting statistics for the lexicon.
    
    Specified by:
    
    finishTraining in interface Lexicon
    
    Specified by:
    
    finishTraining in interface WordSegmenter
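
    Taken together, the training methods above imply a call order: initializeTraining, then one or more train calls, then finishTraining. The sketch below walks through that lifecycle in a self-contained way. It does not use the Stanford classes: `LexiconLifecycleSketch` is a hypothetical toy model, sentences are modelled as lists of `String[]{word, tag}` pairs rather than TaggedWord objects, and the check that rejects training before initialization is an assumption of the sketch, not documented behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, NOT the Stanford implementation: a toy lexicon that
// only mirrors the initializeTraining / train / finishTraining call order.
final class LexiconLifecycleSketch {
  private final Map<String, Set<String>> tagsByWord = new HashMap<>();
  private double treesRead;
  private boolean training;

  void initializeTraining(double numTrees) {
    // numTrees is only a hint in this sketch; the toy model ignores it.
    treesRead = 0.0;
    training = true;
  }

  // A sentence is a list of {word, tag} pairs (an assumption made for brevity).
  void train(List<String[]> sentence, double weight) {
    if (!training) {
      throw new IllegalStateException("call initializeTraining first");
    }
    for (String[] pair : sentence) {
      tagsByWord.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
    }
    treesRead += weight; // one sentence counts as one (weighted) tree
  }

  void finishTraining() {
    training = false;
  }

  boolean isKnown(String word) {
    return tagsByWord.containsKey(word);
  }

  // Number of distinct (word, tag) pairs, i.e. "tag rewrites as word" rules.
  int numRules() {
    int n = 0;
    for (Set<String> tags : tagsByWord.values()) {
      n += tags.size();
    }
    return n;
  }

  double treesRead() {
    return treesRead;
  }

  // Runs the full lifecycle on two toy sentences (pinyin placeholders).
  static LexiconLifecycleSketch demo() {
    LexiconLifecycleSketch lex = new LexiconLifecycleSketch();
    lex.initializeTraining(2.0);
    lex.train(Arrays.asList(new String[] {"wo", "PN"}, new String[] {"ai", "VV"}), 1.0);
    lex.train(Arrays.asList(new String[] {"ai", "NN"}, new String[] {"ni", "PN"}), 1.0);
    lex.finishTraining();
    return lex;
  }

  public static void main(String[] args) {
    LexiconLifecycleSketch lex = demo();
    System.out.println("rules=" + lex.numRules() + " knows ai? " + lex.isKnown("ai"));
  }
}
```

    Because train can be called more than once, the toy model accumulates rules across calls; the same word seen with two tags yields two rules.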
  - score
```
public float score(IntTaggedWord iTW,
                   int loc,
                   java.lang.String word,
                   java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)
    
    Specified by:
    
    score in interface Lexicon
    
    Parameters:
    
    iTW - An IntTaggedWord pairing a word and POS tag
    
    loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
    
    word - The word itself; useful so we don't have to look it up in an index
    
    featureSpec - Additional word features like morphosyntactic information.
    
    Returns:
    
    A score, usually log P(word|tag)
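
    As a worked illustration of the return value, the sketch below computes log P(word | tag) as a plain relative-frequency estimate, count(word, tag) / count(tag). This is an assumption made for illustration only: the actual lexicon may smooth its counts and scores unknown words through its UnknownWordModel. `LogProbScoreSketch` and its methods are hypothetical names, and words and tags are plain strings rather than an IntTaggedWord.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: unsmoothed maximum-likelihood estimate of
// log P(word | tag). Not the scoring used by the Stanford lexicon.
final class LogProbScoreSketch {
  private final Map<String, Double> wordTagCounts = new HashMap<>();
  private final Map<String, Double> tagCounts = new HashMap<>();

  // Record one weighted observation of word tagged as tag.
  void observe(String word, String tag, double weight) {
    wordTagCounts.merge(word + "\t" + tag, weight, Double::sum);
    tagCounts.merge(tag, weight, Double::sum);
  }

  // log(count(word, tag) / count(tag)); negative infinity if never observed.
  float score(String word, String tag) {
    double joint = wordTagCounts.getOrDefault(word + "\t" + tag, 0.0);
    double total = tagCounts.getOrDefault(tag, 0.0);
    if (joint == 0.0 || total == 0.0) {
      return Float.NEGATIVE_INFINITY;
    }
    return (float) Math.log(joint / total);
  }

  public static void main(String[] args) {
    LogProbScoreSketch lex = new LogProbScoreSketch();
    lex.observe("ai", "VV", 3.0);  // pinyin placeholders for Chinese words
    lex.observe("kan", "VV", 1.0);
    // P(ai | VV) = 3 / 4, so the score is log(0.75), a small negative number.
    System.out.println(lex.score("ai", "VV"));
  }
}
```

    Scores are log probabilities, so they are at most 0, and a pair that was never observed maps to negative infinity in this unsmoothed sketch.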
  - loadSegmenter
```
public void loadSegmenter(java.lang.String filename)
```
    Specified by:
    
    loadSegmenter in interface WordSegmenter
  - readData
```
public void readData(java.io.BufferedReader in)
              throws java.io.IOException
```
    Description copied from interface: Lexicon
    
    Read the lexicon from the BufferedReader in the format written by writeData. (An optional operation.)
    
    Specified by:
    
    readData in interface Lexicon
    
    Parameters:
    
    in - The BufferedReader to read from
    
    Throws:
    
    java.io.IOException - If an I/O problem occurs
  - writeData
```
public void writeData(java.io.Writer w)
               throws java.io.IOException
```
    Description copied from interface: Lexicon
    
    Write the lexicon in human-readable format to the Writer. (An optional operation.)
    
    Specified by:
    
    writeData in interface Lexicon
    
    Parameters:
    
    w - The writer to output to
    
    Throws:
    
    java.io.IOException - If an I/O problem occurs
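
    The contract between readData and writeData is that one reads back what the other wrote. The sketch below shows that round trip through an in-memory buffer. The real file format is not described on this page, so the tab-separated `tag<TAB>word<TAB>count` lines used here are a made-up stand-in, and `LexiconRoundTripSketch` is a hypothetical class rather than the Stanford implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: a toy lexicon whose writeData output can be read
// back by readData. The line format is invented for this sketch.
final class LexiconRoundTripSketch {
  // Key is "tag<TAB>word"; a TreeMap keeps the output order stable.
  private final Map<String, Double> counts = new TreeMap<>();

  void add(String word, String tag, double count) {
    counts.merge(tag + "\t" + word, count, Double::sum);
  }

  // Writes one "tag<TAB>word<TAB>count" line per entry.
  void writeData(Writer w) throws IOException {
    for (Map.Entry<String, Double> e : counts.entrySet()) {
      w.write(e.getKey() + "\t" + e.getValue() + "\n");
    }
    w.flush();
  }

  // Reads lines in the format produced by writeData.
  void readData(BufferedReader in) throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      if (line.isEmpty()) {
        continue;
      }
      String[] fields = line.split("\t");
      add(fields[1], fields[0], Double.parseDouble(fields[2]));
    }
  }

  Map<String, Double> counts() {
    return counts;
  }

  // Serializes original to a string buffer and loads it into a fresh copy.
  static LexiconRoundTripSketch roundTrip(LexiconRoundTripSketch original) {
    try {
      StringWriter buffer = new StringWriter();
      original.writeData(buffer);
      LexiconRoundTripSketch copy = new LexiconRoundTripSketch();
      copy.readData(new BufferedReader(new StringReader(buffer.toString())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    LexiconRoundTripSketch lex = new LexiconRoundTripSketch();
    lex.add("ai", "VV", 3.0); // pinyin placeholders for Chinese words
    lex.add("wo", "PN", 1.5);
    System.out.println(roundTrip(lex).counts().equals(lex.counts()));
  }
}
```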
  - getSegmenterDataFromFile
```
public static ChineseLexiconAndWordSegmenter getSegmenterDataFromFile(java.lang.String parserFileOrUrl,
                                                                      Options op)
```
  - getSegmenterDataFromSerializedFile
```
protected static ChineseLexiconAndWordSegmenter getSegmenterDataFromSerializedFile(java.lang.String serializedFileOrUrl)
```
  - main
```
public static void main(java.lang.String[] args)
```
    This method lets you train and test a segmenter relative to a Treebank.
    Implementation note: This method is largely cloned from LexicalizedParser's main method. Open question: should that method be extended to train segmenters as well, so that the two copies do not drift out of sync?
  - getUnknownWordModel
```
public UnknownWordModel getUnknownWordModel()
```
    Specified by:
    
    getUnknownWordModel in interface Lexicon
  - setUnknownWordModel
```
public void setUnknownWordModel(UnknownWordModel uwm)
```
    Specified by:
    
    setUnknownWordModel in interface Lexicon
  - train
```
public void train(java.util.Collection<Tree> trees,
                  java.util.Collection<Tree> rawTrees)
```
    Specified by:
    
    train in interface Lexicon
