edu.stanford.nlp.parser.lexparser
Class ChineseLexiconAndWordSegmenter

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.ChineseLexiconAndWordSegmenter
All Implemented Interfaces:
Lexicon, WordSegmenter, java.io.Serializable

public class ChineseLexiconAndWordSegmenter
extends java.lang.Object
implements Lexicon, WordSegmenter

This class lets you train a lexicon and segmenter at the same time.

Author:
Galen Andrew, Pi-Chuan Chang
See Also:
Serialized Form

Field Summary
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
ChineseLexiconAndWordSegmenter(ChineseLexicon lex, WordSegmenter seg)
           
ChineseLexiconAndWordSegmenter(java.lang.String segmenterFileOrUrl, Options op)
          Construct a new ChineseLexiconAndWordSegmenter.
ChineseLexiconAndWordSegmenter(Treebank trainTreebank, Options op)
           
 
Method Summary
static ChineseLexiconAndWordSegmenter getSegmenterDataFromFile(java.lang.String parserFileOrUrl, Options op)
           
protected static ChineseLexiconAndWordSegmenter getSegmenterDataFromSerializedFile(java.lang.String serializedFileOrUrl)
           
 UnknownWordModel getUnknownWordModel()
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(java.lang.String word)
          Checks whether a word is in the lexicon.
 void loadSegmenter(java.lang.String filename)
           
static void main(java.lang.String[] args)
          This method lets you train and test a segmenter relative to a Treebank.
 int numRules()
          Returns the number of rules (tag rewrites as word) in the Lexicon.
 void readData(java.io.BufferedReader in)
          Read the lexicon from the BufferedReader in the format written by writeData.
 java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc)
          Get an iterator over all rules (pairs of (word, POS)) for this word.
 float score(IntTaggedWord iTW, int loc)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 Sentence<Word> segmentWords(java.lang.String s)
           
 void setUnknownWordModel(UnknownWordModel uwm)
           
 void train(java.util.Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void writeData(java.io.Writer w)
          Write the lexicon in human-readable format to the Writer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChineseLexiconAndWordSegmenter

public ChineseLexiconAndWordSegmenter(ChineseLexicon lex,
                                      WordSegmenter seg)

ChineseLexiconAndWordSegmenter

public ChineseLexiconAndWordSegmenter(Treebank trainTreebank,
                                      Options op)

ChineseLexiconAndWordSegmenter

public ChineseLexiconAndWordSegmenter(java.lang.String segmenterFileOrUrl,
                                      Options op)
Construct a new ChineseLexiconAndWordSegmenter. This loads a segmenter file that was previously assembled and stored.

Throws:
java.lang.IllegalArgumentException - If segmenter data cannot be loaded
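
The load-or-fail contract described above can be illustrated with plain Java serialization. This is a minimal sketch, not the class's actual loading code: ToySegmenterLoader, saveToTempFile and load are hypothetical names, the sketch assumes the stored data is an ordinary serialized Java object, and it handles only local file paths, not URLs.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of the Stanford parser API.
class ToySegmenterLoader {

  // Writes any Serializable object to a fresh temporary file and returns its path.
  static String saveToTempFile(Serializable data) {
    try {
      File file = File.createTempFile("toy-segmenter", ".ser");
      file.deleteOnExit();
      try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
        out.writeObject(data);
      }
      return file.getPath();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads the object back. Any failure surfaces as IllegalArgumentException,
  // mirroring the contract documented for the constructor (assumed mapping).
  static Object load(String path) {
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalArgumentException("Segmenter data could not be loaded from " + path, e);
    }
  }

  public static void main(String[] args) {
    String path = saveToTempFile("toy segmenter data");
    System.out.println(load(path));
    try {
      load(path + ".missing");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Wrapping the checked I/O exceptions in an unchecked IllegalArgumentException keeps the caller's signature free of throws clauses, which matches a constructor that declares none.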
Method Detail

segmentWords

public Sentence<Word> segmentWords(java.lang.String s)
Specified by:
segmentWords in interface WordSegmenter

isKnown

public boolean isKnown(int word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(java.lang.String word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                            int loc)
Description copied from interface: Lexicon
Get an iterator over all rules (pairs of (word, POS)) for this word.

Specified by:
ruleIteratorByWord in interface Lexicon
Parameters:
word - The word, represented as an integer in Numberer
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)
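
To make the idea of "rules for a word" concrete, here is a minimal sketch of an index from word ids to (word, tag) integer pairs. ToyRuleIndex, WordTag and addRule are hypothetical names used only for illustration. The sketch ignores loc, as the implementation note says BaseLexicon does, and returning an empty iterator for an unseen word is an assumption of the sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a lexicon's rule table; not the Stanford parser API.
class ToyRuleIndex {

  // A (word, tag) pair of integer ids, standing in for IntTaggedWord.
  static final class WordTag {
    final int word;
    final int tag;

    WordTag(int word, int tag) {
      this.word = word;
      this.tag = tag;
    }
  }

  private final Map<Integer, List<WordTag>> rulesByWord = new HashMap<>();

  // Records that the given tag can rewrite as the given word.
  void addRule(int word, int tag) {
    rulesByWord.computeIfAbsent(word, k -> new ArrayList<>()).add(new WordTag(word, tag));
  }

  // Iterates over every recorded tagging of the word; loc is accepted but unused.
  Iterator<WordTag> ruleIteratorByWord(int word, int loc) {
    return rulesByWord.getOrDefault(word, Collections.<WordTag>emptyList()).iterator();
  }

  // Total number of (tag -> word) rules across all words.
  int numRules() {
    int total = 0;
    for (List<WordTag> rules : rulesByWord.values()) {
      total += rules.size();
    }
    return total;
  }

  public static void main(String[] args) {
    ToyRuleIndex index = new ToyRuleIndex();
    index.addRule(7, 1);
    index.addRule(7, 2);
    index.addRule(9, 1);
    Iterator<WordTag> it = index.ruleIteratorByWord(7, 0);
    while (it.hasNext()) {
      WordTag rule = it.next();
      System.out.println("tag " + rule.tag + " -> word " + rule.word);
    }
    System.out.println("rules: " + index.numRules());
  }
}
```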

numRules

public int numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.

Specified by:
numRules in interface Lexicon
Returns:
The number of rules (tag rewrites as word) in the Lexicon.

train

public void train(java.util.Collection<Tree> trees)
Description copied from interface: Lexicon
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon
Specified by:
train in interface WordSegmenter
Parameters:
trees - Trees to train on

score

public float score(IntTaggedWord iTW,
                   int loc)
Description copied from interface: Lexicon
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)

Specified by:
score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words, to change their probability distribution when they are sentence-initial.
Returns:
A score, usually log P(word|tag)
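
As an illustration of what a log P(word|tag) score means, the following sketch computes an unsmoothed relative-frequency estimate from observed (word, tag) pairs. It is not the estimator this class uses, which this page does not describe. ToyTagWordScorer and observe are hypothetical names, strings replace the integer ids of IntTaggedWord for readability, and the loc argument is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Stanford parser API.
class ToyTagWordScorer {
  private final Map<String, Integer> tagCounts = new HashMap<>();
  private final Map<String, Integer> pairCounts = new HashMap<>();

  // Counts one occurrence of the word carrying the tag.
  void observe(String word, String tag) {
    tagCounts.merge(tag, 1, Integer::sum);
    pairCounts.merge(tag + "\t" + word, 1, Integer::sum);
  }

  // Returns log(count(tag, word) / count(tag)), or negative infinity
  // when the pair was never observed (no smoothing in this sketch).
  float score(String word, String tag) {
    int pair = pairCounts.getOrDefault(tag + "\t" + word, 0);
    int total = tagCounts.getOrDefault(tag, 0);
    if (pair == 0 || total == 0) {
      return Float.NEGATIVE_INFINITY;
    }
    return (float) Math.log((double) pair / total);
  }

  public static void main(String[] args) {
    ToyTagWordScorer scorer = new ToyTagWordScorer();
    scorer.observe("dog", "NN");
    scorer.observe("cat", "NN");
    scorer.observe("run", "VV");
    System.out.println(scorer.score("dog", "NN")); // log(1/2)
    System.out.println(scorer.score("run", "VV")); // log(1/1)
    System.out.println(scorer.score("dog", "VV")); // unseen pair
  }
}
```

A real lexicon would smooth these counts and route unseen words through an unknown word model instead of returning negative infinity.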

loadSegmenter

public void loadSegmenter(java.lang.String filename)
Specified by:
loadSegmenter in interface WordSegmenter

readData

public void readData(java.io.BufferedReader in)
              throws java.io.IOException
Description copied from interface: Lexicon
Read the lexicon from the BufferedReader in the format written by writeData. (An optional operation.)

Specified by:
readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
java.io.IOException - If an I/O problem occurs

writeData

public void writeData(java.io.Writer w)
               throws java.io.IOException
Description copied from interface: Lexicon
Write the lexicon in human-readable format to the Writer. (An optional operation.)

Specified by:
writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
java.io.IOException - If an I/O problem occurs
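
The relationship between writeData and readData (the reader accepts what the writer produces) can be sketched as a round trip through a text format. The real file format is not specified on this page, so the tab-separated word, tag, count layout below is an assumption, and ToyLexiconIO, add, count and roundTrip are hypothetical names. The sketch also assumes words and tags contain no tab characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy lexicon with a made-up text format; not the Stanford parser API.
class ToyLexiconIO {
  // Keys are "word<TAB>tag"; a TreeMap keeps the written output in a stable order.
  private final Map<String, Integer> counts = new TreeMap<>();

  void add(String word, String tag, int count) {
    counts.merge(word + "\t" + tag, count, Integer::sum);
  }

  int count(String word, String tag) {
    return counts.getOrDefault(word + "\t" + tag, 0);
  }

  // Writes one "word<TAB>tag<TAB>count" line per entry.
  void writeData(Writer w) throws IOException {
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      w.write(entry.getKey() + "\t" + entry.getValue() + "\n");
    }
    w.flush();
  }

  // Reads lines in the format produced by writeData and adds them to this lexicon.
  void readData(BufferedReader in) throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      if (line.isEmpty()) {
        continue;
      }
      String[] fields = line.split("\t");
      if (fields.length != 3) {
        throw new IOException("Malformed lexicon line: " + line);
      }
      add(fields[0], fields[1], Integer.parseInt(fields[2]));
    }
  }

  // Writes the source lexicon to memory and reads it back into a fresh copy.
  static ToyLexiconIO roundTrip(ToyLexiconIO source) {
    try {
      StringWriter buffer = new StringWriter();
      source.writeData(buffer);
      ToyLexiconIO copy = new ToyLexiconIO();
      copy.readData(new BufferedReader(new StringReader(buffer.toString())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ToyLexiconIO lexicon = new ToyLexiconIO();
    lexicon.add("dog", "NN", 3);
    lexicon.add("run", "VV", 2);
    ToyLexiconIO copy = roundTrip(lexicon);
    System.out.println(copy.count("dog", "NN"));
  }
}
```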

getSegmenterDataFromFile

public static ChineseLexiconAndWordSegmenter getSegmenterDataFromFile(java.lang.String parserFileOrUrl,
                                                                      Options op)

getSegmenterDataFromSerializedFile

protected static ChineseLexiconAndWordSegmenter getSegmenterDataFromSerializedFile(java.lang.String serializedFileOrUrl)

main

public static void main(java.lang.String[] args)
This method lets you train and test a segmenter relative to a Treebank.

Implementation note: This method is largely cloned from LexicalizedParser's main method. Open question: should that method be made able to train segmenters, so that the two copies do not go out of sync?


getUnknownWordModel

public UnknownWordModel getUnknownWordModel()
Specified by:
getUnknownWordModel in interface Lexicon

setUnknownWordModel

public void setUnknownWordModel(UnknownWordModel uwm)
Specified by:
setUnknownWordModel in interface Lexicon


Stanford NLP Group