edu.stanford.nlp.parser.lexparser
Class ChineseLexicon
java.lang.Object
edu.stanford.nlp.parser.lexparser.BaseLexicon
edu.stanford.nlp.parser.lexparser.ChineseLexicon
- All Implemented Interfaces:
- Lexicon, Serializable
public class ChineseLexicon
- extends BaseLexicon
A lexicon class for Chinese. Extends the (English) BaseLexicon class,
overriding its score and train methods to include a
ChineseUnknownWordModel.
- Author:
- Roger Levy
- See Also:
- Serialized Form
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon
DEBUG_LEXICON, DEBUG_LEXICON_SCORE, lastSentencePosition, lastSignatureIndex, lastWordToSignaturize, nullTag, nullWord, rulesWithWord, seenCounter, smartMutation, smoothInUnknownsThreshold, tags, unknownLevel, unknownPrefixSize, unknownSuffixSize, unSeenCounter, words
Method Summary
float score(IntTaggedWord iTW, int loc)
Get the score of this word with this tag (as an IntTaggedWord) at this location.
void train(Collection<Tree> trees)
Trains a lexicon on a collection of trees.
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon
addAll, addAll, addTagging, evaluateCoverage, getBaseTag, getSignature, getSignatureIndex, initRulesWithWord, isKnown, isKnown, listOfLabeledWordsToEvents, listToEvents, main, numRules, printLexStats, readData, ruleIteratorByWord, ruleIteratorByWord, train, train, train, trainWithExpansion, treeToEvents, treeToEvents, tune, writeData
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
useCharBasedUnknownWordModel
public static boolean useCharBasedUnknownWordModel
useGoodTuringUnknownWordModel
public static boolean useGoodTuringUnknownWordModel
ChineseLexicon
public ChineseLexicon(Options.LexOptions op)
train
public void train(Collection<Tree> trees)
- Trains a lexicon on a collection of trees.
- Specified by:
train
in interface Lexicon
- Overrides:
train
in class BaseLexicon
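The implementation notes for score refer to the counts c_W, c_TW and c_T, which are the kind of statistics a training pass has to gather. The following is a minimal sketch of such counting, not the library's code: it works on plain (word, tag) string pairs instead of Tree objects, and the class LexiconCountSketch, its fields and its methods are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of edu.stanford.nlp.
class LexiconCountSketch {
    final Map<String, Integer> wordCounts = new HashMap<>();    // c_W
    final Map<String, Integer> tagCounts = new HashMap<>();     // c_T
    final Map<String, Integer> tagWordCounts = new HashMap<>(); // c_TW, keyed by tag + TAB + word
    int total = 0;                                              // count of all tagged tokens

    // Each element is a two-element array: {word, tag}.
    void train(List<String[]> taggedWords) {
        for (String[] pair : taggedWords) {
            String word = pair[0];
            String tag = pair[1];
            wordCounts.merge(word, 1, Integer::sum);
            tagCounts.merge(tag, 1, Integer::sum);
            tagWordCounts.merge(tag + "\t" + word, 1, Integer::sum);
            total++;
        }
    }

    // c_TW for a given tag and word; 0 if the pair was never observed.
    int count(String tag, String word) {
        return tagWordCounts.getOrDefault(tag + "\t" + word, 0);
    }

    public static void main(String[] args) {
        LexiconCountSketch lex = new LexiconCountSketch();
        // Toy romanized tokens standing in for segmented Chinese words.
        lex.train(Arrays.asList(
                new String[] {"wo", "PN"},
                new String[] {"kan", "VV"},
                new String[] {"shu", "NN"},
                new String[] {"shu", "NN"}));
        System.out.println("c_W(shu) = " + lex.wordCounts.get("shu"));
        System.out.println("c_TW(NN, shu) = " + lex.count("NN", "shu"));
        System.out.println("total = " + lex.total);
    }
}
```

The sketch leaves out the held-out pass that the notes describe as counting tags "among new words in 2nd half"; that would need a second set of counters for tokens first encountered late in training.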
score
public float score(IntTaggedWord iTW,
int loc)
- Description copied from class:
BaseLexicon
- Get the score of this word with this tag (as an IntTaggedWord) at this
location. (Presumably an estimate of P(word | tag).)
Implementation documentation:
Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold) = c_TW/c_W
Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
Note that this doesn't really properly reserve mass to unknowns.
Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
pb_W_T = log(P(W|T)) inverted
- Specified by:
score
in interface Lexicon
- Overrides:
score
in class BaseLexicon
- Parameters:
iTW
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation
this is used only for unknown words to change their probability
distribution when sentence initial
- Returns:
- A float score, usually log P(word|tag)
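The arithmetic in the implementation notes for score can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the notes only say "bayes prior smooth", so the additive form (count + strength * prior) / (total + strength) used here is an assumption, and LexiconScoreSketch and its method names are hypothetical. The smart mutation branch is omitted, and for unseen words only the smoothing of P(T|Sig) is shown, since the notes do not spell out how that estimate is inverted.

```java
// Hypothetical illustration only; not part of edu.stanford.nlp.
class LexiconScoreSketch {

    // Assumed form of the "Bayes prior" smoothing:
    // pulls count/total toward the prior, weighted by strength.
    static double bayesSmooth(double count, double total, double prior, double strength) {
        return (count + strength * prior) / (total + strength);
    }

    // Seen word: estimate pb_T_W = P(T|W), then invert with Bayes' rule
    // to obtain pb_W_T = log(pb_T_W * p_W / p_T).
    static double seenLogProb(double cTW, double cW, double cT, double total,
                              double pTU, double threshold, double smooth1) {
        double pbTW = (cW > threshold)
                ? cTW / cW                              // frequent word: plain MLE
                : bayesSmooth(cTW, cW, pTU, smooth1);   // rare word: smooth with p_T_U
        double pT = cT / total;   // Pmle(T)
        double pW = cW / total;   // Pmle(W)
        return Math.log(pbTW * pW / pT);
    }

    // Unseen word: pb_T_S, i.e. Pmle(T|Sig) smoothed toward p_T_U = c_T / c_U.
    static double unseenTagGivenSignature(double cTS, double cS, double cT, double cU,
                                          double smooth0) {
        double pTU = cT / cU;
        return bayesSmooth(cTS, cS, pTU, smooth0);
    }

    public static void main(String[] args) {
        // Toy counts; threshold 10 and smoothing strength 1.0 are arbitrary choices.
        System.out.println(seenLogProb(30, 40, 200, 1000, 0.1, 10, 1.0));
        System.out.println(seenLogProb(1, 2, 200, 1000, 0.1, 10, 1.0));
        System.out.println(unseenTagGivenSignature(3, 5, 20, 100, 1.0));
    }
}
```

Both seen-word results are logarithms of probabilities and therefore non-positive, which is consistent with the pb_W_T = log(...) lines in the notes.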
Stanford NLP Group