public class FactoredLexicon extends BaseLexicon
Fields inherited from class BaseLexicon:
DEBUG_LEXICON, DEBUG_LEXICON_SCORE, flexiTag, NULL_ITW, nullTag, nullWord, op, rulesWithWord, seenCounter, smartMutation, smoothInUnknownsThreshold, tagIndex, tags, testOptions, trainOptions, useSignatureForKnownSmoothing, uwModel, uwModelTrainer, uwModelTrainerClass, wordIndex, words

Fields inherited from interface Lexicon:
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
| Constructor and Description |
|---|
| `FactoredLexicon(MorphoFeatureSpecification morphoSpec, Index<String> wordIndex, Index<String> tagIndex)` |
| `FactoredLexicon(Options op, MorphoFeatureSpecification morphoSpec, Index<String> wordIndex, Index<String> tagIndex)` |
| Modifier and Type | Method and Description |
|---|---|
| `protected void` | `initRulesWithWord()` Rule table is lemmas! |
| `static void` | `main(String[] args)` |
| `Iterator<IntTaggedWord>` | `ruleIteratorByWord(int word, int loc, String featureSpec)` Rule table is lemmas. |
| `float` | `score(IntTaggedWord iTW, int loc, String word, String featureSpec)` Get the score of this word with this tag (as an IntTaggedWord) at this location. |
| `void` | `train(Collection<Tree> trees, Collection<Tree> rawTrees)` This method should populate wordIndex, tagIndex, and morphIndex. |
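The summary notes twice that the rule table is keyed by lemmas rather than inflected surface forms. The idea can be sketched with a minimal stdlib table; the class and method names here are hypothetical illustrations, not part of FactoredLexicon:

```java
import java.util.*;

// Hypothetical sketch: a rule table keyed by lemma, so that different
// inflected forms of one word share a single entry of observed taggings.
public class LemmaRuleTable {
    private final Map<String, List<String>> rulesByLemma = new HashMap<>();

    void addTagging(String lemma, String tag) {
        rulesByLemma.computeIfAbsent(lemma, k -> new ArrayList<>()).add(tag);
    }

    // Analogue of ruleIteratorByWord: iterate the taggings seen for a lemma.
    Iterator<String> ruleIteratorByLemma(String lemma) {
        return rulesByLemma.getOrDefault(lemma, Collections.emptyList()).iterator();
    }

    public static void main(String[] args) {
        LemmaRuleTable t = new LemmaRuleTable();
        // "walks" (VBZ) and "walked" (VBD) both index under the lemma "walk".
        t.addTagging("walk", "VBZ");
        t.addTagging("walk", "VBD");
        Iterator<String> it = t.ruleIteratorByLemma("walk");
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

Keying on lemmas pools counts across morphological variants, which matters for morphologically rich languages where any single surface form is sparse.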
Methods inherited from class BaseLexicon:
addAll, addAll, addTagging, evaluateCoverage, examineIntersection, finishTraining, getBaseTag, getUnknownWordModel, incrementTreesRead, initializeTraining, isKnown, isKnown, listToEvents, numRules, printLexStats, readData, ruleIteratorByWord, ruleIteratorByWord, setUnknownWordModel, tagSet, train, train, train, train, train, trainUnannotated, trainWithExpansion, treeToEvents, tune, writeData
public FactoredLexicon(MorphoFeatureSpecification morphoSpec, Index<String> wordIndex, Index<String> tagIndex)
public Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
Specified by:
ruleIteratorByWord in interface Lexicon
Overrides:
ruleIteratorByWord in class BaseLexicon

Parameters:
word - The word (as an int)
loc - Its index in the sentence (usually only relevant for unknown words)
featureSpec - Additional word features like morphosyntactic information.

public float score(IntTaggedWord iTW, int loc, String word, String featureSpec)
Description copied from class: BaseLexicon

Implementation documentation:

Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W); if c_W > smoothInUnknownsThreshold, pb_T_W = c_TW/c_W; else (if not smart mutation) pb_T_W = Bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T)   [Bayes' rule]

Note that this doesn't really properly reserve mass for unknowns.

Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen)   [smooth[0]]
pb_W_T = log(P(W|T)) inverted
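The documented smoothing math can be sketched in plain Java. This is an illustrative stand-in, not the actual BaseLexicon code: the counts and the `smooth0`/`smooth1` constants are hypothetical substitutes for the lexicon's internal tables, and the Bayes-prior smooth is written in the standard additive form.

```java
// Illustrative sketch of the documented scoring math; all inputs are
// hypothetical stand-ins for BaseLexicon's internal counts.
public class LexiconScoreSketch {

    /** Seen branch: Bayes-prior smoothing of Pmle(T|W) toward p_T_U. */
    static double pbTW(double cTW, double cW, double pTU, double smooth1) {
        return (cTW + smooth1 * pTU) / (cW + smooth1);
    }

    /** pb_W_T = log(pb_T_W * p_W / p_T), i.e. Bayes' rule in log space. */
    static double pbWT(double pbTW, double pW, double pT) {
        return Math.log(pbTW * pW / pT);
    }

    /** Unseen branch: Bayes smooth of Pmle(T|Sig) with P(T|Unseen). */
    static double pbTS(double cTS, double cS, double pTU, double smooth0) {
        return (cTS + smooth0 * pTU) / (cS + smooth0);
    }

    public static void main(String[] args) {
        double cW = 3, cTW = 2;        // word seen 3 times, twice with tag T
        double pTU = 0.1;              // Pmle(T | "unseen")
        double pT = 0.05, pW = 0.0005; // Pmle(T), Pmle(W)
        double p = pbTW(cTW, cW, pTU, 1.0);
        System.out.println(p);               // smoothed P(T|W) = 0.525
        System.out.println(pbWT(p, pW, pT)); // log P(W|T), via Bayes' rule
        System.out.println(pbTS(4, 10, 0.2, 1.0)); // unseen-branch P(T|Sig)
    }
}
```

Note how the seen branch only falls back to the smoothed estimate when c_W is at or below smoothInUnknownsThreshold; above it, the raw relative frequency c_TW/c_W is trusted.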
Specified by:
score in interface Lexicon
Overrides:
score in class BaseLexicon

Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial
word - The word itself; useful so we don't have to look it up in an index
featureSpec - TODO

public void train(Collection<Tree> trees, Collection<Tree> rawTrees)
Specified by:
train in interface Lexicon
Overrides:
train in class BaseLexicon
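The contract stated in the summary, populating wordIndex, tagIndex, and morphIndex from the training trees, amounts to building insertion-ordered string-to-int indices over the tagged yield. A hedged stdlib sketch (the tagged-word list and `index` helper are hypothetical stand-ins for the trees' tagged yield and the `Index<String>` type):

```java
import java.util.*;

// Hypothetical sketch of what train(...) must accomplish: walking the
// training trees' (word, tag) pairs and assigning each distinct string
// a stable integer id, in first-seen order.
public class IndexTrainingSketch {
    static Map<String, Integer> index(List<String> items) {
        Map<String, Integer> idx = new LinkedHashMap<>();
        for (String s : items) idx.putIfAbsent(s, idx.size());
        return idx;
    }

    public static void main(String[] args) {
        // Stand-in for the tagged yield of the training trees.
        List<String[]> taggedWords = Arrays.asList(
            new String[]{"the", "DT"}, new String[]{"dog", "NN"},
            new String[]{"the", "DT"}, new String[]{"barks", "VBZ"});
        List<String> words = new ArrayList<>(), tags = new ArrayList<>();
        for (String[] tw : taggedWords) { words.add(tw[0]); tags.add(tw[1]); }
        System.out.println(index(words)); // {the=0, dog=1, barks=2}
        System.out.println(index(tags));  // {DT=0, NN=1, VBZ=2}
    }
}
```

A stable word-to-int mapping is what lets methods like ruleIteratorByWord and score take the word as an `int` rather than a `String`.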
protected void initRulesWithWord()
Overrides:
initRulesWithWord in class BaseLexicon
public static void main(String[] args)
Parameters:
args -