public class BaseLexicon extends java.lang.Object implements Lexicon
Modifier and Type | Field and Description |
---|---|
protected static boolean | DEBUG_LEXICON |
protected static boolean | DEBUG_LEXICON_SCORE |
protected boolean | flexiTag |
protected static IntTaggedWord | NULL_ITW |
protected static short | nullTag |
protected static int | nullWord |
protected Options | op |
java.util.List<IntTaggedWord>[] | rulesWithWord: An array of Lists of rules (IntTaggedWord), indexed by word. |
ClassicCounter<IntTaggedWord> | seenCounter: Records the number of times word/tag pair was seen in training data. |
protected boolean | smartMutation: Have tags changeable based on statistics on word types having various taggings. |
protected int | smoothInUnknownsThreshold: If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors. |
protected Index<java.lang.String> | tagIndex |
protected java.util.Set<IntTaggedWord> | tags: Set of all tags as IntTaggedWord. |
protected TestOptions | testOptions |
protected TrainOptions | trainOptions |
protected boolean | useSignatureForKnownSmoothing |
protected UnknownWordModel | uwModel |
protected UnknownWordModelTrainer | uwModelTrainer |
protected java.lang.String | uwModelTrainerClass |
protected Index<java.lang.String> | wordIndex |
protected java.util.Set<IntTaggedWord> | words |
Fields inherited from interface Lexicon: BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
Constructor and Description |
---|
BaseLexicon(Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex) |
BaseLexicon(Options op, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex) |
Modifier and Type | Method and Description |
---|---|
void | addAll(java.util.List<TaggedWord> tagWords): Not yet implemented. |
void | addAll(java.util.List<TaggedWord> taggedWords, double weight): Not yet implemented. |
protected void | addTagging(boolean seen, IntTaggedWord itw, double count): Adds the tagging with count to the data structures in this Lexicon. |
double | evaluateCoverage(java.util.Collection<Tree> trees, java.util.Set<java.lang.String> missingWords, java.util.Set<java.lang.String> missingTags, java.util.Set<IntTaggedWord> missingTW): Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. |
protected static void | examineIntersection(java.util.Set<java.lang.String> s1, java.util.Set<java.lang.String> s2) |
void | finishTraining(): Done collecting statistics for the lexicon. |
int | getBaseTag(int tag, TreebankLanguagePack tlp) |
UnknownWordModel | getUnknownWordModel() |
void | incrementTreesRead(double weight): If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens. |
void | initializeTraining(double numTrees): Start training this lexicon on the expected number of trees. |
protected void | initRulesWithWord() |
boolean | isKnown(int word): Checks whether a word is in the lexicon. |
boolean | isKnown(java.lang.String word): Checks whether a word is in the lexicon. |
protected java.util.List<IntTaggedWord> | listToEvents(java.util.List<TaggedWord> taggedWords) |
static void | main(java.lang.String[] args): Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. |
int | numRules(): Returns the number of rules (tag rewrites as word) in the Lexicon. |
void | printLexStats(): Print some statistics about this lexicon. |
void | readData(java.io.BufferedReader in): Populates data in this Lexicon from the character stream given by the reader in. |
java.util.Iterator<IntTaggedWord> | ruleIteratorByWord(int word, int loc, java.lang.String featureSpec): Generate the possible taggings for a word at a sentence position. |
java.util.Iterator<IntTaggedWord> | ruleIteratorByWord(java.lang.String word, int loc): Returns the possible POS taggings for a word. |
java.util.Iterator<IntTaggedWord> | ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec): Same thing, but with a string that needs to be translated by the lexicon's word index. |
float | score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec): Get the score of this word with this tag (as an IntTaggedWord) at this location. |
void | setUnknownWordModel(UnknownWordModel uwm) |
java.util.Set<java.lang.String> | tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction): Return the Set of tags used by this tagger (available after training the tagger). |
void | train(java.util.Collection<Tree> trees): Trains this lexicon on the Collection of trees. |
void | train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees) |
void | train(java.util.Collection<Tree> trees, double weight): Trains this lexicon on the Collection of trees. |
void | train(java.util.List<TaggedWord> sentence, double weight): Not all subclasses support this particular method. |
void | train(TaggedWord tw, int loc, double weight): Not all subclasses support this particular method. |
void | train(Tree tree, double weight) |
void | trainUnannotated(java.util.List<TaggedWord> sentence, double weight): Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree. |
void | trainWithExpansion(java.util.Collection<TaggedWord> taggedWords): Not yet implemented. |
protected java.util.List<IntTaggedWord> | treeToEvents(Tree tree) |
void | tune(): TODO: this used to actually score things based on the original trees |
void | writeData(java.io.Writer w): Writes out data from this Object to the Writer w. |
protected UnknownWordModel uwModel
protected final java.lang.String uwModelTrainerClass
protected transient UnknownWordModelTrainer uwModelTrainer
protected static final boolean DEBUG_LEXICON
protected static final boolean DEBUG_LEXICON_SCORE
protected static final int nullWord
protected static final short nullTag
protected static final IntTaggedWord NULL_ITW
protected final TrainOptions trainOptions
protected final TestOptions testOptions
protected final Options op
protected int smoothInUnknownsThreshold
protected boolean smartMutation
protected final Index<java.lang.String> wordIndex
protected final Index<java.lang.String> tagIndex
public transient java.util.List<IntTaggedWord>[] rulesWithWord
protected transient java.util.Set<IntTaggedWord> tags
protected transient java.util.Set<IntTaggedWord> words
public ClassicCounter<IntTaggedWord> seenCounter
protected boolean flexiTag
protected boolean useSignatureForKnownSmoothing
public boolean isKnown(int word)
public boolean isKnown(java.lang.String word)
public java.util.Set<java.lang.String> tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)
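The signature above takes a function over tag strings. As a rough illustration of how such a function might be used, the sketch below assumes it is applied to every tag before the results are collected into a set. The class name `TagSetSketch`, the helper `stripAnnotation`, and the `^` annotation convention are hypothetical and are not taken from BaseLexicon.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch; not the BaseLexicon implementation. */
final class TagSetSketch {

    /** Maps every tag through basicCategoryFunction and collects the distinct results. */
    static Set<String> tagSet(Collection<String> tags,
                              Function<String, String> basicCategoryFunction) {
        Set<String> result = new HashSet<>();
        for (String tag : tags) {
            result.add(basicCategoryFunction.apply(tag));
        }
        return result;
    }

    /** Assumed annotation convention: everything from the first '^' onward is dropped. */
    static String stripAnnotation(String tag) {
        int i = tag.indexOf('^');
        return i < 0 ? tag : tag.substring(0, i);
    }
}
```

Under these assumptions, annotated variants of the same basic tag collapse into a single entry of the returned set.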
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word, int loc)
Returns the possible POS taggings for a word. (Each tagging can be read as a tag -> word rule.)
Parameters:
word - The word, as a String that is translated to an integer by wordIndex
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
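To make the lookup concrete, here is a minimal sketch of rules stored in lists indexed by word id, with an iterator that ignores the position argument as the implementation note describes. The class `RulesByWordSketch`, its `Tagging` type, and the `addRule` helper are hypothetical stand-ins, not the library's `IntTaggedWord` or `Index` classes. Unknown words simply yield an empty iterator here, whereas the real class has an unknown word model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of word-indexed tagging rules; not the BaseLexicon source. */
final class RulesByWordSketch {

    /** A (word id, tag id) pair, standing in for a tag -> word rule. */
    static final class Tagging {
        final int word;
        final int tag;
        Tagging(int word, int tag) { this.word = word; this.tag = tag; }
    }

    private final List<String> wordIndex = new ArrayList<>();
    private final List<List<Tagging>> rulesWithWord = new ArrayList<>();

    /** Registers a rule, adding the word to the index on first sight. */
    void addRule(String word, int tag) {
        int w = wordIndex.indexOf(word);
        if (w < 0) {
            wordIndex.add(word);
            rulesWithWord.add(new ArrayList<>());
            w = wordIndex.size() - 1;
        }
        rulesWithWord.get(w).add(new Tagging(w, tag));
    }

    boolean isKnown(String word) {
        return wordIndex.contains(word);
    }

    /** loc is accepted but unused, mirroring the implementation note. */
    Iterator<Tagging> ruleIteratorByWord(String word, int loc) {
        int w = wordIndex.indexOf(word);
        if (w < 0) {
            return Collections.emptyIterator(); // assumption: no unknown-word handling
        }
        return rulesWithWord.get(w).iterator();
    }
}
```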
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)
Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).
Specified by: ruleIteratorByWord in interface Lexicon
Parameters:
word - The word (as an int)
loc - Its index in the sentence (usually only relevant for unknown words)
featureSpec - Additional word features like morphosyntactic information.
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)
Same thing, but with a string that needs to be translated by the lexicon's word index.
Specified by: ruleIteratorByWord in interface Lexicon
protected void initRulesWithWord()
protected java.util.List<IntTaggedWord> treeToEvents(Tree tree)
protected java.util.List<IntTaggedWord> listToEvents(java.util.List<TaggedWord> taggedWords)
public void addAll(java.util.List<TaggedWord> tagWords)
public void addAll(java.util.List<TaggedWord> taggedWords, double weight)
public void trainWithExpansion(java.util.Collection<TaggedWord> taggedWords)
public void initializeTraining(double numTrees)
Start training this lexicon on the expected number of trees.
Specified by: initializeTraining in interface Lexicon
public void train(java.util.Collection<Tree> trees)
public void train(java.util.Collection<Tree> trees, double weight)
public final void train(java.util.List<TaggedWord> sentence, double weight)
Not all subclasses support this particular method.
public final void incrementTreesRead(double weight)
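If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
Specified by: incrementTreesRead in interface Lexicon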
public final void trainUnannotated(java.util.List<TaggedWord> sentence, double weight)
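Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
Specified by: trainUnannotated in interface Lexicon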
public void train(TaggedWord tw, int loc, double weight)
Not all subclasses support this particular method.
public void finishTraining()
Done collecting statistics for the lexicon.
Specified by: finishTraining in interface Lexicon
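The training methods above suggest a lifecycle: initialize, feed in weighted observations, optionally bump the tree count, then finish. The sketch below illustrates that order with a weighted word/tag counter loosely modeled on the documented role of seenCounter. The class `TrainingLifecycleSketch`, the string-keyed map, and the `count` accessor are assumptions for illustration only, and the tagged word is passed as two strings rather than a TaggedWord.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the training lifecycle; not the BaseLexicon source. */
final class TrainingLifecycleSketch {

    // Weighted count of each word/tag pair seen in training.
    private final Map<String, Double> seenCounter = new HashMap<>();
    private double treesRead = 0.0;
    private boolean finished = false;

    /** Resets all statistics; numTrees is only a size hint in this sketch. */
    void initializeTraining(double numTrees) {
        seenCounter.clear();
        treesRead = 0.0;
        finished = false;
    }

    /** Adds one tagged word with the given weight. */
    void train(String word, String tag, double weight) {
        if (finished) {
            throw new IllegalStateException("finishTraining() was already called");
        }
        seenCounter.merge(word + "/" + tag, weight, Double::sum);
    }

    /** Used when training per word rather than per tree. */
    void incrementTreesRead(double weight) {
        treesRead += weight;
    }

    void finishTraining() {
        finished = true;
    }

    double count(String word, String tag) {
        return seenCounter.getOrDefault(word + "/" + tag, 0.0);
    }

    double treesRead() {
        return treesRead;
    }
}
```

A caller would invoke `initializeTraining`, then `train` once per tagged word (with `incrementTreesRead` once per sentence), and finally `finishTraining` before querying counts.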
protected void addTagging(boolean seen, IntTaggedWord itw, double count)
public float score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)
Implementation documentation:
Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold) pb_T_W = c_TW/c_W
Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
Note that this doesn't really properly reserve mass to unknowns.
Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
pb_W_T = log(P(W|T)) inverted
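The seen-word case in the implementation note above can be sketched as follows. This is a rough illustration only: the class `SeenWordScoreSketch`, its method names, and the additive form used for the "bayes prior smooth" step are assumptions, since the note does not spell out the smoothing formula. The parameter `smoothWeight` stands in for smooth[1].

```java
/** Hypothetical sketch of the seen-word scoring branch; not the BaseLexicon source. */
final class SeenWordScoreSketch {

    /**
     * P(T|W): the relative frequency c_TW / c_W when the word was seen more than
     * smoothInUnknownsThreshold times; otherwise the estimate is pulled toward the
     * unseen-word tag prior p_T_U (assumed additive smoothing).
     */
    static double pTagGivenWord(double cTW, double cW, double pTU,
                                int smoothInUnknownsThreshold, double smoothWeight) {
        if (cW > smoothInUnknownsThreshold) {
            return cTW / cW;
        }
        return (cTW + smoothWeight * pTU) / (cW + smoothWeight);
    }

    /** log P(W|T) via Bayes' rule: log(P(T|W) * P(W) / P(T)), with MLE P(W) and P(T). */
    static double logPWordGivenTag(double pbTW, double cW, double cT, double total) {
        double pW = cW / total;
        double pT = cT / total;
        return Math.log(pbTW * pW / pT);
    }
}
```

When the relative-frequency branch is taken, the result reduces to log(c_TW / c_T), which is a quick sanity check on the Bayes inversion.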
Specified by: score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial
word - The word itself; useful so we don't have to look it up in an index
featureSpec - TODO
public final void tune()
public void readData(java.io.BufferedReader in) throws java.io.IOException
public void writeData(java.io.Writer w) throws java.io.IOException
public int numRules()
protected static void examineIntersection(java.util.Set<java.lang.String> s1, java.util.Set<java.lang.String> s2)
public void printLexStats()
public double evaluateCoverage(java.util.Collection<Tree> trees, java.util.Set<java.lang.String> missingWords, java.util.Set<java.lang.String> missingTags, java.util.Set<IntTaggedWord> missingTW)
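As an illustration of what a coverage check like this might compute, the sketch below counts how many terminals appear in a set of known words and records the ones that do not. It works on plain strings rather than Tree objects, tracks only missing words, and returns the covered fraction. All of these choices, including the return convention and the class name `CoverageSketch`, are assumptions and may differ from the actual method.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a word-coverage check; not the BaseLexicon source. */
final class CoverageSketch {

    /**
     * Returns the fraction of terminals found in lexiconWords and adds every
     * terminal that is not found to missingWords (an output parameter).
     */
    static double evaluateCoverage(List<String> terminals,
                                   Set<String> lexiconWords,
                                   Set<String> missingWords) {
        if (terminals.isEmpty()) {
            return 0.0; // assumption: nothing to cover counts as zero coverage
        }
        int covered = 0;
        for (String word : terminals) {
            if (lexiconWords.contains(word)) {
                covered++;
            } else {
                missingWords.add(word);
            }
        }
        return (double) covered / terminals.size();
    }
}
```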
public int getBaseTag(int tag, TreebankLanguagePack tlp)
public static void main(java.lang.String[] args)
Parameters:
args - The command line arguments: java BaseLexicon treebankPath fileRange unknownWordModel words*
public UnknownWordModel getUnknownWordModel()
Specified by: getUnknownWordModel in interface Lexicon
public final void setUnknownWordModel(UnknownWordModel uwm)
Specified by: setUnknownWordModel in interface Lexicon