public class ChineseMaxentLexicon extends java.lang.Object implements Lexicon

| Modifier and Type | Field and Description |
|---|---|
| static boolean | fixUnkFunctionWords |
| static boolean | seenTagsOnly |
| CollectionValuedMap<java.lang.String,java.lang.String> | tagsForWord |

Fields inherited from interface Lexicon: BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD

| Constructor and Description |
|---|
| ChineseMaxentLexicon(Options op, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, int featureLevel) |

| Modifier and Type | Method | Description |
|---|---|---|
| void | finishTraining() | Done collecting statistics for the lexicon. |
| UnknownWordModel | getUnknownWordModel() | |
| void | incrementTreesRead(double weight) | If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens. |
| void | initializeTraining(double numTrees) | Start training this lexicon on the expected number of trees. |
| boolean | isKnown(int word) | Checks whether a word is in the lexicon. |
| boolean | isKnown(java.lang.String word) | Checks whether a word is in the lexicon. |
| static void | main(java.lang.String[] args) | |
| int | numRules() | Returns the number of rules (tag rewrites as word) in the Lexicon. |
| void | readData(java.io.BufferedReader in) | Read the lexicon from the BufferedReader in the format written by writeData. |
| java.util.Iterator<IntTaggedWord> | ruleIteratorByWord(int word, int loc, java.lang.String featureSpec) | Get an iterator over all rules (pairs of (word, POS)) for this word. |
| java.util.Iterator<IntTaggedWord> | ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec) | Same thing, but with a string that needs to be translated by the lexicon's word index. |
| float | score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec) | Get the score of this word with this tag (as an IntTaggedWord) at this loc. |
| void | setUnknownWordModel(UnknownWordModel uwm) | |
| java.util.Set<java.lang.String> | tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction) | Return the Set of tags used by this tagger (available after training the tagger). |
| void | train(java.util.Collection<Tree> trees) | Add the given collection of trees to the statistics counted. |
| void | train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees) | |
| void | train(java.util.Collection<Tree> trees, double weight) | Add the given collection of trees to the statistics counted. |
| void | train(java.util.List<TaggedWord> sentence, double weight) | Add the given sentence to the statistics counted. |
| void | train(TaggedWord tw, int loc, double weight) | Not all subclasses support this particular method. |
| void | train(Tree tree, double weight) | Add the given tree to the statistics counted. |
| void | trainUnannotated(java.util.List<TaggedWord> sentence, double weight) | Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree. |
| void | writeData(java.io.Writer w) | Write the lexicon in human-readable format to the Writer. |

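
The tagSet entry above takes a function that maps each tag to a basic category. As a rough illustration of what such a function might do, the sketch below strips annotation suffixes from tags and collects the distinct results. The BasicTags class, the sample tags, and the '-' / '^' suffix convention are assumptions made for this example; none of them is taken from ChineseMaxentLexicon.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical stand-in: not part of the documented class.
final class BasicTags {

  // Assumed convention: everything from the first '-' or '^' onward
  // (ignoring position 0) is an annotation and is dropped.
  static String basicCategory(String tag) {
    for (int i = 1; i < tag.length(); i++) {
      char c = tag.charAt(i);
      if (c == '-' || c == '^') {
        return tag.substring(0, i);
      }
    }
    return tag;
  }

  // Applies the mapping to every tag and keeps the distinct results, sorted.
  static Set<String> tagSet(List<String> tags,
                            Function<String, String> basicCategoryFunction) {
    Set<String> result = new TreeSet<>();
    for (String tag : tags) {
      result.add(basicCategoryFunction.apply(tag));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> annotated = Arrays.asList("NN^NP", "NN-TMP", "VV", "AD^VP");
    System.out.println(tagSet(annotated, BasicTags::basicCategory)); // prints [AD, NN, VV]
  }
}
```

Passing the mapping as a Function keeps the choice of what counts as a basic category with the caller rather than with the lexicon.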
public static final boolean seenTagsOnly

public static final boolean fixUnkFunctionWords

public CollectionValuedMap<java.lang.String,java.lang.String> tagsForWord

public boolean isKnown(int word)
Specified by: isKnown in interface Lexicon

public boolean isKnown(java.lang.String word)
Specified by: isKnown in interface Lexicon

public java.util.Set<java.lang.String> tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)

public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)
Specified by: ruleIteratorByWord in interface Lexicon
Parameters:
word - The word, represented as an integer in Index.
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
featureSpec - Additional word features like morphosyntactic information.
Returns: An iterator over the rules for this word. (Each can be thought of as a tag -> word rule.)

public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)
Specified by: ruleIteratorByWord in interface Lexicon

public int numRules()

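
To make the relationship between the int and String overloads above more concrete, here is a minimal sketch in which the String versions look up an id in a word index and the int versions resolve the id back to a word. ToyLexicon is a hypothetical stand-in built on plain JDK collections: it uses a List in place of Index, a Map of Sets in place of CollectionValuedMap, and an int pair in place of IntTaggedWord. It is not the library's implementation, and it ignores loc and featureSpec.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only.
final class ToyLexicon {
  private final List<String> wordIndex = new ArrayList<>(); // int id -> word
  private final List<String> tagIndex = new ArrayList<>();  // int id -> tag
  private final Map<String, Set<String>> tagsForWord = new LinkedHashMap<>();

  // Adds the item to the index if it is not already there.
  private static void register(List<String> index, String item) {
    if (!index.contains(item)) {
      index.add(item);
    }
  }

  void add(String word, String tag) {
    register(wordIndex, word);
    register(tagIndex, tag);
    tagsForWord.computeIfAbsent(word, k -> new TreeSet<>()).add(tag);
  }

  // Returns -1 for a word that was never added.
  int wordId(String word) { return wordIndex.indexOf(word); }

  boolean isKnown(String word) { return tagsForWord.containsKey(word); }

  boolean isKnown(int word) {
    return word >= 0 && word < wordIndex.size() && isKnown(wordIndex.get(word));
  }

  // Each rule is an int pair {word id, tag id}; loc and featureSpec are unused here.
  Iterator<int[]> ruleIteratorByWord(int word, int loc, String featureSpec) {
    if (!isKnown(word)) {
      return Collections.emptyIterator();
    }
    List<int[]> rules = new ArrayList<>();
    for (String tag : tagsForWord.get(wordIndex.get(word))) {
      rules.add(new int[] {word, tagIndex.indexOf(tag)});
    }
    return rules.iterator();
  }

  // Same thing, but the String is first translated through the word index.
  Iterator<int[]> ruleIteratorByWord(String word, int loc, String featureSpec) {
    return ruleIteratorByWord(wordId(word), loc, featureSpec);
  }

  // Counts one rule per distinct (word, tag) pair.
  int numRules() {
    int n = 0;
    for (Set<String> tags : tagsForWord.values()) {
      n += tags.size();
    }
    return n;
  }

  public static void main(String[] args) {
    ToyLexicon lex = new ToyLexicon();
    lex.add("yanjiu", "NN");
    lex.add("yanjiu", "VV");
    lex.add("de", "DEG");
    System.out.println(lex.isKnown("yanjiu")); // prints true
    System.out.println(lex.numRules());        // prints 3
  }
}
```

In this sketch an unseen word yields an empty iterator; how the real class handles unknown words (for example through its UnknownWordModel) is not shown here.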
public void initializeTraining(double numTrees)
Specified by: initializeTraining in interface Lexicon

public final void train(java.util.Collection<Tree> trees)

public void train(java.util.Collection<Tree> trees, double weight)

public void train(Tree tree, double weight)

public void train(java.util.List<TaggedWord> sentence, double weight)

public void trainUnannotated(java.util.List<TaggedWord> sentence, double weight)
Specified by: trainUnannotated in interface Lexicon

public void incrementTreesRead(double weight)
Specified by: incrementTreesRead in interface Lexicon

public void train(TaggedWord tw, int loc, double weight)
Specified by: train in interface Lexicon

public void finishTraining()
Specified by: finishTraining in interface Lexicon

public static void main(java.lang.String[] args)

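
The method descriptions suggest a call order for training: initializeTraining first, then one or more weighted train calls, then finishTraining once all statistics are collected. The sketch below shows only that calling sequence and the effect of the weight argument. ToyTrainer is a hypothetical stand-in, and it deliberately swaps in weighted relative-frequency counting; it does not estimate a maximum entropy model and says nothing about how ChineseMaxentLexicon computes its parameters.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in showing the call order of the training methods.
// It uses weighted relative-frequency counts, not a maximum entropy model.
final class ToyTrainer {
  private final Map<String, Map<String, Double>> counts = new HashMap<>();
  private final Map<String, Map<String, Double>> probs = new HashMap<>();
  private double treesRead = 0.0;
  private boolean training = false;

  // numTrees is the expected number of trees; this sketch does not need it.
  void initializeTraining(double numTrees) {
    counts.clear();
    probs.clear();
    treesRead = 0.0;
    training = true;
  }

  void incrementTreesRead(double weight) { treesRead += weight; }

  // A sentence is a list of {word, tag} pairs; each pair counts 'weight' times.
  void train(List<String[]> sentence, double weight) {
    if (!training) {
      throw new IllegalStateException("call initializeTraining first");
    }
    for (String[] wt : sentence) {
      counts.computeIfAbsent(wt[0], k -> new HashMap<>())
            .merge(wt[1], weight, Double::sum);
    }
    incrementTreesRead(weight);
  }

  // Normalizes the counts into P(tag | word) and closes the training phase.
  void finishTraining() {
    for (Map.Entry<String, Map<String, Double>> e : counts.entrySet()) {
      double total = 0.0;
      for (double c : e.getValue().values()) {
        total += c;
      }
      Map<String, Double> dist = new HashMap<>();
      for (Map.Entry<String, Double> t : e.getValue().entrySet()) {
        dist.put(t.getKey(), t.getValue() / total);
      }
      probs.put(e.getKey(), dist);
    }
    training = false;
  }

  // Returns 0.0 for any (word, tag) pair that was never seen.
  double prob(String word, String tag) {
    Map<String, Double> dist = probs.get(word);
    if (dist == null) {
      return 0.0;
    }
    Double p = dist.get(tag);
    return p == null ? 0.0 : p;
  }

  double treesRead() { return treesRead; }
}
```

With one sentence tagging a word NN at weight 1.0 and another tagging it VV at weight 3.0, this toy model assigns the word P(VV) = 0.75 and P(NN) = 0.25, which is the kind of reweighting the weight parameter makes possible.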
public float score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)
Specified by: score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag.
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
word - The word itself; useful so we don't have to look it up in an index.
featureSpec - TODO

public void writeData(java.io.Writer w) throws java.io.IOException
Specified by: writeData in interface Lexicon

public void readData(java.io.BufferedReader in) throws java.io.IOException
Specified by: readData in interface Lexicon

public UnknownWordModel getUnknownWordModel()
Specified by: getUnknownWordModel in interface Lexicon

public void setUnknownWordModel(UnknownWordModel uwm)
Specified by: setUnknownWordModel in interface Lexicon
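
Since readData is described as reading the format that writeData produces, the two methods should round-trip. The sketch below shows that contract with a StringWriter and a BufferedReader over a StringReader. ToyLexiconIO and its tab-separated "word, tag, count" line format are invented for this example; the actual on-disk format of ChineseMaxentLexicon is not documented on this page and is not reproduced here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: the line format below is an assumption of this sketch.
final class ToyLexiconIO {
  // Key is word + '\t' + tag; value is the count for that pair.
  final Map<String, Double> counts = new TreeMap<>();

  // Writes one "word<TAB>tag<TAB>count" line per entry.
  void writeData(Writer w) throws IOException {
    for (Map.Entry<String, Double> e : counts.entrySet()) {
      w.write(e.getKey() + "\t" + e.getValue() + "\n");
    }
    w.flush();
  }

  // Replaces the current contents with the lines read from 'in'.
  void readData(BufferedReader in) throws IOException {
    counts.clear();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.isEmpty()) {
        continue;
      }
      String[] f = line.split("\t");
      counts.put(f[0] + "\t" + f[1], Double.parseDouble(f[2]));
    }
  }

  // Serializes one lexicon to a string and loads it into a fresh one.
  static ToyLexiconIO roundTrip(ToyLexiconIO original) {
    try {
      StringWriter buffer = new StringWriter();
      original.writeData(buffer);
      ToyLexiconIO copy = new ToyLexiconIO();
      copy.readData(new BufferedReader(new StringReader(buffer.toString())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ToyLexiconIO lex = new ToyLexiconIO();
    lex.counts.put("yanjiu\tNN", 2.0);
    lex.counts.put("yanjiu\tVV", 1.0);
    System.out.println(roundTrip(lex).counts.equals(lex.counts)); // prints true
  }
}
```

A round-trip check of this kind is a cheap way to confirm that a saved lexicon can be reloaded without loss, whatever the real format turns out to be.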