public class ChineseCharacterBasedLexicon extends Object implements Lexicon
Fields inherited from interface Lexicon: BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
Constructor and Description |
---|
ChineseCharacterBasedLexicon(ChineseTreebankParserParams params, Index<String> wordIndex, Index<String> tagIndex) |
Modifier and Type | Method and Description |
---|---|
void | finishTraining() - Done collecting statistics for the lexicon. |
Distribution<String> | getPOSDistribution() |
UnknownWordModel | getUnknownWordModel() |
void | incrementTreesRead(double weight) - If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens. |
void | initializeTraining(double numTrees) - Start training this lexicon on the expected number of trees. |
static boolean | isForeign(String s) |
boolean | isKnown(int word) - Checks whether a word is in the lexicon. |
boolean | isKnown(String word) - Checks whether a word is in the lexicon. |
int | numRules() - Returns the number of rules (tag rewrites as word) in the Lexicon. |
void | readData(BufferedReader in) - Read the lexicon from the BufferedReader in the format written by writeData. |
Iterator<IntTaggedWord> | ruleIteratorByWord(int word, int loc, String featureSpec) - Get an iterator over all rules (pairs of (word, POS)) for this word. |
Iterator<IntTaggedWord> | ruleIteratorByWord(String word, int loc, String featureSpec) - Same as ruleIteratorByWord(int, int, String), but the word is given as a String that is translated through the lexicon's word index. |
String | sampleFrom() - Samples over words regardless of POS: first samples a POS, then samples a word according to that POS. |
String | sampleFrom(String tag) - Samples from the distribution over words with this POS according to the lexicon. |
float | score(IntTaggedWord iTW, int loc, String word, String featureSpec) - Get the score of this word with this tag (as an IntTaggedWord) at this loc. |
void | setUnknownWordModel(UnknownWordModel uwm) |
Set<String> | tagSet(java.util.function.Function<String,String> basicCategoryFunction) - Return the Set of tags used by this tagger (available after training the tagger). |
void | train(Collection<Tree> trees) - Train this lexicon on the given set of trees. |
void | train(Collection<Tree> trees, Collection<Tree> rawTrees) |
void | train(Collection<Tree> trees, double weight) - Train this lexicon on the given set of trees. |
void | train(List<TaggedWord> sentence, double weight) - Not all subclasses support this particular method. |
void | train(TaggedWord tw, int loc, double weight) - Not all subclasses support this particular method. |
void | train(Tree tree, double weight) - TODO: make this method do something with the weight |
void | trainUnannotated(List<TaggedWord> sentence, double weight) - Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree. |
void | writeData(Writer w) - Write the lexicon in human-readable format to the Writer. |
public ChineseCharacterBasedLexicon(ChineseTreebankParserParams params, Index<String> wordIndex, Index<String> tagIndex)
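public void initializeTraining(double numTrees)
Description copied from interface: Lexicon
Start training this lexicon on the expected number of trees.
Specified by: initializeTraining in interface Lexicon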
public void train(Collection<Tree> trees)
public void train(Collection<Tree> trees, double weight)
public void train(Tree tree, double weight)
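public void trainUnannotated(List<TaggedWord> sentence, double weight)
Description copied from interface: Lexicon
Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
Specified by: trainUnannotated in interface Lexicon
public void incrementTreesRead(double weight)
Description copied from interface: Lexicon
If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
Specified by: incrementTreesRead in interface Lexicon
public void train(TaggedWord tw, int loc, double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method.
public void train(List<TaggedWord> sentence, double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method.
public void finishTraining()
Description copied from interface: Lexicon
Done collecting statistics for the lexicon.
Specified by: finishTraining in interface Lexicon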
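The training methods above imply a call order: initializeTraining, then one or more train or trainUnannotated calls (with incrementTreesRead when training per word), then finishTraining. The following is only a minimal sketch of that lifecycle. ToyLexicon, its count map, and the String[] {word, tag} representation are hypothetical stand-ins for illustration and are not the Stanford implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in that mimics the training lifecycle; not the real class. */
class ToyLexicon {
    private final Map<String, Double> counts = new HashMap<>();
    private double treesRead = 0.0;
    private boolean finished = false;

    /** Mirrors initializeTraining(double numTrees): reset all statistics. */
    void initializeTraining(double numTrees) {
        counts.clear();
        treesRead = 0.0;
        finished = false;
    }

    /** Mirrors trainUnannotated(sentence, weight): add weighted word/tag counts. */
    void trainUnannotated(List<String[]> sentence, double weight) {
        if (finished) {
            throw new IllegalStateException("training already finished");
        }
        for (String[] wordAndTag : sentence) {
            // Assumed layout: wordAndTag[0] is the word, wordAndTag[1] is the tag.
            counts.merge(wordAndTag[0] + "/" + wordAndTag[1], weight, Double::sum);
        }
    }

    /** Mirrors incrementTreesRead(double weight) for per-word training. */
    void incrementTreesRead(double weight) {
        treesRead += weight;
    }

    /** Mirrors finishTraining(): no more statistics are collected afterwards. */
    void finishTraining() {
        finished = true;
    }

    double count(String word, String tag) {
        return counts.getOrDefault(word + "/" + tag, 0.0);
    }

    double treesRead() {
        return treesRead;
    }
}
```

In this sketch, calling trainUnannotated after finishTraining throws, which is one way to make the expected ordering explicit. Whether the real class enforces the order is not stated in this documentation.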
public Distribution<String> getPOSDistribution()
public static boolean isForeign(String s)
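public float score(IntTaggedWord iTW, int loc, String word, String featureSpec)
Description copied from interface: Lexicon
Get the score of this word with this tag (as an IntTaggedWord) at this loc.
Specified by: score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words, to change their probability distribution when the word is sentence-initial.
word - The word itself; useful so we don't have to look it up in an index
featureSpec - TODO
public String sampleFrom(String tag)
Samples from the distribution over words with this POS according to the lexicon.
Parameters:
tag - the POS of the word to sample
public String sampleFrom()
Samples over words regardless of POS: first samples a POS, then samples a word according to that POS.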
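The two-stage procedure described for sampleFrom() (sample a POS, then sample a word given that POS) can be sketched as follows. This is an illustration under stated assumptions, not the class's code: TwoStageSampler, the weight maps, and the explicit Random argument are all hypothetical, and the real lexicon draws from its own trained distributions.

```java
import java.util.Map;
import java.util.Random;

/** Hypothetical two-stage sampler: tag first, then word given tag. */
class TwoStageSampler {

    /** Draws one key with probability proportional to its (non-negative) weight. */
    static String sample(Map<String, Double> weights, Random rng) {
        double total = 0.0;
        for (double w : weights.values()) {
            total += w;
        }
        double r = rng.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r < 0.0) {
                return last;
            }
        }
        return last; // fallback for floating-point round-off; null if the map is empty
    }

    /** Stage 1: sample a tag. Stage 2: sample a word from that tag's distribution. */
    static String sampleWord(Map<String, Double> tagWeights,
                             Map<String, Map<String, Double>> wordsByTag,
                             Random rng) {
        String tag = sample(tagWeights, rng);
        return sample(wordsByTag.get(tag), rng);
    }
}
```

Calling sample directly on one tag's word map corresponds to sampleFrom(String tag), while sampleWord corresponds to the no-argument sampleFrom().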
public Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
Description copied from interface: Lexicon
Get an iterator over all rules (pairs of (word, POS)) for this word.
Specified by: ruleIteratorByWord in interface Lexicon
Parameters:
word - The word, represented as an integer in Index
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
featureSpec - Additional word features like morphosyntactic information.
Returns: An iterator over the rules for this word. (Each can be thought of as a tag -> word rule.)
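public Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc, String featureSpec)
Description copied from interface: Lexicon
Same as ruleIteratorByWord(int, int, String), but the word is given as a String that is translated through the lexicon's word index.
Specified by: ruleIteratorByWord in interface Lexicon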
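The relationship between the two ruleIteratorByWord overloads can be sketched as below: the String overload looks the word up in a word index and delegates to the int overload. ToyRuleTable, its list-based index, and the int[] {word, tag} pair (standing in for IntTaggedWord) are assumptions made for illustration. How the real class treats words missing from the index is not documented here; the sketch simply returns an empty iterator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical rule table; int[] {word, tag} stands in for IntTaggedWord. */
class ToyRuleTable {
    private final List<String> wordIndex = new ArrayList<>();
    private final Map<Integer, List<int[]>> rulesByWord = new HashMap<>();

    /** Returns the word's integer id, adding it to the index if requested. */
    int indexOf(String word, boolean add) {
        int i = wordIndex.indexOf(word);
        if (i < 0 && add) {
            wordIndex.add(word);
            i = wordIndex.size() - 1;
        }
        return i;
    }

    /** Records one tag -> word rule. */
    void addRule(String word, int tag) {
        int w = indexOf(word, true);
        rulesByWord.computeIfAbsent(w, k -> new ArrayList<>()).add(new int[] {w, tag});
    }

    /** Int overload: iterate over all (word, tag) rules for this word id. */
    Iterator<int[]> ruleIteratorByWord(int word) {
        return rulesByWord.getOrDefault(word, Collections.<int[]>emptyList()).iterator();
    }

    /** String overload: translate through the word index, then delegate. */
    Iterator<int[]> ruleIteratorByWord(String word) {
        return ruleIteratorByWord(indexOf(word, false));
    }
}
```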
public int numRules()
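public void readData(BufferedReader in) throws IOException
Description copied from interface: Lexicon
Read the lexicon from the BufferedReader in the format written by writeData.
Specified by: readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
IOException - If any I/O problem occurs
public void writeData(Writer w) throws IOException
Description copied from interface: Lexicon
Write the lexicon in human-readable format to the Writer.
Specified by: writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
IOException - If any I/O problem occurs
public boolean isKnown(int word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.
public boolean isKnown(String word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.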
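The contract between readData and writeData above is that one reads exactly what the other writes. A minimal sketch of such a round trip is shown below. The on-disk format of the real lexicon is not given in this documentation, so the tab-separated "key, count" line format, the ToyLexiconIO class, and the roundTrip helper are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical text serialization: one "key<TAB>count" entry per line. */
class ToyLexiconIO {

    /** Writes each entry on its own line in a human-readable form. */
    static void writeData(Map<String, Double> counts, Writer w) throws IOException {
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }
        w.flush();
    }

    /** Reads entries back in the format produced by writeData. */
    static Map<String, Double> readData(BufferedReader in) throws IOException {
        Map<String, Double> counts = new TreeMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.lastIndexOf('\t'); // the count follows the last tab
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, tab), Double.parseDouble(line.substring(tab + 1)));
        }
        return counts;
    }

    /** Convenience helper: write to memory, then read the result back. */
    static Map<String, Double> roundTrip(Map<String, Double> counts) {
        try {
            StringWriter out = new StringWriter();
            writeData(counts, out);
            return readData(new BufferedReader(new StringReader(out.toString())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Splitting on the last tab lets a key itself contain a tab (for example a tag and a word), which is a design choice of this sketch only.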
public Set<String> tagSet(java.util.function.Function<String,String> basicCategoryFunction)
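One plausible reading of the basicCategoryFunction parameter is that it maps each (possibly annotated) tag to a basic category before the tags are collected into a set. The sketch below illustrates that reading only; it is an assumption, not confirmed behavior of this class. ToyTagSet, the explicit tag list argument, and the "-suffix" annotation format used in the usage note are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical helper: collapse annotated tags into basic categories. */
class ToyTagSet {

    /** Applies basicCategoryFunction to every tag and collects the distinct results. */
    static Set<String> tagSet(List<String> tags,
                              Function<String, String> basicCategoryFunction) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            result.add(basicCategoryFunction.apply(tag));
        }
        return result;
    }
}
```

For example, passing a function that keeps only the text before the first "-" would merge tags such as NN-1 and NN-2 into a single NN entry, while Function.identity() would return the tags unchanged.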
public UnknownWordModel getUnknownWordModel()
Specified by: getUnknownWordModel in interface Lexicon
public void setUnknownWordModel(UnknownWordModel uwm)
Specified by: setUnknownWordModel in interface Lexicon
public void train(Collection<Tree> trees, Collection<Tree> rawTrees)