java.lang.Object
  edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
public class BaseUnknownWordModel
An unknown word model for a generic language. It was originally designed for German and was changed only to remove German-specific numeric features. It models unknown words based on their prefixes and suffixes, as well as capital letters.
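To make the idea of modeling a word by its prefix, suffix, and capitalization concrete, here is a minimal sketch. It is not the class's actual `getSignature` logic: the class name `SignatureSketch`, the `UNK`/`CAP`/`P:`/`S:` markers, and the `affixLength` parameter are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical illustration only; not the Stanford implementation.
final class SignatureSketch {
    private SignatureSketch() {}

    /**
     * Builds a coarse equivalence class for a word from three cues:
     * initial capitalization, a lowercased prefix, and a lowercased suffix.
     */
    static String signature(String word, int affixLength) {
        if (word.isEmpty()) {
            return "UNK";
        }
        StringBuilder sb = new StringBuilder("UNK");
        if (Character.isUpperCase(word.charAt(0))) {
            sb.append("-CAP");
        }
        String lower = word.toLowerCase(Locale.ROOT);
        // Never take more characters than the word has.
        int n = Math.min(affixLength, lower.length());
        sb.append("-P:").append(lower, 0, n);
        sb.append("-S:").append(lower.substring(lower.length() - n));
        return sb.toString();
    }
}
```

Words that share a signature are then treated as interchangeable when estimating tag probabilities, which is what lets a model score a word it never saw in training.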
Field Summary | |
---|---|
protected static IntTaggedWord | NULL_ITW |
protected static short | nullTag |
protected static int | nullWord |
protected HashMap<Label,ClassicCounter<String>> | tagHash - This maps from a tag (as a label) to a Counter from word signatures to their P(sig|tag), as estimated in the model. |
protected Index<String> | tagIndex |
protected TrainOptions | trainOptions |
protected static String | unknown |
protected int | unknownLevel - What type of equivalence classing is done in getSignature |
protected ClassicCounter<IntTaggedWord> | unSeenCounter - Has counts for taggings in terms of unseen signatures. |
protected boolean | useFirst |
protected boolean | useGT |
protected static boolean | VERBOSE |
protected Index<String> | wordIndex |
Constructor Summary | |
---|---|
BaseUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex) |
Method Summary | |
---|---|
void | addTagging(boolean seen, IntTaggedWord itw, double count) - Adds the tagging with count to the data structures in this Lexicon. |
Lexicon | getLexicon() - Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example. |
String | getSignature(String word, int loc) - Signature for a specific word; loc parameter is ignored. |
int | getSignatureIndex(int wordIndex, int sentencePosition, String word) |
int | getUnknownLevel() - Get the level of equivalence classing for the model. |
float | score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth, String word) - Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them. |
float | score(IntTaggedWord itw, String word) |
protected float | scoreGT(Label tag) |
double | scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word) - Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown) |
void | setUnknownLevel(int unknownLevel) - One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. |
void | train(Collection<Tree> trees) - Trains the end-character based unknown word model. |
protected void | trainUnknownGT(Collection<Tree> trees) - Trains Good-Turing estimation of unknown words. |
Counter<IntTaggedWord> | unSeenCounter() |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final boolean VERBOSE
protected boolean useFirst
protected boolean useGT
protected int unknownLevel
protected static final String unknown
protected static final int nullWord
protected static final short nullTag
protected static final IntTaggedWord NULL_ITW
protected final TrainOptions trainOptions
protected final Index<String> wordIndex
protected final Index<String> tagIndex
protected ClassicCounter<IntTaggedWord> unSeenCounter
protected HashMap<Label,ClassicCounter<String>> tagHash
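The `tagHash` field is described as a map from a tag to a counter of word signatures holding P(sig|tag). The sketch below shows one way such a nested structure could work, using only `java.util` types. The class `TagSignatureTable` and its methods are hypothetical, and unlike the field described above it stores raw counts and normalizes on lookup rather than storing probabilities.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a tag -> (signature -> count) table.
final class TagSignatureTable {
    private final Map<String, Map<String, Double>> counts = new HashMap<>();

    /** Adds weight for a (tag, signature) observation. */
    void add(String tag, String signature, double count) {
        counts.computeIfAbsent(tag, t -> new HashMap<>())
              .merge(signature, count, Double::sum);
    }

    /** Relative-frequency estimate of P(signature | tag); 0 for an unknown tag. */
    double probSigGivenTag(String signature, String tag) {
        Map<String, Double> bySignature = counts.get(tag);
        if (bySignature == null) {
            return 0.0;
        }
        double total = 0.0;
        for (double c : bySignature.values()) {
            total += c;
        }
        return total == 0.0 ? 0.0 : bySignature.getOrDefault(signature, 0.0) / total;
    }
}
```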
Constructor Detail |
---|
public BaseUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)
Method Detail |
---|
public float score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth, String word)
Specified by: score in interface UnknownWordModel
Parameters:
itw - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
word - The word itself; useful so we don't look it up in the index
public float score(IntTaggedWord itw, String word)
public double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word)
Specified by: scoreProbTagGivenWordSignature in interface UnknownWordModel
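The method summary describes scoreProbTagGivenWordSignature as computing P(Tag|Signature) with Bayesian smoothing via P(Tag|Unknown). A common form of that kind of estimate is sketched below. It assumes the formula (c(T,S) + smooth * P(T|U)) / (c(S) + smooth); this page does not confirm that the class uses exactly this expression, and the class and parameter names are hypothetical.

```java
// Hypothetical illustration of smoothing toward a prior; not the library's code.
final class SmoothingSketch {
    private SmoothingSketch() {}

    /**
     * @param countTagSig      count of the signature seen with this tag
     * @param countSig         count of the signature over all tags
     * @param probTagGivenUnk  prior P(tag | unknown word)
     * @param smooth           weight placed on the prior
     */
    static double probTagGivenSignature(double countTagSig, double countSig,
                                        double probTagGivenUnk, double smooth) {
        double denominator = countSig + smooth;
        if (denominator <= 0.0) {
            throw new IllegalArgumentException("countSig + smooth must be positive");
        }
        return (countTagSig + smooth * probTagGivenUnk) / denominator;
    }
}
```

With this form, a signature that was never observed (both counts zero) falls back to the prior P(tag|unknown), and the observed counts dominate as they grow relative to `smooth`.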
protected float scoreGT(Label tag)
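scoreGT and trainUnknownGT refer to Good-Turing estimation of unknown words. As background, the textbook Good-Turing idea is to estimate the probability mass of unseen events from the events seen exactly once. The sketch below applies that per tag, as N1(tag) / N(tag). It is an assumption about the general technique, not a description of what scoreGT returns, and `GoodTuringSketch` and its input format are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a per-tag Good-Turing unseen-mass estimate.
final class GoodTuringSketch {
    private GoodTuringSketch() {}

    /**
     * @param taggedTokens each element is a two-element array {word, tag}
     * @return for each tag, (word types seen once with the tag) / (tokens with the tag)
     */
    static Map<String, Double> unseenMassByTag(List<String[]> taggedTokens) {
        Map<String, Map<String, Integer>> wordCountsByTag = new HashMap<>();
        for (String[] wordAndTag : taggedTokens) {
            wordCountsByTag.computeIfAbsent(wordAndTag[1], t -> new HashMap<>())
                           .merge(wordAndTag[0], 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : wordCountsByTag.entrySet()) {
            int totalTokens = 0;
            int seenOnce = 0;
            for (int count : entry.getValue().values()) {
                totalTokens += count;
                if (count == 1) {
                    seenOnce++;
                }
            }
            result.put(entry.getKey(), (double) seenOnce / totalTokens);
        }
        return result;
    }
}
```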
public String getSignature(String word, int loc)
Specified by: getSignature in interface UnknownWordModel
Parameters:
word - The word
loc - Its sentence position
public int getSignatureIndex(int wordIndex, int sentencePosition, String word)
Specified by: getSignatureIndex in interface UnknownWordModel
public void train(Collection<Tree> trees)
Specified by: train in interface UnknownWordModel
Parameters:
trees - the collection of trees to be trained over
protected void trainUnknownGT(Collection<Tree> trees)
Parameters:
trees - Trees to train model from
public Lexicon getLexicon()
Specified by: getLexicon in interface UnknownWordModel
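public int getUnknownLevel()
Description copied from interface: UnknownWordModel
Get the level of equivalence classing for the model.
Specified by: getUnknownLevel in interface UnknownWordModel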
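public void setUnknownLevel(int unknownLevel)
Description copied from interface: UnknownWordModel
One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class.
Specified by: setUnknownLevel in interface UnknownWordModel
Parameters:
unknownLevel - Provides a choice between different unknown word processing schemes
public void addTagging(boolean seen, IntTaggedWord itw, double count)
Specified by: addTagging in interface UnknownWordModel
Parameters:
seen - Whether tagging is seen
itw - The tagging
count - Its weight
public Counter<IntTaggedWord> unSeenCounter()
Specified by: unSeenCounter in interface UnknownWordModel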