edu.stanford.nlp.parser.lexparser
Interface UnknownWordModel

All Superinterfaces:
Serializable
All Known Implementing Classes:
ArabicUnknownWordModel, BaseUnknownWordModel, ChineseUnknownWordModel, EnglishUnknownWordModel, FrenchUnknownWordModel, GermanUnknownWordModel

public interface UnknownWordModel
extends Serializable


Method Summary
 void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this unknown word model.
 Lexicon getLexicon()
          Returns the lexicon used by this unknown word model. The lexicon is used to check whether words are seen or unseen.
 String getSignature(String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
 int getSignatureIndex(int wordIndex, int sentencePosition, String word)
           
 int getUnknownLevel()
          Get the level of equivalence classing for the model.
 float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word)
          Calculates P(Tag|Signature), using Bayesian smoothing with only P(Tag|Unknown) as the prior.
 void setUnknownLevel(int unknownLevel)
          One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class.
 void train(Collection<Tree> trees)
          Trains this unknown word model on the Collection of trees.
 Counter<IntTaggedWord> unSeenCounter()
           
 

Method Detail

setUnknownLevel

void setUnknownLevel(int unknownLevel)
One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level selects which of these models to use. The effect of the level varies with the implementing class. If a given class includes only one model, setting the unknown level should have no effect.

Parameters:
unknownLevel - Provides a choice between different unknown word processing schemes
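
The contract described above can be illustrated with a small self-contained sketch. Everything in it is hypothetical: the class names UnknownLevelSketch, TwoSchemeModel and SingleSchemeModel, the nested Leveled interface, and the meaning assigned to levels 0 and 1 are assumptions made for this example and are not taken from any implementing class.

```java
public class UnknownLevelSketch {

    // Hypothetical, cut-down stand-in for the two methods relevant here.
    interface Leveled {
        void setUnknownLevel(int unknownLevel);
        String signature(String word);
    }

    // Hypothetical class bundling two schemes:
    // level 0 lumps all unknown words together, level 1 or higher also marks capitalization.
    static class TwoSchemeModel implements Leveled {
        private int level = 0;

        public void setUnknownLevel(int unknownLevel) {
            this.level = unknownLevel;
        }

        public String signature(String word) {
            boolean capitalized = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
            if (level >= 1 && capitalized) {
                return "UNK-CAPS";
            }
            return "UNK";
        }
    }

    // Hypothetical class with a single scheme: the level is accepted but changes nothing.
    static class SingleSchemeModel implements Leveled {
        public void setUnknownLevel(int unknownLevel) {
            // Intentionally a no-op: there is only one scheme to choose from.
        }

        public String signature(String word) {
            return "UNK";
        }
    }

    public static void main(String[] args) {
        Leveled two = new TwoSchemeModel();
        Leveled one = new SingleSchemeModel();
        two.setUnknownLevel(1);
        one.setUnknownLevel(1);
        System.out.println(two.signature("Zorb")); // level affects the result
        System.out.println(one.signature("Zorb")); // level has no effect
    }
}
```

In this sketch the caller can set a level on either object without knowing how many schemes it contains, which is the point of the "should have no effect" wording.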

getUnknownLevel

int getUnknownLevel()
Get the level of equivalence classing for the model.

Returns:
The current level of unknown word equivalence classing

getLexicon

Lexicon getLexicon()
Returns the lexicon used by this unknown word model. The lexicon is used to check whether words are seen or unseen.

Returns:
The lexicon used by this unknown word model

train

void train(Collection<Tree> trees)
Trains this unknown word model on the Collection of trees.

Parameters:
trees - The trees to train on

score

float score(IntTaggedWord iTW,
            int loc,
            double c_Tseen,
            double total,
            double smooth,
            String word)
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag), usually calculated as P(signature | tag).) Assumes the word is unknown.

Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
word - The word itself; useful so we don't look it up in the index
Returns:
A float-valued score, usually - log P(word|tag)
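
One way an estimate of P(word | tag) could be obtained from P(tag | signature) is Bayes' rule, using the c_Tseen and total parameters to estimate P(tag) and treating the unknown word as a single token out of total. The sketch below works under exactly those assumptions. The class name UnknownScoreSketch, the method logProbWordGivenTag and its pTagGivenSig argument are hypothetical, and the formula is not claimed to be what any implementing class computes.

```java
public class UnknownScoreSketch {

    /**
     * Hypothetical helper: log P(word | tag) by Bayes' rule,
     * P(word | tag) = P(tag | signature) * P(word) / P(tag).
     *
     * pTagGivenSig  an already-smoothed estimate of P(tag | signature)
     * cTseen        total count of the tag on seen words in training
     * total         total count of word tokens in training
     */
    static float logProbWordGivenTag(double pTagGivenSig, double cTseen, double total) {
        double pTag = cTseen / total;
        // Assumption: an unknown word is given the mass of one token.
        double pWord = 1.0 / total;
        return (float) Math.log(pTagGivenSig * pWord / pTag);
    }

    public static void main(String[] args) {
        // Example numbers are made up for illustration.
        float score = logProbWordGivenTag(0.25, 500.0, 10000.0);
        System.out.println(score);
    }
}
```

Because total cancels out of the ratio, the result here depends only on P(tag | signature) and c_Tseen; the sketch keeps both terms explicit so the Bayes' rule structure stays visible.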

scoreProbTagGivenWordSignature

double scoreProbTagGivenWordSignature(IntTaggedWord iTW,
                                      int loc,
                                      double smooth,
                                      String word)
Calculates P(Tag|Signature), using Bayesian smoothing with only P(Tag|Unknown) as the prior.
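
A common form of this kind of smoothing adds smooth pseudo-counts distributed according to the prior P(Tag|Unknown). The sketch below shows that form as an assumption; this page does not state the exact formula. The class name TagGivenSignatureSketch, the method probTagGivenSignature and its count arguments are hypothetical.

```java
public class TagGivenSignatureSketch {

    /**
     * Hypothetical helper: smoothed estimate of P(tag | signature).
     *
     * cTagSig   count of the (tag, signature) pair
     * cSig      count of the signature over all tags
     * pTagUnk   prior P(tag | unknown)
     * smooth    weight on the prior, in pseudo-counts (must be > 0)
     */
    static double probTagGivenSignature(double cTagSig, double cSig, double pTagUnk, double smooth) {
        return (cTagSig + smooth * pTagUnk) / (cSig + smooth);
    }

    public static void main(String[] args) {
        // Made-up counts: the signature was observed 10 times, 3 of them with this tag.
        System.out.println(probTagGivenSignature(3.0, 10.0, 0.2, 1.0));
        // A signature never observed falls back to the prior.
        System.out.println(probTagGivenSignature(0.0, 0.0, 0.2, 1.0));
    }
}
```

With this form, a larger smooth pulls the estimate toward P(Tag|Unknown), and a signature with no observations returns the prior itself.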


getSignature

String getSignature(String word,
                    int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK or UNK-.*, which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class.

Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
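
As a rough illustration of the idea, the sketch below builds a signature from a few surface features. The class name SignatureSketch, the method signatureOf, and the particular suffixes (-NUM, -INITC, -CAPS, -s) are assumptions made for this example; they are not the features or strings used by any of the implementing classes.

```java
public class SignatureSketch {

    // Hypothetical helper: maps a word to a coarse equivalence class string.
    static String signatureOf(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }

        boolean hasDigit = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) {
                hasDigit = true;
                break;
            }
        }

        if (hasDigit) {
            sb.append("-NUM");
        } else if (Character.isUpperCase(word.charAt(0))) {
            // Sentence-initial capitalization is weaker evidence than capitalization elsewhere.
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        }

        if (word.length() > 2 && word.endsWith("s")) {
            sb.append("-s");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signatureOf("Zorbs", 0)); // UNK-INITC-s
        System.out.println(signatureOf("Zorbs", 3)); // UNK-CAPS-s
        System.out.println(signatureOf("1984", 2));  // UNK-NUM
        System.out.println(signatureOf("blick", 1)); // UNK
    }
}
```

Every string produced starts with UNK, so it follows the UNK or UNK-.* convention, and the loc argument changes the result only for capitalized words.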

getSignatureIndex

int getSignatureIndex(int wordIndex,
                      int sentencePosition,
                      String word)

addTagging

void addTagging(boolean seen,
                IntTaggedWord itw,
                double count)
Adds the tagging with count to the data structures in this unknown word model.

Parameters:
seen - Whether tagging is seen
itw - The tagging
count - Its weight

unSeenCounter

Counter<IntTaggedWord> unSeenCounter()


Stanford NLP Group