edu.stanford.nlp.parser.lexparser
Class BaseUnknownWordModel

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
All Implemented Interfaces:
UnknownWordModel, java.io.Serializable

public class BaseUnknownWordModel
extends java.lang.Object
implements UnknownWordModel

An unknown word model for a generic language. It was originally designed for German and was generalized only by removing the German-specific numeric features. It models unknown words based on their prefixes and suffixes, as well as their capitalization.

Author:
Roger Levy, Greg Donaker (corrections and modeling improvements), Christopher Manning (generalized and improved what Greg did), Anna Rafferty
See Also:
Serialized Form

Field Summary
protected static short nullTag
           
protected static int nullWord
           
protected  java.util.HashMap<Label,ClassicCounter<java.lang.String>> tagHash
          This maps from a tag (as a label) to a Counter from word signatures to their P(sig|tag), as estimated in the model.
protected static java.lang.String unknown
           
protected  int unknownLevel
          What type of equivalence classing is done in getSignature
protected  ClassicCounter<IntTaggedWord> unSeenCounter
          Has counts for taggings in terms of unseen signatures.
protected  boolean useFirst
           
protected  boolean useGT
           
protected static boolean VERBOSE
           
 
Constructor Summary
BaseUnknownWordModel(Options.LexOptions op, Lexicon lex)
           
 
Method Summary
 void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 Lexicon getLexicon()
          Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example.
 java.lang.String getSignature(java.lang.String word, int loc)
          Signature for a specific word; loc parameter is ignored.
 int getSignatureIndex(int wordIndex, int sentencePosition)
           
 int getUnknownLevel()
          Get the level of equivalence classing for the model.
 float score(IntTaggedWord itw)
           
 float score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth)
          Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
protected  float scoreGT(Label tag)
           
 double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth)
          Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)
 void setUnknownLevel(int unknownLevel)
          One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class.
 void train(java.util.Collection<Tree> trees)
          Trains the end-character-based unknown word model.
protected  void trainUnknownGT(java.util.Collection<Tree> trees)
          Trains Good-Turing estimation of unknown words.
 Counter<IntTaggedWord> unSeenCounter()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERBOSE

protected static final boolean VERBOSE
See Also:
Constant Field Values

useFirst

protected boolean useFirst

useGT

protected boolean useGT

unknownLevel

protected int unknownLevel
What type of equivalence classing is done in getSignature


unknown

protected static final java.lang.String unknown
See Also:
Constant Field Values

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

unSeenCounter

protected ClassicCounter<IntTaggedWord> unSeenCounter
Has counts for taggings in terms of unseen signatures. The IntTaggedWords are for (tag,sig), (tag,null), (null,sig), (null,null). (None for basic UNK if there are signatures.)
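
The four key shapes listed above can be illustrated with a small stand-alone sketch. It is not the class's own code: the class name UnseenCountsSketch, the string keys, and the "*" wildcard standing in for the null word or null tag are all assumptions made for illustration. Each observed (tag, signature) pair increments the specific count and the three marginal counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keeping (tag,sig), (tag,null), (null,sig)
// and (null,null) counts side by side. Not the library's implementation.
final class UnseenCountsSketch {
    // Assumed wildcard marker playing the role of the "null" word or tag.
    static final String NULL = "*";

    private final Map<String, Double> counts = new HashMap<>();

    static String key(String tag, String sig) {
        return tag + "|" + sig;
    }

    // Adds one weighted observation to the joint count and all marginals.
    void add(String tag, String sig, double count) {
        counts.merge(key(tag, sig), count, Double::sum);   // (tag, sig)
        counts.merge(key(tag, NULL), count, Double::sum);  // (tag, null)
        counts.merge(key(NULL, sig), count, Double::sum);  // (null, sig)
        counts.merge(key(NULL, NULL), count, Double::sum); // (null, null)
    }

    double get(String tag, String sig) {
        return counts.getOrDefault(key(tag, sig), 0.0);
    }
}
```

With counts stored this way, a ratio such as get(tag, sig) / get("*", sig) can be read off directly without a second pass over the data.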


tagHash

protected java.util.HashMap<Label,ClassicCounter<java.lang.String>> tagHash
This maps from a tag (as a label) to a Counter from word signatures to their P(sig|tag), as estimated in the model. For Chinese, the word signature is just the first character, or its Unicode type for things that aren't Chinese characters.
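
As a rough picture of the tag-to-signature mapping, the sketch below turns per-tag signature counts into relative-frequency estimates of P(sig|tag). It uses plain java.util maps and String keys instead of Label and ClassicCounter, and the class and method names are hypothetical; how the model actually estimates these values is not stated on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: nested map tag -> (signature -> P(sig|tag)),
// estimated here by simple relative frequency within each tag.
final class TagHashSketch {
    static Map<String, Map<String, Double>> toConditional(
            Map<String, Map<String, Double>> countsByTag) {
        Map<String, Map<String, Double>> probs = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> e : countsByTag.entrySet()) {
            // Total count of all signatures observed with this tag.
            double total = 0.0;
            for (double c : e.getValue().values()) {
                total += c;
            }
            Map<String, Double> perTag = new HashMap<>();
            for (Map.Entry<String, Double> sc : e.getValue().entrySet()) {
                perTag.put(sc.getKey(), sc.getValue() / total);
            }
            probs.put(e.getKey(), perTag);
        }
        return probs;
    }
}
```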

Constructor Detail

BaseUnknownWordModel

public BaseUnknownWordModel(Options.LexOptions op,
                            Lexicon lex)
Method Detail

score

public float score(IntTaggedWord itw,
                   int loc,
                   double c_Tseen,
                   double total,
                   double smooth)
Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Specified by:
score in interface UnknownWordModel
Parameters:
itw - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
Returns:
A float-valued score, usually - log P(word|tag)

score

public float score(IntTaggedWord itw)

scoreProbTagGivenWordSignature

public double scoreProbTagGivenWordSignature(IntTaggedWord iTW,
                                             int loc,
                                             double smooth)
Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)

Specified by:
scoreProbTagGivenWordSignature in interface UnknownWordModel
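
One common way to read "Bayesian smoothing via P(Tag|Unknown)" is an additive prior: P(T|S) = (c(T,S) + smooth * P(T|U)) / (c(S) + smooth), where smooth is the weight on the prior. The sketch below implements that formula on raw counts. It is an assumption about the general technique, not a copy of this method's body, and the class, method, and parameter names are hypothetical.

```java
// Hypothetical sketch of smoothing P(tag|signature) toward P(tag|unknown).
final class SmoothedTagGivenSignature {
    /**
     * @param cTagSig    count of (tag, signature) among unseen words
     * @param cSig       count of the signature among unseen words
     * @param cTagUnseen count of the tag among all unseen words
     * @param cUnseen    count of all unseen words (must be positive)
     * @param smooth     weight on the prior P(tag|unknown)
     */
    static double probTagGivenSignature(double cTagSig, double cSig,
                                        double cTagUnseen, double cUnseen,
                                        double smooth) {
        // Prior: how likely the tag is for any unknown word.
        double pTagGivenUnknown = cTagUnseen / cUnseen;
        // Counts for this signature pull the estimate away from the prior.
        return (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
    }
}
```

When a signature was never observed (cSig = 0) and smooth is positive, the estimate falls back to the prior P(tag|unknown); as cSig grows, the signature-specific ratio dominates.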

scoreGT

protected float scoreGT(Label tag)

getSignature

public java.lang.String getSignature(java.lang.String word,
                                     int loc)
Signature for a specific word; loc parameter is ignored.

Specified by:
getSignature in interface UnknownWordModel
Parameters:
word - The word
loc - Its sentence position
Returns:
A "signature" (which represents an equivalence class of Strings), e.g., a suffix of the string
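
To make the idea of a signature concrete, here is a minimal stand-alone sketch that maps a word to an equivalence class built from its capitalization and a fixed-length suffix. The "UNK" prefix, the "-CAP" marker, the suffixLength parameter, and the class name are all assumptions for illustration; the signatures this class actually produces depend on its options and unknownLevel.

```java
import java.util.Locale;

// Hypothetical sketch of a suffix-and-capitalization word signature.
final class SignatureSketch {
    static String signature(String word, int suffixLength) {
        StringBuilder sig = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sig.toString();
        }
        // Capitalization feature: mark words starting with an upper-case letter.
        if (Character.isUpperCase(word.charAt(0))) {
            sig.append("-CAP");
        }
        // Suffix feature: the last suffixLength characters, lower-cased.
        int start = Math.max(0, word.length() - suffixLength);
        sig.append('-').append(word.substring(start).toLowerCase(Locale.ROOT));
        return sig.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Wanderung", 3)); // prints UNK-CAP-ung
        System.out.println(signature("laufen", 2));    // prints UNK-en
    }
}
```

Words such as "Wanderung" and "Lesung" share the signature UNK-CAP-ung under this sketch, so statistics gathered for one transfer to the other.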

getSignatureIndex

public int getSignatureIndex(int wordIndex,
                             int sentencePosition)
Specified by:
getSignatureIndex in interface UnknownWordModel

train

public void train(java.util.Collection<Tree> trees)
Trains the end-character-based unknown word model.

Specified by:
train in interface UnknownWordModel
Parameters:
trees - the collection of trees to be trained over

trainUnknownGT

protected void trainUnknownGT(java.util.Collection<Tree> trees)
Trains Good-Turing estimation of unknown words.

Parameters:
trees - Trees to train model from
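
For orientation, the textbook Good-Turing idea estimates the probability mass of unseen events from the events seen exactly once. The sketch below applies it per tag: the share of a tag's tokens that belong to words seen only once with that tag. This is a simplified illustration under assumed names (GoodTuringSketch, unseenMassByTag) and an assumed input of {word, tag} pairs rather than Trees; the exact quantities this method computes are not described on this page.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-tag Good-Turing style estimate N1(tag) / N(tag).
final class GoodTuringSketch {
    /** Each element of taggedWords is a two-element array {word, tag}. */
    static Map<String, Double> unseenMassByTag(List<String[]> taggedWords) {
        // tag -> (word -> number of times the word occurred with this tag)
        Map<String, Map<String, Integer>> wordCountsByTag = new HashMap<>();
        for (String[] wt : taggedWords) {
            wordCountsByTag
                .computeIfAbsent(wt[1], t -> new HashMap<>())
                .merge(wt[0], 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : wordCountsByTag.entrySet()) {
            int total = 0;      // N(tag): all tokens with this tag
            int singletons = 0; // N1(tag): word types seen exactly once
            for (int c : e.getValue().values()) {
                total += c;
                if (c == 1) {
                    singletons++;
                }
            }
            result.put(e.getKey(), (double) singletons / total);
        }
        return result;
    }
}
```

Open-class tags such as nouns typically have many singleton words and so receive a larger share of unknown-word mass than closed-class tags.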

getLexicon

public Lexicon getLexicon()
Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example.

Specified by:
getLexicon in interface UnknownWordModel
Returns:
The lexicon used by this unknown word model

getUnknownLevel

public int getUnknownLevel()
Description copied from interface: UnknownWordModel
Get the level of equivalence classing for the model.

Specified by:
getUnknownLevel in interface UnknownWordModel
Returns:
The current level of unknown word equivalence classing

setUnknownLevel

public void setUnknownLevel(int unknownLevel)
Description copied from interface: UnknownWordModel
One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level can be used to set the model one would like. Effects of the level will vary based on the implementing class. If a given class only includes one model, setting the unknown level should have no effect.

Specified by:
setUnknownLevel in interface UnknownWordModel
Parameters:
unknownLevel - Provides a choice between different unknown word processing schemes

addTagging

public void addTagging(boolean seen,
                       IntTaggedWord itw,
                       double count)
Adds the tagging with count to the data structures in this Lexicon.

Specified by:
addTagging in interface UnknownWordModel
Parameters:
seen - Whether tagging is seen
itw - The tagging
count - Its weight

unSeenCounter

public Counter<IntTaggedWord> unSeenCounter()
Specified by:
unSeenCounter in interface UnknownWordModel


Stanford NLP Group