edu.stanford.nlp.parser.lexparser
Class EnglishUnknownWordModel

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
      extended by edu.stanford.nlp.parser.lexparser.EnglishUnknownWordModel
All Implemented Interfaces:
UnknownWordModel, Serializable

public class EnglishUnknownWordModel
extends BaseUnknownWordModel

This is a basic unknown word model for English. It supports five different types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class overlap somewhat with ArabicUnknownWordModel and were originally part of BaseLexicon.

Author:
Dan Klein, Galen Andrew, Christopher Manning, Anna Rafferty
See Also:
Serialized Form

Field Summary
protected  boolean smartMutation
           
protected  int unknownPrefixSize
           
protected  int unknownSuffixSize
           
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex
 
Constructor Summary
EnglishUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)
           
 
Method Summary
 String getSignature(String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
 int getSignatureIndex(int index, int sentencePosition, String word)
          Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.
protected  List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
           
protected  List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
           
 float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)
          Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
 double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word)
          Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, boolean keepTagsAsLabels)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, double weight)
           
 void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
           
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, getUnknownLevel, score, scoreGT, setUnknownLevel, trainUnknownGT, unSeenCounter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

smartMutation

protected final boolean smartMutation

unknownSuffixSize

protected final int unknownSuffixSize

unknownPrefixSize

protected final int unknownPrefixSize
Constructor Detail

EnglishUnknownWordModel

public EnglishUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<String> wordIndex,
                               Index<String> tagIndex)
Method Detail

train

public void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface UnknownWordModel
Overrides:
train in class BaseUnknownWordModel
Parameters:
trees - the collection of trees to be trained over

train

public void train(Collection<Tree> trees,
                  boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees.


train

public void train(Collection<Tree> trees,
                  double weight)

train

public void train(Collection<Tree> trees,
                  double weight,
                  boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree,
                                           boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

listToEvents

protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)

listOfLabeledWordsToEvents

protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)

score

public float score(IntTaggedWord iTW,
                   int loc,
                   double c_Tseen,
                   double total,
                   double smooth,
                   String word)
Description copied from class: BaseUnknownWordModel
Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Specified by:
score in interface UnknownWordModel
Overrides:
score in class BaseUnknownWordModel
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
word - The word itself; useful so we don't look it up in the index
Returns:
A float-valued score, usually - log P(word|tag)
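
As an illustration of how the parameters above could combine, here is a minimal sketch that inverts an estimate of P(tag|signature) into P(word|tag) with Bayes' rule. It is not taken from this class's source. It assumes P(tag) is estimated as c_Tseen / total and that an unseen word is treated as if it occurred once, so P(word) is 1 / total. The class and method names are hypothetical.

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class UnknownWordScoreSketch {

    /**
     * Turns P(tag | signature) into log P(word | tag) via Bayes' rule:
     *   P(word | tag) = P(tag | word) * P(word) / P(tag)
     *
     * @param pTagGivenSignature smoothed estimate standing in for P(tag | word)
     * @param cTseen             total count of this tag on seen words in training
     * @param total              total count of word tokens in training
     */
    static float logProbWordGivenTag(double pTagGivenSignature,
                                     double cTseen,
                                     double total) {
        double pTag = cTseen / total;
        // Assumption: an unknown word is treated as having been seen once.
        double pWord = 1.0 / total;
        return (float) Math.log(pTagGivenSignature * pWord / pTag);
    }

    public static void main(String[] args) {
        // Tag seen 50 times in 1000 tokens; P(tag | signature) estimated at 0.5.
        System.out.println(logProbWordGivenTag(0.5, 50.0, 1000.0));
    }
}
```

The sketch returns the natural logarithm of the estimate. Whether a caller works with that value or with its negation is a convention the sketch does not settle.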

scoreProbTagGivenWordSignature

public double scoreProbTagGivenWordSignature(IntTaggedWord iTW,
                                             int loc,
                                             double smooth,
                                             String word)
Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)

Specified by:
scoreProbTagGivenWordSignature in interface UnknownWordModel
Overrides:
scoreProbTagGivenWordSignature in class BaseUnknownWordModel
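
The smoothing described above can be illustrated with a minimal sketch. It assumes the common additive form in which the prior P(Tag|Unknown) is weighted by smooth and mixed into the signature counts. The class name, method name, and count parameters are hypothetical and are not taken from this class's implementation.

```java
// Illustrative sketch only; all names here are hypothetical.
final class SmoothedTagGivenSignature {

    /**
     * Estimates P(tag | signature), smoothed toward the prior P(tag | unknown).
     *
     * @param cTagSig count of (tag, signature) events
     * @param cSig    count of the signature over all tags
     * @param cTagUnk count of the tag over all unknown-word events
     * @param cUnk    count of all unknown-word events
     * @param smooth  weight placed on the prior
     */
    static double probTagGivenSignature(double cTagSig, double cSig,
                                        double cTagUnk, double cUnk,
                                        double smooth) {
        double priorTagGivenUnknown = cUnk > 0.0 ? cTagUnk / cUnk : 0.0;
        double denominator = cSig + smooth;
        if (denominator <= 0.0) {
            // No evidence and no smoothing mass: fall back to the prior.
            return priorTagGivenUnknown;
        }
        return (cTagSig + smooth * priorTagGivenUnknown) / denominator;
    }

    public static void main(String[] args) {
        // Signature seen 10 times, 6 of them with this tag; prior is 25/100.
        System.out.println(probTagGivenSignature(6.0, 10.0, 25.0, 100.0, 1.0));
    }
}
```

With this form, a signature that was never observed receives exactly the prior, and a frequently observed signature is dominated by its own counts.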

getSignatureIndex

public int getSignatureIndex(int index,
                             int sentencePosition,
                             String word)
Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.

Specified by:
getSignatureIndex in interface UnknownWordModel
Overrides:
getSignatureIndex in class BaseUnknownWordModel

getSignature

public String getSignature(String word,
                           int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK(-.+)?, which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.

Specified by:
getSignature in interface UnknownWordModel
Overrides:
getSignature in class BaseUnknownWordModel
Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
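
To make the idea of a signature concrete, here is a minimal sketch of a signature function in the spirit of the description above: a capitalization class, a digit flag, and the last two characters. It is not the implementation of any unknownLevel in this class. The feature labels (-INITC, -CAPS, -LC, -NUM) and the class name are hypothetical, chosen only so that the output matches the UNK(-.+)? convention.

```java
import java.util.Locale;

// Illustrative sketch only; the feature labels and names are hypothetical.
final class SignatureSketch {

    /**
     * Builds an equivalence-class string for a word.
     *
     * @param word the word to make a signature for
     * @param loc  its position in the sentence (0 means sentence-initial)
     */
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }
        char first = word.charAt(0);
        if (Character.isUpperCase(first)) {
            // Sentence-initial capitals carry less information, so mark them separately.
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else if (Character.isLowerCase(first)) {
            sb.append("-LC");
        }
        if (word.chars().anyMatch(Character::isDigit)) {
            sb.append("-NUM");
        }
        if (word.length() >= 3) {
            // Use the last two characters as a crude suffix feature.
            String suffix = word.substring(word.length() - 2).toLowerCase(Locale.ROOT);
            sb.append('-').append(suffix);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Dogs", 0));
        System.out.println(signature("Dogs", 3));
        System.out.println(signature("1990s", 2));
    }
}
```

Words that share a signature are pooled into one equivalence class, so statistics gathered for rare training words with that signature can be applied to unseen words at parse time.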


Stanford NLP Group