edu.stanford.nlp.parser.lexparser
Class EnglishUnknownWordModel

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
      extended by edu.stanford.nlp.parser.lexparser.EnglishUnknownWordModel
All Implemented Interfaces:
UnknownWordModel, Serializable

public class EnglishUnknownWordModel
extends BaseUnknownWordModel

This is a basic unknown word model for English. It supports five types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class overlap somewhat with ArabicUnknownWordModel and were originally included in BaseLexicon.

Author:
Dan Klein, Galen Andrew, Christopher Manning, Anna Rafferty
See Also:
Serialized Form

Field Summary
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, a single unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected static short nullTag
           
protected static int nullWord
           
 ClassicCounter<IntTaggedWord> seenCounter
          Records the number of times word/tag pair was seen in training data.
protected  boolean smartMutation
           
protected  Set<IntTaggedWord> tags
          Set of all tags as IntTaggedWord.
protected  int unknownLevel
          The type of equivalence classing performed in getSignature.
protected  int unknownPrefixSize
           
protected  int unknownSuffixSize
           
protected  ClassicCounter<IntTaggedWord> unSeenCounter
          Has counts for taggings in terms of unseen signatures.
protected  Set<IntTaggedWord> words
           
 
Constructor Summary
EnglishUnknownWordModel()
           
EnglishUnknownWordModel(Options.LexOptions op)
           
 
Method Summary
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 String getSignature(String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
 int getSignatureIndex(int wordIndex, int sentencePosition)
          Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
protected  List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
           
protected  List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
           
 void readData(BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
 double score(IntTaggedWord iTW, int loc)
          Currently we don't consider loc in determining score.
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, boolean keepTagsAsLabels)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, double weight)
           
 void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
           
 void tune(Collection<Tree> trees)
           
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
getLexicon, getUnknownLevel, score, score, setLexicon, setUnknownLevel
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

smartMutation

protected boolean smartMutation

tags

protected transient Set<IntTaggedWord> tags
Set of all tags as IntTaggedWord. Alive in both train and runtime phases, but transient.


words

protected transient Set<IntTaggedWord> words

lastSignatureIndex

protected transient int lastSignatureIndex
Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, a single unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

unknownLevel

protected int unknownLevel
The type of equivalence classing performed in getSignature.


unknownSuffixSize

protected int unknownSuffixSize

unknownPrefixSize

protected int unknownPrefixSize

seenCounter

public ClassicCounter<IntTaggedWord> seenCounter
Records the number of times each word/tag pair was seen in training data. Includes word/tag pairs where one element is a wildcard rather than a real word or tag.


unSeenCounter

protected ClassicCounter<IntTaggedWord> unSeenCounter
Has counts for taggings in terms of unseen signatures. The IntTaggedWords are for (tag,sig), (tag,null), (null,sig), (null,null). (None for basic UNK if there are signatures.)
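
The four key shapes listed above can be illustrated with a small stand-alone sketch. It is not the library's code: it assumes a plain HashMap in place of ClassicCounter, a two-element list in place of IntTaggedWord, and a hypothetical wildcard constant ANY standing in for nullWord/nullTag. The class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and data structures are assumptions,
// not the Stanford implementation.
final class UnseenCountSketch {
    // Hypothetical wildcard id, standing in for nullWord / nullTag.
    static final int ANY = -1;

    private final Map<List<Integer>, Double> counts = new HashMap<>();

    // Adds `count` under all four keys:
    // (tag,sig), (tag,null), (null,sig), (null,null).
    void add(int tag, int sig, double count) {
        bump(tag, sig, count);
        bump(tag, ANY, count);
        bump(ANY, sig, count);
        bump(ANY, ANY, count);
    }

    private void bump(int tag, int sig, double count) {
        counts.merge(Arrays.asList(tag, sig), count, Double::sum);
    }

    // Returns the accumulated count for a key, or 0.0 if it was never added.
    double count(int tag, int sig) {
        return counts.getOrDefault(Arrays.asList(tag, sig), 0.0);
    }

    public static void main(String[] args) {
        UnseenCountSketch c = new UnseenCountSketch();
        c.add(3, 7, 1.0);
        c.add(3, 8, 2.0);
        System.out.println(c.count(3, ANY));   // total mass for tag 3
        System.out.println(c.count(ANY, 7));   // total mass for signature 7
    }
}
```

Keeping the marginal totals next to the joint counts means a ratio such as count(tag,sig) / count(tag,null) can be read off without a second pass over the data.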

Constructor Detail

EnglishUnknownWordModel

public EnglishUnknownWordModel()

EnglishUnknownWordModel

public EnglishUnknownWordModel(Options.LexOptions op)
Method Detail

train

public void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface UnknownWordModel
Overrides:
train in class BaseUnknownWordModel
Parameters:
trees - the collection of trees to be trained over

train

public void train(Collection<Tree> trees,
                  boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees.


train

public void train(Collection<Tree> trees,
                  double weight)

train

public void train(Collection<Tree> trees,
                  double weight,
                  boolean keepTagsAsLabels)

tune

public void tune(Collection<Tree> trees)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree,
                                           boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

listToEvents

protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)

listOfLabeledWordsToEvents

protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)

score

public double score(IntTaggedWord iTW,
                    int loc)
Description copied from class: BaseUnknownWordModel
Currently we don't consider loc in determining score.

Specified by:
score in interface UnknownWordModel
Overrides:
score in class BaseUnknownWordModel
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
Returns:
A double valued score, usually - log P(word|tag)

getSignatureIndex

public int getSignatureIndex(int wordIndex,
                             int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. Caches the last signature index returned.

Specified by:
getSignatureIndex in interface UnknownWordModel
Overrides:
getSignatureIndex in class BaseUnknownWordModel
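
The caching behavior described above can be sketched as a single-entry cache. This is a minimal illustration, not the class's actual code: the field names mirror lastWordToSignaturize, lastSentencePosition, and lastSignatureIndex from this page, but the assumption that the cache is keyed on both arguments, the `compute` callback, and the `computeCalls` counter are hypothetical.

```java
import java.util.function.IntBinaryOperator;

// Illustrative sketch only; not the Stanford implementation.
final class LastSignatureCache {
    private boolean hasCached = false;
    private int lastWordToSignaturize;
    private int lastSentencePosition;
    private int lastSignatureIndex;

    // Hypothetical stand-in for the real signature computation.
    private final IntBinaryOperator compute;
    // Counts how often the underlying computation actually ran.
    int computeCalls = 0;

    LastSignatureCache(IntBinaryOperator compute) {
        this.compute = compute;
    }

    int getSignatureIndex(int wordIndex, int sentencePosition) {
        // Cache hit: same word at the same sentence position as last time.
        if (hasCached
                && wordIndex == lastWordToSignaturize
                && sentencePosition == lastSentencePosition) {
            return lastSignatureIndex;
        }
        // Cache miss: recompute and remember the single most recent result.
        computeCalls++;
        lastSignatureIndex = compute.applyAsInt(wordIndex, sentencePosition);
        lastWordToSignaturize = wordIndex;
        lastSentencePosition = sentencePosition;
        hasCached = true;
        return lastSignatureIndex;
    }

    public static void main(String[] args) {
        // Toy computation: the result differs for sentence-initial position 0.
        LastSignatureCache cache =
            new LastSignatureCache((w, p) -> w * 10 + (p == 0 ? 1 : 0));
        cache.getSignatureIndex(4, 0);
        cache.getSignatureIndex(4, 0); // served from the cache
        cache.getSignatureIndex(4, 2); // different position, recomputed
        System.out.println(cache.computeCalls);
    }
}
```

A one-entry cache is enough when repeated lookups for the same unknown word arrive back to back; it would not help if lookups for different words were interleaved.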

getSignature

public String getSignature(String word,
                           int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK or UNK-.* , which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) that are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.

Specified by:
getSignature in interface UnknownWordModel
Overrides:
getSignature in class BaseUnknownWordModel
Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
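
To make the idea of a signature concrete, here is a rough sketch in the spirit of level 1 (last two characters plus a simple capitalization class). It is an assumption-laden illustration, not the strings the library actually produces: the class name and the labels ALLC, INITC, CAP, and NOCAP are hypothetical, as is the choice to add a suffix only for words of three or more characters.

```java
import java.util.Locale;

// Illustrative sketch only; the labels and rules are hypothetical
// and do not reproduce the Stanford implementation.
final class SignatureSketch {

    // Returns a string of the form UNK or UNK-<capClass>[-<last two chars>].
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }
        boolean startsUpper = Character.isUpperCase(word.charAt(0));
        boolean hasLower = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i))) {
                hasLower = true;
            }
        }
        if (startsUpper && !hasLower) {
            sb.append("-ALLC");      // leading capital, no lowercase letters
        } else if (startsUpper && loc == 0) {
            sb.append("-INITC");     // capitalized and sentence-initial
        } else if (startsUpper) {
            sb.append("-CAP");       // capitalized, not sentence-initial
        } else {
            sb.append("-NOCAP");     // no leading capital
        }
        if (word.length() >= 3) {
            // Last two characters, lowercased, as a crude suffix feature.
            String suffix = word.substring(word.length() - 2);
            sb.append('-').append(suffix.toLowerCase(Locale.ROOT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Dogs", 0)); // UNK-INITC-gs
        System.out.println(signature("Dogs", 3)); // UNK-CAP-gs
    }
}
```

The two calls in main show why loc matters: the same capitalized word receives a different signature when it is sentence-initial, which matches the note on lastSignatureIndex about one unknown word being parsed with two signatures.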

readData

public void readData(BufferedReader in)
              throws IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in.

Specified by:
readData in interface UnknownWordModel
Overrides:
readData in class BaseUnknownWordModel
Throws:
IOException

addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.



Stanford NLP Group