EnglishUnknownWordModel (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
- - edu.stanford.nlp.parser.lexparser.EnglishUnknownWordModel

All Implemented Interfaces:

UnknownWordModel, java.io.Serializable
```
public class EnglishUnknownWordModel
extends BaseUnknownWordModel
```
This is a basic unknown word model for English. It supports 5 different types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class tend to overlap somewhat with ArabicUnknownWordModel and were originally included in BaseLexicon.

Author:

Dan Klein, Galen Andrew, Christopher Manning, Anna Rafferty

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`smartMutation`
`protected int`	`unknownPrefixSize`
`protected int`	`unknownSuffixSize`
`protected java.lang.String`	`wordClassesFile`

Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex

Constructor Summary

Constructors
Constructor and Description
`EnglishUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)` This constructor creates a UWM (unknown word model) with empty data structures.
`EnglishUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)`

Method Summary

Modifier and Type	Method and Description
`java.lang.String`	`getSignature(java.lang.String word, int loc)` This routine returns a String that is the "signature" of the class of a word.
`int`	`getSignatureIndex(int index, int sentencePosition, java.lang.String word)` Returns the index of the signature of the word numbered `index`, where the signature is the String representation of the word's unknown-word features.
`float`	`score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, java.lang.String word)` Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
`double`	`scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, java.lang.String word)` Calculate P(Tag\|Signature) with Bayesian smoothing via just P(Tag\|Unknown)

Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, getUnknownLevel, score, scoreGT, unSeenCounter

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - smartMutation
```
protected final boolean smartMutation
```
  - unknownSuffixSize
```
protected final int unknownSuffixSize
```
  - unknownPrefixSize
```
protected final int unknownPrefixSize
```
  - wordClassesFile
```
protected final java.lang.String wordClassesFile
```
- Constructor Detail
  - EnglishUnknownWordModel
```
public EnglishUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<java.lang.String> wordIndex,
                               Index<java.lang.String> tagIndex,
                               ClassicCounter<IntTaggedWord> unSeenCounter)
```
  - EnglishUnknownWordModel
```
public EnglishUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<java.lang.String> wordIndex,
                               Index<java.lang.String> tagIndex)
```
    This constructor creates a UWM (unknown word model) with empty data structures. Use it only if the data is loaded separately, for example by reading in text lines containing the data.
- Method Detail
  - score
```
public float score(IntTaggedWord iTW,
                   int loc,
                   double c_Tseen,
                   double total,
                   double smooth,
                   java.lang.String word)
```
    Description copied from class: BaseUnknownWordModel
    
    Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
    
    Specified by:
    
    score in interface UnknownWordModel
    
    Overrides:
    
    score in class BaseUnknownWordModel
    
    Parameters:
    
    iTW - An IntTaggedWord pairing a word and POS tag
    
    loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
    
    c_Tseen - Total count of this tag (on seen words) in training
    
    total - Total count of word tokens in training
    
    smooth - Weighting on prior P(T|U) in estimate
    
    word - The word itself; useful so we don't look it up in the index
    
    Returns:
    
    A float-valued score, usually - log P(word|tag)
  - scoreProbTagGivenWordSignature
```
public double scoreProbTagGivenWordSignature(IntTaggedWord iTW,
                                             int loc,
                                             double smooth,
                                             java.lang.String word)
```
    Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)
    
    Specified by:
    
    scoreProbTagGivenWordSignature in interface UnknownWordModel
    
    Overrides:
    
    scoreProbTagGivenWordSignature in class BaseUnknownWordModel
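    
    The one-line description above can be made concrete with a small sketch. It assumes the common additive form of Bayesian smoothing, in which the prior P(Tag|Unknown) is weighted by `smooth` pseudo-counts; the class, method, and parameter names are hypothetical and are not taken from the Stanford source.
    
```java
/** Hypothetical sketch of smoothing P(Tag|Signature) toward P(Tag|Unknown). */
final class TagGivenSignatureSketch {

    private TagGivenSignatureSketch() {}

    /**
     * @param countTagSig         times the tag was seen with this signature (assumed input)
     * @param countSig            times the signature was seen with any tag (assumed input)
     * @param probTagGivenUnknown prior P(Tag|Unknown), between 0 and 1
     * @param smooth              weight on the prior, in pseudo-counts; must be positive
     * @return the smoothed estimate of P(Tag|Signature)
     */
    static double smoothedProb(double countTagSig,
                               double countSig,
                               double probTagGivenUnknown,
                               double smooth) {
        if (smooth <= 0.0) {
            throw new IllegalArgumentException("smooth must be positive");
        }
        // With no observations of the signature this reduces to the prior;
        // with many observations the relative frequency dominates.
        return (countTagSig + smooth * probTagGivenUnknown) / (countSig + smooth);
    }
}
```
    
    Under this assumption, a signature that was never observed falls back entirely to P(Tag|Unknown), which is why the prior alone is enough to give every tag a nonzero estimate.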
  - getSignatureIndex
```
public int getSignatureIndex(int index,
                             int sentencePosition,
                             java.lang.String word)
```
    Returns the index of the signature of the word numbered `index`, where the signature is the String representation of the word's unknown-word features.
    
    Specified by:
    
    getSignatureIndex in interface UnknownWordModel
    
    Overrides:
    
    getSignatureIndex in class BaseUnknownWordModel
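    
    One plausible way to turn a signature String into an integer index is to intern it in a string-to-int table. The sketch below uses plain Java collections rather than the library's own `Index` type; the class `SignatureIndexSketch` and its methods are hypothetical and only illustrate the idea.
    
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: assigns a stable integer id to each distinct signature string. */
final class SignatureIndexSketch {

    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    /** Returns the id of the signature, adding it to the table on first sight. */
    int indexOf(String signature) {
        Integer existing = ids.get(signature);
        if (existing != null) {
            return existing;
        }
        int id = strings.size();   // ids are assigned in order of first appearance
        strings.add(signature);
        ids.put(signature, id);
        return id;
    }

    /** Returns the signature string stored under the given id. */
    String get(int id) {
        return strings.get(id);
    }
}
```
    
    Because all words that share a signature map to the same id, counts collected for that id pool evidence across the whole equivalence class.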
  - getSignature
```
public java.lang.String getSignature(java.lang.String word,
                                     int loc)
```
    This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK(-.+)? , which is simply assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.
    
    Specified by:
    
    getSignature in interface UnknownWordModel
    
    Overrides:
    
    getSignature in class BaseUnknownWordModel
    
    Parameters:
    
    word - The word to make a signature for
    
    loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
    
    Returns:
    
    A String that is its signature (equivalence class)
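    
    As a rough illustration of the level-1 behavior described above (last two characters plus a simple capitalization class), here is a minimal sketch. It is not the Stanford implementation: the class name `SignatureSketch`, the feature labels INITC, CAPS, and LC, and the exact string layout are assumptions made for illustration; only the UNK prefix convention comes from the description above.
    
```java
import java.util.Locale;

/** Hypothetical sketch of a level-1-style signature; not the library's code. */
final class SignatureSketch {

    private SignatureSketch() {}

    /**
     * Builds a signature matching UNK(-.+)? from a capitalization class
     * and the lowercased last two characters of the word.
     *
     * @param word the word to make a signature for
     * @param loc  its position in the sentence (0 means sentence-initial)
     */
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }
        char first = word.charAt(0);
        if (Character.isUpperCase(first)) {
            // Sentence-initial capitals are weak evidence of a proper noun,
            // so they get their own label (assumed name).
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else if (Character.isLowerCase(first)) {
            sb.append("-LC");
        }
        String lower = word.toLowerCase(Locale.ROOT);
        String suffix = lower.length() > 2 ? lower.substring(lower.length() - 2) : lower;
        sb.append('-').append(suffix);
        return sb.toString();
    }
}
```
    
    Under these assumptions, `SignatureSketch.signature("Dogs", 0)` yields `UNK-INITC-gs`, while the same word at position 3 yields `UNK-CAPS-gs`, so sentence-initial capitalized words fall into a different equivalence class.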
