EnglishUnknownWordModel (Stanford CoreNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
- - edu.stanford.nlp.parser.lexparser.EnglishUnknownWordModel

All Implemented Interfaces:

UnknownWordModel, Serializable
```
public class EnglishUnknownWordModel
extends BaseUnknownWordModel
```
This is a basic unknown word model for English. It supports 5 different types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class tend to overlap somewhat with ArabicUnknownWordModel and were originally included in BaseLexicon.

Author:

Dan Klein, Galen Andrew, Christopher Manning, Anna Rafferty

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type Field and Description

protected boolean smartMutation

protected int unknownPrefixSize

protected int unknownSuffixSize

protected String wordClassesFile

Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex

Constructor Summary

Constructors
Constructor and Description

EnglishUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)
This constructor creates a UWM (unknown word model) with empty data structures.

EnglishUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)

Method Summary

Modifier and Type Method and Description

String getSignature(String word, int loc)
This routine returns a String that is the "signature" of the class of a word.

int getSignatureIndex(int index, int sentencePosition, String word)
Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.

float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)
Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word)
Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)

Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, getUnknownLevel, score, scoreGT, unSeenCounter

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

smartMutation

protected final boolean smartMutation

unknownSuffixSize

protected final int unknownSuffixSize

unknownPrefixSize

protected final int unknownPrefixSize

wordClassesFile

protected final String wordClassesFile

Constructor Detail

EnglishUnknownWordModel

public EnglishUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)

EnglishUnknownWordModel

public EnglishUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)

This constructor creates a UWM (unknown word model) with empty data structures. Use it only if the data is loaded separately, for example by reading in text lines containing the data.

Method Detail

score

public float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)

Description copied from class: BaseUnknownWordModel

Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Specified by:

score in interface UnknownWordModel

Overrides:

score in class BaseUnknownWordModel

Parameters:

iTW - An IntTaggedWord pairing a word and POS tag

loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value

c_Tseen - Total count of this tag (on seen words) in training

total - Total count of word tokens in training

smooth - Weighting on prior P(T|U) in estimate

word - The word itself; useful so we don't look it up in the index

Returns:

A float-valued score, usually - log P(word|tag)
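
As a rough illustration of how the documented parameters could combine, the sketch below inverts P(Tag|Signature) into P(word|tag) with Bayes' rule. It is a minimal sketch under stated assumptions, not the library's code: the class and method names are hypothetical, P(tag) is taken as c_Tseen / total, and the unknown word is assumed to have probability 1 / total. It returns a natural-log probability; the sign convention is also an assumption, so negate the result if a cost is wanted.

```java
/** Hypothetical sketch; not the CoreNLP implementation. */
final class UnknownWordScoreSketch {

    /**
     * Estimates log P(word|tag) for an unknown word via Bayes' rule:
     * P(W|T) = P(T|S) * P(W) / P(T).
     *
     * @param probTagGivenSig P(Tag|Signature), e.g. a smoothed estimate
     * @param cTseen          total count of this tag on seen words in training
     * @param total           total count of word tokens in training
     */
    static double logProbWordGivenTag(double probTagGivenSig, double cTseen, double total) {
        if (cTseen <= 0 || total <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        double pTag = cTseen / total; // P(T)
        double pWord = 1.0 / total;   // P(W): assumption, unknown word counted once
        return Math.log(probTagGivenSig * pWord / pTag);
    }

    public static void main(String[] args) {
        // P(T|S) = 0.5, tag seen 100 times among 1000 tokens.
        System.out.println(logProbWordGivenTag(0.5, 100, 1000));
    }
}
```

Under these assumptions the total cancels, so the estimate reduces to log(P(Tag|Signature) / c_Tseen).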

scoreProbTagGivenWordSignature

public double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, String word)

Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)

Specified by:

scoreProbTagGivenWordSignature in interface UnknownWordModel

Overrides:

scoreProbTagGivenWordSignature in class BaseUnknownWordModel
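
The smoothing described above can be sketched as an additive estimate that backs off to the prior P(Tag|Unknown). This is a minimal sketch, assuming raw counts are available as plain doubles; the class name, method name, and parameter names are hypothetical and do not come from the library.

```java
/** Hypothetical sketch; not the CoreNLP implementation. */
final class TagGivenSignatureSketch {

    /**
     * Smoothed P(Tag|Signature) = (c(T,S) + smooth * P(T|U)) / (c(S) + smooth).
     *
     * @param countTagSig c(T,S): times the tag was seen with this signature
     * @param countSig    c(S): times this signature was seen with any tag
     * @param countTagUnk c(T,U): times the tag was seen on any unknown-like word
     * @param countUnk    c(U): total count of unknown-like words
     * @param smooth      weight on the prior P(T|U)
     */
    static double probTagGivenSignature(double countTagSig, double countSig,
                                        double countTagUnk, double countUnk,
                                        double smooth) {
        if (countUnk <= 0 || smooth <= 0) {
            throw new IllegalArgumentException("countUnk and smooth must be positive");
        }
        double priorTagGivenUnknown = countTagUnk / countUnk; // P(T|U)
        return (countTagSig + smooth * priorTagGivenUnknown) / (countSig + smooth);
    }

    public static void main(String[] args) {
        // Signature seen 4 times, 3 of them with this tag; prior P(T|U) = 10/40.
        System.out.println(probTagGivenSignature(3, 4, 10, 40, 1.0));
        // Signature never seen: the estimate falls back to the prior.
        System.out.println(probTagGivenSignature(0, 0, 10, 40, 1.0));
    }
}
```

With a larger smooth value the estimate stays closer to the prior; with many observations of the signature, the observed ratio c(T,S) / c(S) dominates.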

getSignatureIndex

public int getSignatureIndex(int index, int sentencePosition, String word)

Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.

Specified by:

getSignatureIndex in interface UnknownWordModel

Overrides:

getSignatureIndex in class BaseUnknownWordModel
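
One way to picture the signature-to-index mapping is: compute the signature string, then look it up in a string index, adding it if it is not yet present. The sketch below assumes exactly that. The class, its small map-backed index, and the pluggable signature function are all hypothetical stand-ins, not the library's Index type or its signature logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical sketch; not the CoreNLP implementation. */
final class SignatureIndexSketch {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();
    // Stand-in for a signature function taking (word, sentencePosition).
    private final BiFunction<String, Integer, String> signatureFn;

    SignatureIndexSketch(BiFunction<String, Integer, String> signatureFn) {
        this.signatureFn = signatureFn;
    }

    /** Returns the id of s, assigning the next free id if s is new. */
    int addToIndex(String s) {
        Integer existing = indexOf.get(s);
        if (existing != null) {
            return existing;
        }
        int id = entries.size();
        entries.add(s);
        indexOf.put(s, id);
        return id;
    }

    /** Maps a word at a sentence position to the index of its signature. */
    int signatureIndex(int sentencePosition, String word) {
        return addToIndex(signatureFn.apply(word, sentencePosition));
    }

    String get(int id) {
        return entries.get(id);
    }

    public static void main(String[] args) {
        SignatureIndexSketch idx =
            new SignatureIndexSketch((w, loc) -> w.endsWith("s") ? "UNK-s" : "UNK");
        System.out.println(idx.signatureIndex(1, "dogs")); // 0
        System.out.println(idx.signatureIndex(2, "cats")); // 0
        System.out.println(idx.signatureIndex(1, "tree")); // 1
    }
}
```

Words that share a signature share an index, which is what lets counts for unseen words be pooled by equivalence class.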

getSignature

public String getSignature(String word, int loc)

This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the strings returned match the pattern UNK(-.+)?, which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.

Specified by:

getSignature in interface UnknownWordModel

Overrides:

getSignature in class BaseUnknownWordModel

Parameters:

word - The word to make a signature for

loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)

Returns:

A String that is its signature (equivalence class)
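
To make the idea of a signature concrete, here is a minimal sketch loosely modeled on the description of level 1: a capitalization class plus the last two characters. It is not the library's algorithm. The class name and the feature labels (INITC, CAPS, LC) are invented for illustration, and the real signatures may differ in both labels and features. The only convention taken from the documentation is that results match UNK(-.+)?.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the CoreNLP implementation. */
final class SignatureSketch {
    // Documented convention: signatures match UNK(-.+)?
    static final Pattern UNK_PATTERN = Pattern.compile("UNK(-.+)?");

    /**
     * Builds a signature from a capitalization class and the last two
     * characters of the word. The labels are made up for this sketch.
     */
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }
        char first = word.charAt(0);
        if (Character.isUpperCase(first)) {
            // Sentence-initial capitals are less informative, so mark them separately.
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else if (Character.isLowerCase(first)) {
            sb.append("-LC");
        }
        // Suffix feature: last two characters, lowercased (whole word if shorter).
        String suffix = word.length() > 2 ? word.substring(word.length() - 2) : word;
        sb.append('-').append(suffix.toLowerCase(Locale.ROOT));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Walking", 0)); // UNK-INITC-ng
        System.out.println(signature("Walking", 3)); // UNK-CAPS-ng
        System.out.println(signature("dogs", 2));    // UNK-LC-gs
    }
}
```

Because the same capitalized word gets a different signature at position 0 than elsewhere, a model built on such signatures can treat sentence-initial capitals differently, as the loc parameter description suggests.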
