ArabicUnknownWordModel (Stanford CoreNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
- - edu.stanford.nlp.parser.lexparser.ArabicUnknownWordModel

All Implemented Interfaces:

UnknownWordModel, Serializable
```
public class ArabicUnknownWordModel
extends BaseUnknownWordModel
```
This is a basic unknown word model for Arabic. It supports 4 different types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class tend to overlap somewhat with EnglishUnknownWordModel and were originally included in BaseLexicon.

Author:

Roger Levy, Christopher Manning, Anna Rafferty

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type Field and Description

protected boolean smartMutation

protected int unknownPrefixSize

protected int unknownSuffixSize

Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex

Constructor Summary

Constructors
Constructor and Description

ArabicUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)
This constructor creates a UWM (unknown word model) with empty data structures.

ArabicUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)

Method Summary

Modifier and Type Method and Description

String getSignature(String word, int loc)
Unknown word levels 6-9 were added for Arabic.

int getSignatureIndex(int index, int sentencePosition, String word)
Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.

int getUnknownLevel()
Get the level of equivalence classing for the model.

float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)
Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, score, scoreGT, scoreProbTagGivenWordSignature, unSeenCounter

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

smartMutation

protected final boolean smartMutation

unknownSuffixSize

protected final int unknownSuffixSize

unknownPrefixSize

protected final int unknownPrefixSize

Constructor Detail

ArabicUnknownWordModel

public ArabicUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)

ArabicUnknownWordModel

public ArabicUnknownWordModel(Options op, Lexicon lex, Index<String> wordIndex, Index<String> tagIndex)

This constructor creates a UWM (unknown word model) with empty data structures. Use it only when the data is loaded separately, for example by reading in text lines containing the data.

Method Detail

score

public float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, String word)

Description copied from class: BaseUnknownWordModel

Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Specified by:

score in interface UnknownWordModel

Overrides:

score in class BaseUnknownWordModel

Parameters:

iTW - An IntTaggedWord pairing a word and POS tag

loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value

c_Tseen - Total count of this tag (on seen words) in training

total - Total count of word tokens in training

smooth - Weighting on prior P(T|U) in estimate

word - The word itself; useful so we don't look it up in the index

Returns:

A float-valued score, usually - log P(word|tag)
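
The parameters above suggest a smoothed Bayes'-rule estimate. The following is a minimal sketch of how such a score could be computed, not the implementation of this class: the class name, the method name, the arguments `cTagSig`, `cSig` and `pTagGivenUnknown`, and the choice P(word) = 1/total are all assumptions made for illustration.

```java
public class UnknownWordScoreSketch {

    /**
     * Hypothetical estimate of log P(word|tag) for an unknown word.
     *
     * cTagSig           count of (tag, signature) pairs in training (assumed input)
     * cSig              count of the signature in training (assumed input)
     * pTagGivenUnknown  prior P(T|U) over tags for unknown words (assumed input)
     * cTseen            total count of this tag on seen words
     * total             total count of word tokens
     * smooth            weighting on the prior P(T|U)
     */
    public static double logProbWordGivenTag(double cTagSig, double cSig,
                                             double pTagGivenUnknown,
                                             double cTseen, double total,
                                             double smooth) {
        // Smooth the signature-based tag estimate toward the prior P(T|U).
        double pTagGivenSig = (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
        // Relative frequency of the tag among seen words.
        double pTag = cTseen / total;
        // Assumption: an unknown word is treated as if it occurred once.
        double pWord = 1.0 / total;
        // Bayes' rule: P(W|T) = P(T|W) * P(W) / P(T), returned in log space.
        return Math.log(pTagGivenSig * pWord / pTag);
    }

    public static void main(String[] args) {
        double s = logProbWordGivenTag(3, 4, 0.5, 100, 1000, 1.0);
        System.out.println(s);
    }
}
```

A larger `smooth` pulls the estimate toward the prior, which matters most for signatures that were rarely observed in training.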

getSignatureIndex

public int getSignatureIndex(int index, int sentencePosition, String word)

Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.

Specified by:

getSignatureIndex in interface UnknownWordModel

Overrides:

getSignatureIndex in class BaseUnknownWordModel
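
The relationship between a signature string and its index can be sketched as follows. This is an illustration only: `SignatureIndexSketch`, its `HashMap`-backed index, and the stand-in `signature` function are hypothetical and do not use the CoreNLP `Index` type or the real signature logic.

```java
import java.util.HashMap;
import java.util.Map;

public class SignatureIndexSketch {

    // Hypothetical stand-in for a string index: maps each string to a stable int id.
    private final Map<String, Integer> index = new HashMap<>();

    // Returns the id of s, assigning the next free id on first sight.
    public int addToIndex(String s) {
        Integer existing = index.get(s);
        if (existing != null) {
            return existing;
        }
        int id = index.size();
        index.put(s, id);
        return id;
    }

    // Stand-in signature function (assumption): only checks for an "Al" prefix.
    public static String signature(String word, int loc) {
        return word.startsWith("Al") ? "UNK-Al" : "UNK";
    }

    // Computes the signature string, then maps it to its index.
    public int getSignatureIndex(int sentencePosition, String word) {
        return addToIndex(signature(word, sentencePosition));
    }

    public static void main(String[] args) {
        SignatureIndexSketch sketch = new SignatureIndexSketch();
        System.out.println(sketch.getSignatureIndex(0, "AlktAb"));
        System.out.println(sketch.getSignatureIndex(1, "ktb"));
    }
}
```

Words that share a signature share an index, which is what lets counts for unseen words be pooled by equivalence class.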

getSignature

public String getSignature(String word, int loc)

Unknown word levels 6-9 were added for Arabic. Level 6 looks for the prefix Al- (and knows that Buckwalter uses various symbols as letters), while level 7 looks only for numbers and the last letter. Level 8 looks for Al-, looks for several useful suffixes, and tracks the first letter of the word (the first letter seems somewhat more informative than the last letter overall). Level 9 tries to build on 8 while avoiding some of its perceived flaws: really it was using the first and last letter.

Specified by:

getSignature in interface UnknownWordModel

Overrides:

getSignature in class BaseUnknownWordModel

Parameters:

word - The word to make a signature for

loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)

Returns:

A String that is its signature (equivalence class)
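
To make the idea of a signature concrete, here is a minimal sketch that combines a few of the cues named in the description (the Al- prefix, digits, first and last letter). It is not the code of any of levels 6-9: the class name, the marker strings such as `-Al` and `-NUM`, and the order of the checks are invented for illustration, and the handling of Buckwalter symbols and suffixes is omitted.

```java
public class ArabicSignatureSketch {

    // Hypothetical helper: builds a coarse equivalence-class string for a word
    // assumed to be written in Buckwalter transliteration.
    public static String signature(String word) {
        StringBuilder sb = new StringBuilder("UNK");

        // Cue 1: the definite-article prefix "Al".
        if (word.startsWith("Al")) {
            sb.append("-Al");
        }

        // Cue 2: the token contains at least one digit.
        boolean hasDigit = word.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            sb.append("-NUM");
        }

        // Cue 3: the first and last characters of the word.
        if (!word.isEmpty()) {
            sb.append("-F").append(word.charAt(0));
            sb.append("-L").append(word.charAt(word.length() - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("AlktAb")); // UNK-Al-FA-Lb
        System.out.println(signature("1995"));   // UNK-NUM-F1-L5
    }
}
```

Coarser signatures pool more unknown words into each class, while finer ones separate words better but leave fewer training counts per class.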

getUnknownLevel

public int getUnknownLevel()

Description copied from interface: UnknownWordModel

Get the level of equivalence classing for the model. One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level can be queried with this method.

Specified by:

getUnknownLevel in interface UnknownWordModel

Overrides:

getUnknownLevel in class BaseUnknownWordModel

Returns:

The current level of unknown word equivalence classing
