edu.stanford.nlp.parser.lexparser
Class ArabicUnknownWordModel

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
      extended by edu.stanford.nlp.parser.lexparser.ArabicUnknownWordModel
All Implemented Interfaces:
UnknownWordModel, Serializable

public class ArabicUnknownWordModel
extends BaseUnknownWordModel

This is a basic unknown word model for Arabic. It supports 4 different types of feature modeling; see getSignature(String, int). Implementation note: the contents of this class tend to overlap somewhat with EnglishUnknownWordModel and were originally included in BaseLexicon.

Author:
Dan Klein, Galen Andrew, Christopher Manning, Anna Rafferty
See Also:
Serialized Form

Field Summary
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          We cache the last signature looked up, because the same one is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected  boolean smartMutation
           
protected  int unknownPrefixSize
           
protected  int unknownSuffixSize
           
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
nullTag, nullWord, tagHash, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE
 
Constructor Summary
ArabicUnknownWordModel(Options.LexOptions op, Lexicon lex)
           
 
Method Summary
 String getSignature(String word, int loc)
          Unknown levels 6-9 were added for Arabic.
 int getSignatureIndex(int wordIndex, int sentencePosition)
          Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
 int getUnknownLevel()
          Get the level of equivalence classing for the model.
protected  List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
           
protected  List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
           
 float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth)
          Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
 void setUnknownLevel(int unknownLevel)
          One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class.
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, boolean keepTagsAsLabels)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, double weight)
           
 void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
           
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, score, scoreGT, scoreProbTagGivenWordSignature, trainUnknownGT, unSeenCounter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

smartMutation

protected boolean smartMutation

lastSignatureIndex

protected transient int lastSignatureIndex
We cache the last signature looked up, because the same one is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

unknownSuffixSize

protected int unknownSuffixSize

unknownPrefixSize

protected int unknownPrefixSize
Constructor Detail

ArabicUnknownWordModel

public ArabicUnknownWordModel(Options.LexOptions op,
                              Lexicon lex)
Method Detail

train

public void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface UnknownWordModel
Overrides:
train in class BaseUnknownWordModel
Parameters:
trees - the collection of trees to be trained over

train

public void train(Collection<Tree> trees,
                  boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees.

Parameters:
trees - The trees to build a lexicon from
keepTagsAsLabels - Whether tags should be represented as Labels or Strings in the lexicon.

train

public void train(Collection<Tree> trees,
                  double weight)

train

public void train(Collection<Tree> trees,
                  double weight,
                  boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree,
                                           boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

listToEvents

protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)

listOfLabeledWordsToEvents

protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)

score

public float score(IntTaggedWord iTW,
                   int loc,
                   double c_Tseen,
                   double total,
                   double smooth)
Description copied from class: BaseUnknownWordModel
Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.

Specified by:
score in interface UnknownWordModel
Overrides:
score in class BaseUnknownWordModel
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
Returns:
A float-valued score, usually - log P(word|tag)
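
The parameter descriptions above suggest a smoothed estimate in which the prior P(T|U) is mixed into signature counts. The following is a minimal sketch of one such estimate, not the class's actual code: the class name UnknownWordScoreSketch, the arguments cTagSig, cSig and pTagGivenUnknown, the Bayes inversion step and the sign of the returned value are all assumptions made for illustration.

```java
/** Illustrative only: a hypothetical smoothed score, not the Stanford implementation. */
class UnknownWordScoreSketch {

    /**
     * @param cTagSig          count of (tag, signature) pairs in training (assumed input)
     * @param cSig             count of the signature in training (assumed input)
     * @param pTagGivenUnknown prior P(T|U) over tags for unknown words (assumed input)
     * @param cTseen           total count of this tag on seen words
     * @param total            total count of word tokens
     * @param smooth           weighting on the prior P(T|U)
     * @return log P(word|tag) under the assumptions above
     */
    static float score(double cTagSig, double cSig, double pTagGivenUnknown,
                       double cTseen, double total, double smooth) {
        // Mix the prior into the signature counts (additive smoothing toward P(T|U)).
        double pTagGivenSig = (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
        double pTag = cTseen / total;   // P(tag) estimated from seen words
        double pWord = 1.0 / total;     // crude P(word) for a previously unseen word
        // Bayes' rule: P(word|tag) = P(tag|word) * P(word) / P(tag)
        return (float) Math.log(pTagGivenSig * pWord / pTag);
    }

    public static void main(String[] args) {
        System.out.println(score(0, 0, 0.5, 50, 100, 1.0));
        System.out.println(score(10, 10, 0.5, 50, 100, 1.0));
    }
}
```

With no signature evidence (both counts zero) the estimate falls back to the prior alone, and a larger smooth keeps it closer to the prior as counts grow.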

getSignatureIndex

public int getSignatureIndex(int wordIndex,
                             int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. Caches the last signature index returned.

Specified by:
getSignatureIndex in interface UnknownWordModel
Overrides:
getSignatureIndex in class BaseUnknownWordModel
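
The caching described above can be pictured as a single-entry cache keyed on the (wordIndex, sentencePosition) pair. This is a minimal sketch under stated assumptions, not the class's actual code: the field names follow the documented fields, but the class SignatureIndexCacheSketch, the word list, the signature-to-index map and the placeholder signatureOf function are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical single-entry cache, not the Stanford implementation. */
class SignatureIndexCacheSketch {
    // Named after the documented fields: the last key and the cached result.
    private int lastWordToSignaturize = -1;
    private int lastSentencePosition = -1;
    private int lastSignatureIndex = -1;

    private final List<String> words;                                   // stand-in word index
    private final Map<String, Integer> signatureIds = new HashMap<>();  // signature -> index
    int computations = 0;                                               // counts cache misses

    SignatureIndexCacheSketch(List<String> words) {
        this.words = words;
    }

    // Placeholder signature function (assumption): word length plus an initial-position flag.
    private String signatureOf(String word, int loc) {
        return (loc == 0 ? "UNK-INIT-" : "UNK-") + word.length();
    }

    int getSignatureIndex(int wordIndex, int sentencePosition) {
        if (wordIndex == lastWordToSignaturize && sentencePosition == lastSentencePosition) {
            return lastSignatureIndex;  // cache hit: same word at the same position
        }
        computations++;
        String sig = signatureOf(words.get(wordIndex), sentencePosition);
        Integer id = signatureIds.get(sig);
        if (id == null) {
            id = signatureIds.size();   // assign the next free index
            signatureIds.put(sig, id);
        }
        lastWordToSignaturize = wordIndex;
        lastSentencePosition = sentencePosition;
        lastSignatureIndex = id;
        return id;
    }
}
```

Because the position is part of the key, the same word looked up sentence-initially and non-initially misses the cache and may receive two different signatures, which matches the note on lastSignatureIndex.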

getSignature

public String getSignature(String word,
                           int loc)
Unknown levels 6-9 were added for Arabic. Level 6 looks for the prefix Al- (and knows that Buckwalter uses various symbols as letters), while level 7 just looks for numbers and the last letter. Level 8 looks for Al-, looks for several useful suffixes, and tracks the first letter of the word. (Note that the first letter seems a bit more informative than the last letter, overall.) Level 9 tries to build on level 8 while avoiding some of its perceived flaws: really it was using the first AND last letter.

Specified by:
getSignature in interface UnknownWordModel
Overrides:
getSignature in class BaseUnknownWordModel
Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
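
To make the feature descriptions above concrete, here is a minimal sketch of signatures in the spirit of levels 6, 7 and 8. It is not the class's actual code: the class ArabicSignatureSketch, the "UNK" prefix, the output format, the suffix list and the choice to take the first letter after any Al- prefix are all assumptions. Level 9 is omitted because the description does not say enough to reconstruct it.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hypothetical signature function, not the Stanford implementation. */
class ArabicSignatureSketch {

    // Hypothetical suffixes in Buckwalter transliteration (assumption, not the real list).
    private static final List<String> SUFFIXES = Arrays.asList("At", "wn", "yn", "p");

    static String signature(String word, int level) {
        StringBuilder sb = new StringBuilder("UNK");
        switch (level) {
            case 6: {
                // Only the Al- prefix.
                if (word.startsWith("Al")) sb.append("-Al");
                break;
            }
            case 7: {
                // Numbers and the last letter.
                if (containsDigit(word)) sb.append("-NUM");
                if (!word.isEmpty()) sb.append('-').append(word.charAt(word.length() - 1));
                break;
            }
            case 8: {
                // Al- prefix, at most one suffix, then the first letter of the remainder.
                String rest = word;
                if (word.startsWith("Al")) {
                    sb.append("-Al");
                    rest = word.substring(2);
                }
                for (String suffix : SUFFIXES) {
                    if (rest.endsWith(suffix)) {
                        sb.append('-').append(suffix);
                        break;  // leaves the for loop only
                    }
                }
                if (!rest.isEmpty()) sb.append('-').append(rest.charAt(0));
                break;
            }
            default:
                break;  // other levels are not sketched here
        }
        return sb.toString();
    }

    private static boolean containsDigit(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(signature("AlktAb", 6));
        System.out.println(signature("AlmdrsAt", 8));
    }
}
```

Each signature acts as an equivalence class: unknown words that share the same features map to the same string, so statistics can be pooled across them.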

setUnknownLevel

public void setUnknownLevel(int unknownLevel)
Description copied from interface: UnknownWordModel
One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level can be used to set the model one would like. Effects of the level will vary based on the implementing class. If a given class only includes one model, setting the unknown level should have no effect.

Specified by:
setUnknownLevel in interface UnknownWordModel
Overrides:
setUnknownLevel in class BaseUnknownWordModel
Parameters:
unknownLevel - Provides a choice between different unknown word processing schemes

getUnknownLevel

public int getUnknownLevel()
Description copied from interface: UnknownWordModel
Get the level of equivalence classing for the model.

Specified by:
getUnknownLevel in interface UnknownWordModel
Overrides:
getUnknownLevel in class BaseUnknownWordModel
Returns:
The current level of unknown word equivalence classing


Stanford NLP Group