edu.stanford.nlp.parser.lexparser
Class BaseLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseLexicon
All Implemented Interfaces:
Lexicon, Serializable
Direct Known Subclasses:
ChineseLexicon, FactoredLexicon

public class BaseLexicon
extends Object
implements Lexicon

This is the default concrete implementation of the Lexicon interface. It was originally built for Penn Treebank English.

Author:
Dan Klein, Galen Andrew, Christopher Manning
See Also:
Serialized Form

Field Summary
protected static boolean DEBUG_LEXICON
           
protected static boolean DEBUG_LEXICON_SCORE
           
protected  boolean flexiTag
           
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected static short nullTag
           
protected static int nullWord
           
 List<IntTaggedWord>[] rulesWithWord
          An array of Lists of rules (IntTaggedWord), indexed by word.
 ClassicCounter<IntTaggedWord> seenCounter
          Records the number of times word/tag pair was seen in training data.
protected  boolean smartMutation
          Have tags changeable based on statistics on word types having various taggings.
protected  int smoothInUnknownsThreshold
          If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.
protected  Numberer tagNumberer
           
protected  Set<IntTaggedWord> tags
          Set of all tags as IntTaggedWord.
protected  UnknownWordModel uwModel
           
protected  Numberer wordNumberer
           
protected  Set<IntTaggedWord> words
           
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
BaseLexicon()
           
BaseLexicon(Options.LexOptions op)
           
 
Method Summary
 void addAll(List<TaggedWord> tagWords)
          Not yet implemented.
 void addAll(List<TaggedWord> taggedWords, double weight)
          Not yet implemented.
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 double evaluateCoverage(Collection<Tree> trees, Set<String> missingWords, Set<String> missingTags, Set<IntTaggedWord> missingTW)
          Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
 int getBaseTag(int tag, TreebankLanguagePack tlp)
           
 UnknownWordModel getUnknownWordModel()
           
protected  void initRulesWithWord()
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(String word)
          Checks whether a word is in the lexicon.
protected  List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
           
protected  List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
           
static void main(String[] args)
          Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon.
 int numRules()
          Returns the number of rules (tag rewrites as word) in the Lexicon.
 void printLexStats()
          Print some statistics about this lexicon.
 void readData(BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
 Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
          Generate the possible taggings for a word at a sentence position.
 Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc)
          Returns the possible POS taggings for a word.
 float score(IntTaggedWord iTW, int loc)
          Get the score of this word with this tag (as an IntTaggedWord) at this location.
 void setTagNumberer(Numberer tagNumberer)
           
 void setUnknownWordModel(UnknownWordModel uwm)
           
 void setWordNumberer(Numberer wordNumberer)
           
protected  Numberer tagNumberer()
           
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, boolean keepTagsAsLabels)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, double weight)
           
 void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
          Trains this lexicon on the Collection of trees.
 void trainWithExpansion(Collection<TaggedWord> taggedWords)
          Not yet implemented.
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
protected  List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
           
 void tune(Collection<Tree> trees)
           
protected  Numberer wordNumberer()
           
 void writeData(Writer w)
          Writes out data from this Object to the Writer w.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uwModel

protected UnknownWordModel uwModel

DEBUG_LEXICON

protected static final boolean DEBUG_LEXICON
See Also:
Constant Field Values

DEBUG_LEXICON_SCORE

protected static final boolean DEBUG_LEXICON_SCORE
See Also:
Constant Field Values

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

smoothInUnknownsThreshold

protected int smoothInUnknownsThreshold
If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.


smartMutation

protected boolean smartMutation
Have tags changeable based on statistics on word types having various taggings.


rulesWithWord

public transient List<IntTaggedWord>[] rulesWithWord
An array of Lists of rules (IntTaggedWord), indexed by word.
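Illustrative sketch (not part of BaseLexicon): indexing rules by word means that looking up the taggings of a word is a single indexed access. The WordTag class below is a hypothetical stand-in for IntTaggedWord, and the sketch uses a List of Lists rather than a generic array to avoid unchecked-cast warnings.

```java
import java.util.ArrayList;
import java.util.List;

final class RulesByWordSketch {
    /** Hypothetical stand-in for IntTaggedWord: a (word, tag) pair of integer indices. */
    static final class WordTag {
        final int word;
        final int tag;

        WordTag(int word, int tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Groups rules by word index; word indices are assumed to lie in [0, numWords). */
    static List<List<WordTag>> indexByWord(List<WordTag> rules, int numWords) {
        List<List<WordTag>> byWord = new ArrayList<>(numWords);
        for (int w = 0; w < numWords; w++) {
            byWord.add(new ArrayList<>());
        }
        for (WordTag rule : rules) {
            byWord.get(rule.word).add(rule);
        }
        return byWord;
    }

    public static void main(String[] args) {
        List<WordTag> rules = List.of(new WordTag(0, 5), new WordTag(1, 5), new WordTag(0, 9));
        List<List<WordTag>> byWord = indexByWord(rules, 2);
        // Word 0 has two possible tags, word 1 has one.
        System.out.println(byWord.get(0).size() + " " + byWord.get(1).size()); // prints "2 1"
    }
}
```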


tags

protected transient Set<IntTaggedWord> tags
Set of all tags as IntTaggedWord. Alive in both train and runtime phases, but transient.


words

protected transient Set<IntTaggedWord> words

seenCounter

public ClassicCounter<IntTaggedWord> seenCounter
Records the number of times a word/tag pair was seen in training data. Includes word/tag pairs where one element is a wildcard rather than a real word or tag.


lastSignatureIndex

protected transient int lastSignatureIndex
Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)
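Illustrative sketch (not part of BaseLexicon): a one-entry cache of this kind could look roughly as follows. It assumes, based only on the field names lastWordToSignaturize and lastSentencePosition, that the cache key is the (word index, sentence position) pair. The class name and the computeSignature stand-in are hypothetical.

```java
import java.util.function.IntBinaryOperator;

final class LastSignatureCache {
    // Indices are assumed to be non-negative, so -1 marks "nothing cached yet".
    private int lastWord = -1;
    private int lastPosition = -1;
    private int lastSignature = -1;
    private int misses = 0; // counts how often the signature function actually ran

    // Hypothetical stand-in for the real (expensive) signature computation.
    private final IntBinaryOperator computeSignature;

    LastSignatureCache(IntBinaryOperator computeSignature) {
        this.computeSignature = computeSignature;
    }

    /** Returns the signature for (word, position), recomputing only when the key changes. */
    int signature(int word, int position) {
        if (word == lastWord && position == lastPosition) {
            return lastSignature;
        }
        misses++;
        lastSignature = computeSignature.applyAsInt(word, position);
        lastWord = word;
        lastPosition = position;
        return lastSignature;
    }

    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        // Toy signature function: sentence-initial words (position 0) get a different signature.
        LastSignatureCache cache = new LastSignatureCache((w, pos) -> pos == 0 ? 1 : 2);
        cache.signature(7, 0);
        cache.signature(7, 0); // same key: served from the cache
        cache.signature(7, 3); // same word, different position: recomputed
        System.out.println(cache.misses()); // prints "2"
    }
}
```

Because the position is part of the key, the same unknown word can receive two signatures, which matches the caveat in the note above.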


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

flexiTag

protected boolean flexiTag

tagNumberer

protected transient Numberer tagNumberer

wordNumberer

protected transient Numberer wordNumberer

Constructor Detail

BaseLexicon

public BaseLexicon()

BaseLexicon

public BaseLexicon(Options.LexOptions op)

Method Detail

isKnown

public boolean isKnown(int word)
Checks whether a word is in the lexicon. This version will compile the lexicon into the rulesWithWord array, if that hasn't already happened.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int index to a Numberer
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(String word)
Checks whether a word is in the lexicon. This version works even while the lexicon is still being compiled, because it uses the current counters rather than the compiled rulesWithWord array.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

public Iterator<IntTaggedWord> ruleIteratorByWord(String word,
                                                  int loc)
Returns the possible POS taggings for a word.

Parameters:
word - The word as a String
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)

ruleIteratorByWord

public Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                  int loc,
                                                  String featureSpec)
Generate the possible taggings for a word at a sentence position. This may either be based on a strict lexicon or an expanded generous set of possible taggings.

Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).

Specified by:
ruleIteratorByWord in interface Lexicon
Parameters:
word - The word (as an int)
loc - Its index in the sentence (usually only relevant for unknown words)
featureSpec - Additional word features like morphosyntactic information.
Returns:
An Iterator over the possible taggings

initRulesWithWord

protected void initRulesWithWord()

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree,
                                           boolean keepTagsAsLabels)

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

listToEvents

protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)

listOfLabeledWordsToEvents

protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)

addAll

public void addAll(List<TaggedWord> tagWords)
Not yet implemented.


addAll

public void addAll(List<TaggedWord> taggedWords,
                   double weight)
Not yet implemented.


trainWithExpansion

public void trainWithExpansion(Collection<TaggedWord> taggedWords)
Not yet implemented.


train

public void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon
Parameters:
trees - Trees to train on

train

public void train(Collection<Tree> trees,
                  boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees.


train

public void train(Collection<Tree> trees,
                  double weight)

train

public void train(Collection<Tree> trees,
                  double weight,
                  boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees. Also trains the unknown word model pointed to by this lexicon.


addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.


score

public float score(IntTaggedWord iTW,
                   int loc)
Get the score of this word with this tag (as an IntTaggedWord) at this location. (Presumably an estimate of P(word | tag).)

Implementation documentation:

Seen:
  c_W = count(W)
  c_TW = count(T,W)
  c_T = count(T)
  c_Tunseen = count(T) among new words in 2nd half
  total = count(seen words)
  totalUnseen = count("unseen" words)
  p_T_U = Pmle(T|"unseen")
  pb_T_W = P(T|W):
    If (c_W > smoothInUnknownsThreshold): pb_T_W = c_TW/c_W
    Else (if not smart mutation): pb_T_W = bayes prior smooth[1] with p_T_U
  p_T = Pmle(T)
  p_W = Pmle(W)
  pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
  Note that this doesn't really properly reserve mass to unknowns.

Unseen:
  c_TS = count(T,Sig|Unseen)
  c_S = count(Sig)
  c_T = count(T|Unseen)
  c_U = totalUnseen above
  p_T_U = Pmle(T|Unseen)
  pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
  pb_W_T = log(P(W|T)) inverted

Specified by:
score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words, to change their probability distribution when they are sentence-initial.
Returns:
A float score, usually log P(word|tag)
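Illustrative sketch (not the BaseLexicon code): the seen-word branch of the implementation documentation can be read as the arithmetic below. The additive form used for "bayes prior smooth[1] with p_T_U" is an assumption, since the documentation names the smoothing but does not spell out the formula. The class, method, and parameter names are hypothetical, and the smart-mutation and unseen-word branches are omitted.

```java
final class SeenWordScoreSketch {
    /**
     * Estimates log P(W|T) for a seen word from raw counts.
     * smooth plays the role of smooth[1]; threshold plays the role of smoothInUnknownsThreshold.
     */
    static double logProbWordGivenTag(double cTW, double cW, double cT, double cTunseen,
                                      double total, double totalUnseen,
                                      double threshold, double smooth) {
        double pTU = cTunseen / totalUnseen; // p_T_U = Pmle(T | "unseen")
        double pbTW;
        if (cW > threshold) {
            pbTW = cTW / cW; // frequent word: plain relative frequency
        } else {
            // ASSUMPTION: additive smoothing toward the tag prior p_T_U.
            pbTW = (cTW + smooth * pTU) / (cW + smooth);
        }
        double pT = cT / total; // p_T = Pmle(T)
        double pW = cW / total; // p_W = Pmle(W)
        return Math.log(pbTW * pW / pT); // Bayes rule: P(W|T) = P(T|W) P(W) / P(T)
    }

    public static void main(String[] args) {
        // Frequent word: c_W = 10 exceeds the threshold of 5, so no smoothing is applied.
        System.out.println(logProbWordGivenTag(8, 10, 100, 20, 1000, 100, 5, 1));
        // Rare word: c_W = 2 is below the threshold, so P(T|W) is pulled toward p_T_U.
        System.out.println(logProbWordGivenTag(1, 2, 100, 20, 1000, 100, 5, 1));
    }
}
```

In the rare-word call, the unsmoothed estimate 1/2 becomes (1 + 0.2) / (2 + 1) = 0.4 under the assumed formula, which shows how the prior tempers estimates built from very few observations.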

tune

public void tune(Collection<Tree> trees)

readData

public void readData(BufferedReader in)
              throws IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in.

Specified by:
readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
IOException - If any I/O problem

writeData

public void writeData(Writer w)
               throws IOException
Writes out data from this Object to the Writer w. Rules are separated by newline, and rule elements are delimited by \t.

Specified by:
writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
IOException - If any I/O problem
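Illustrative sketch (not the BaseLexicon file format): the layout described above, one rule per line with tab-delimited elements, can be written and read back as follows. Which elements a rule contains is not stated here, so the (tag, word, count) triples in the example are purely hypothetical, as are the class and method names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

final class TabDelimitedRulesSketch {
    /** Writes one rule per line, with the rule's elements separated by tabs. */
    static void writeRules(List<String[]> rules, Writer w) throws IOException {
        for (String[] rule : rules) {
            w.write(String.join("\t", rule));
            w.write('\n');
        }
        w.flush();
    }

    /** Reads rules back, skipping empty lines and splitting each line on tabs. */
    static List<String[]> readRules(BufferedReader in) throws IOException {
        List<String[]> rules = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            rules.add(line.split("\t"));
        }
        return rules;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rules = new ArrayList<>();
        rules.add(new String[] {"NN", "dog", "12.0"}); // hypothetical rule elements
        rules.add(new String[] {"VB", "run", "3.0"});
        StringWriter out = new StringWriter();
        writeRules(rules, out);
        List<String[]> back = readRules(new BufferedReader(new StringReader(out.toString())));
        System.out.println(back.size() + " rules, first word: " + back.get(0)[1]);
    }
}
```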

numRules

public int numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.

Specified by:
numRules in interface Lexicon
Returns:
The number of rules (tag rewrites as word) in the Lexicon.

printLexStats

public void printLexStats()
Print some statistics about this lexicon.


evaluateCoverage

public double evaluateCoverage(Collection<Tree> trees,
                               Set<String> missingWords,
                               Set<String> missingTags,
                               Set<IntTaggedWord> missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. The first argument is the collection of trees; the second through fourth arguments receive the results. Currently unused; this probably only works if training and testing happen in the same run, so that the tags and words variables are initialized.
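Illustrative sketch (not the BaseLexicon code): the calling convention described above, where the caller passes in empty sets that the method fills with whatever was missing, can be shown on plain word tokens. The sketch assumes the returned double is the covered fraction, which the documentation does not state. It tracks only missing words, and all names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CoverageSketch {
    /**
     * Returns the fraction of tokens found in knownWords and adds every
     * uncovered token to missingWords, which the caller supplies.
     */
    static double coverage(List<String> tokens, Set<String> knownWords, Set<String> missingWords) {
        if (tokens.isEmpty()) {
            return 0.0; // design choice for the sketch: avoid dividing by zero
        }
        int covered = 0;
        for (String token : tokens) {
            if (knownWords.contains(token)) {
                covered++;
            } else {
                missingWords.add(token);
            }
        }
        return (double) covered / tokens.size();
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(List.of("the", "dog"));
        Set<String> missing = new HashSet<>();
        double fraction = coverage(List.of("the", "dog", "barks", "the"), known, missing);
        System.out.println(fraction + " " + missing); // prints "0.75 [barks]"
    }
}
```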


getBaseTag

public int getBaseTag(int tag,
                      TreebankLanguagePack tlp)

main

public static void main(String[] args)
Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. What's here currently probably only works for the English Penn Treebank, as it uses default constructors. Of the words given to test on, the first is treated as sentence-initial, and the rest as not sentence-initial.

Parameters:
args - The command line arguments: java BaseLexicon treebankPath fileRange unknownWordModel words*

tagNumberer

protected Numberer tagNumberer()

wordNumberer

protected Numberer wordNumberer()

setWordNumberer

public void setWordNumberer(Numberer wordNumberer)

setTagNumberer

public void setTagNumberer(Numberer tagNumberer)

getUnknownWordModel

public UnknownWordModel getUnknownWordModel()
Specified by:
getUnknownWordModel in interface Lexicon

setUnknownWordModel

public final void setUnknownWordModel(UnknownWordModel uwm)
Specified by:
setUnknownWordModel in interface Lexicon


Stanford NLP Group