edu.stanford.nlp.parser.lexparser
Class BaseLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseLexicon
All Implemented Interfaces:
Lexicon, Serializable

public class BaseLexicon
extends Object
implements Lexicon

This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.

Author:
Dan Klein, Galen Andrew, Christopher Manning
See Also:
Serialized Form

Field Summary
protected static boolean DEBUG_LEXICON
           
protected static boolean DEBUG_LEXICON_SCORE
           
protected  boolean flexiTag
           
protected static IntTaggedWord NULL_ITW
           
protected static short nullTag
           
protected static int nullWord
           
protected  Options op
           
 List<IntTaggedWord>[] rulesWithWord
          An array of Lists of rules (IntTaggedWord), indexed by word.
 ClassicCounter<IntTaggedWord> seenCounter
          Records the number of times word/tag pair was seen in training data.
protected  boolean smartMutation
          Have tags changeable based on statistics on word types having various taggings.
protected  int smoothInUnknownsThreshold
          If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.
protected  Index<String> tagIndex
           
protected  Set<IntTaggedWord> tags
          Set of all tags as IntTaggedWord.
protected  TestOptions testOptions
           
protected  TrainOptions trainOptions
           
protected  boolean useSignatureForKnownSmoothing
           
protected  UnknownWordModel uwModel
           
protected  UnknownWordModelTrainer uwModelTrainer
           
protected  String uwModelTrainerClass
           
protected  Index<String> wordIndex
           
protected  Set<IntTaggedWord> words
           
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
BaseLexicon(Index<String> wordIndex, Index<String> tagIndex)
           
BaseLexicon(Options op, Index<String> wordIndex, Index<String> tagIndex)
           
 
Method Summary
 void addAll(List<TaggedWord> tagWords)
          Not yet implemented.
 void addAll(List<TaggedWord> taggedWords, double weight)
          Not yet implemented.
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 double evaluateCoverage(Collection<Tree> trees, Set<String> missingWords, Set<String> missingTags, Set<IntTaggedWord> missingTW)
          Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
protected  void examineIntersection(Set<String> s1, Set<String> s2)
           
 void finishTraining()
          Done collecting statistics for the lexicon.
 int getBaseTag(int tag, TreebankLanguagePack tlp)
           
 UnknownWordModel getUnknownWordModel()
           
 void initializeTraining(double numTrees)
          Start training this lexicon on the expected number of trees.
protected  void initRulesWithWord()
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(String word)
          Checks whether a word is in the lexicon.
protected  List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
           
static void main(String[] args)
          Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon.
 int numRules()
          Returns the number of rules (tag rewrites as word) in the Lexicon.
 void printLexStats()
          Print some statistics about this lexicon.
 void readData(BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
 Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
          Generate the possible taggings for a word at a sentence position.
 Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc)
          Returns the possible POS taggings for a word.
 Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc, String featureSpec)
          Same as ruleIteratorByWord(int, int, String), but takes the word as a String that is translated by the lexicon's word index.
 float score(IntTaggedWord iTW, int loc, String word, String featureSpec)
          Get the score of this word with this tag (as an IntTaggedWord) at this location.
 void setUnknownWordModel(UnknownWordModel uwm)
           
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, Collection<Tree> rawTrees)
           
 void train(Collection<Tree> trees, double weight)
          Trains this lexicon on the Collection of trees.
 void train(List<TaggedWord> sentence, double weight)
          Not all subclasses support this particular method.
 void train(TaggedWord tw, int loc, double weight)
          Not all subclasses support this particular method.
 void train(Tree tree, double weight)
           
 void trainUnannotated(List<TaggedWord> sentence, double weight)
          Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
 void trainWithExpansion(Collection<TaggedWord> taggedWords)
          Not yet implemented.
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
 void tune()
          TODO: this used to actually score things based on the original trees
 void writeData(Writer w)
          Writes out data from this Object to the Writer w.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uwModel

protected UnknownWordModel uwModel

uwModelTrainerClass

protected final String uwModelTrainerClass

uwModelTrainer

protected transient UnknownWordModelTrainer uwModelTrainer

DEBUG_LEXICON

protected static final boolean DEBUG_LEXICON
See Also:
Constant Field Values

DEBUG_LEXICON_SCORE

protected static final boolean DEBUG_LEXICON_SCORE
See Also:
Constant Field Values

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

NULL_ITW

protected static final IntTaggedWord NULL_ITW

trainOptions

protected final TrainOptions trainOptions

testOptions

protected final TestOptions testOptions

op

protected final Options op

smoothInUnknownsThreshold

protected int smoothInUnknownsThreshold
If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.


smartMutation

protected boolean smartMutation
Have tags changeable based on statistics on word types having various taggings.


wordIndex

protected final Index<String> wordIndex

tagIndex

protected final Index<String> tagIndex

rulesWithWord

public transient List<IntTaggedWord>[] rulesWithWord
An array of Lists of rules (IntTaggedWord), indexed by word.
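
The "array of Lists, indexed by word" layout can be illustrated with a minimal sketch. It uses plain int ids instead of IntTaggedWord, and the class and method names (RulesByWordSketch, index) are hypothetical; it is not the BaseLexicon implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class RulesByWordSketch {
    /**
     * Builds an array of tag-id lists indexed by word id.
     * Each input pair is {wordId, tagId}; this stands in for an IntTaggedWord.
     */
    @SuppressWarnings("unchecked")
    static List<Integer>[] index(int numWords, int[][] wordTagPairs) {
        // Generic arrays cannot be created directly, hence the raw List[] allocation.
        List<Integer>[] rulesWithWord = new List[numWords];
        for (int w = 0; w < numWords; w++) {
            rulesWithWord[w] = new ArrayList<>();
        }
        for (int[] pair : wordTagPairs) {
            rulesWithWord[pair[0]].add(pair[1]);
        }
        return rulesWithWord;
    }
}
```

With this layout, finding the candidate tags for a word id is a single array access, and a word with no rules maps to an empty list.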


tags

protected transient Set<IntTaggedWord> tags
Set of all tags as IntTaggedWord. Alive in both train and runtime phases, but transient.


words

protected transient Set<IntTaggedWord> words

seenCounter

public ClassicCounter<IntTaggedWord> seenCounter
Records the number of times word/tag pair was seen in training data. Includes word/tag pairs where one is a wildcard not a real word/tag.
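
One way to picture a counter that also holds wildcard pairs is sketched below. This is an assumption about how such counts could be kept, not the actual seenCounter code: the WildcardCounterSketch class, the "*" wildcard marker, and the string keys are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class WildcardCounterSketch {
    /** Hypothetical wildcard marker; a real word or tag spelled "*" would collide with it. */
    static final String ANY = "*";

    private final Map<String, Double> counts = new HashMap<>();

    private static String key(String word, String tag) {
        return word + "\t" + tag;
    }

    /** Records one tagging under the exact pair and under each wildcard variant. */
    void addTagging(String word, String tag, double count) {
        counts.merge(key(word, tag), count, Double::sum); // count(T,W)
        counts.merge(key(word, ANY), count, Double::sum); // count(W)
        counts.merge(key(ANY, tag), count, Double::sum);  // count(T)
        counts.merge(key(ANY, ANY), count, Double::sum);  // total
    }

    double getCount(String word, String tag) {
        return counts.getOrDefault(key(word, tag), 0.0);
    }
}
```

Keeping the marginal counts in the same counter means count(W), count(T), and the total are available by lookup rather than by summing over all pairs.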


flexiTag

protected boolean flexiTag

useSignatureForKnownSmoothing

protected boolean useSignatureForKnownSmoothing
Constructor Detail

BaseLexicon

public BaseLexicon(Index<String> wordIndex,
                   Index<String> tagIndex)

BaseLexicon

public BaseLexicon(Options op,
                   Index<String> wordIndex,
                   Index<String> tagIndex)
Method Detail

isKnown

public boolean isKnown(int word)
Checks whether a word is in the lexicon. This version will compile the lexicon into the rulesWithWord array, if that hasn't already happened.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int index to an Index
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(String word)
Checks whether a word is in the lexicon. This version works even while compiling lexicon with current counters (rather than using the compiled rulesWithWord array). TODO: The previous version would insert rules into the wordNumberer. Is that the desired behavior? Why not test in some way that doesn't affect the index? For example, start by testing wordIndex.contains(word).

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon
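
The concern raised in the TODO, that a lookup should not grow the index as a side effect, can be sketched with a toy index. The NonMutatingLookupSketch class below is hypothetical and backed by a plain HashMap; it does not use the Stanford Index type.

```java
import java.util.HashMap;
import java.util.Map;

final class NonMutatingLookupSketch {
    private final Map<String, Integer> ids = new HashMap<>();

    /** Mutating lookup: assigns a fresh id when the word is new. */
    int addToIndex(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size();
            ids.put(word, id);
        }
        return id;
    }

    /** Read-only membership test: never changes the index. */
    boolean isKnown(String word) {
        return ids.containsKey(word);
    }

    int size() {
        return ids.size();
    }
}
```

Querying an unknown word through isKnown leaves size() unchanged, whereas routing the same query through addToIndex would silently add an entry.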

ruleIteratorByWord

public Iterator<IntTaggedWord> ruleIteratorByWord(String word,
                                                  int loc)
Returns the possible POS taggings for a word.

Parameters:
word - The word, as a String
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)

ruleIteratorByWord

public Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                  int loc,
                                                  String featureSpec)
Generate the possible taggings for a word at a sentence position. This may either be based on a strict lexicon or an expanded generous set of possible taggings.

Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).

Specified by:
ruleIteratorByWord in interface Lexicon
Parameters:
word - The word (as an int)
loc - Its index in the sentence (usually only relevant for unknown words)
featureSpec - Additional word features like morphosyntactic information.
Returns:
A list of possible taggings

ruleIteratorByWord

public Iterator<IntTaggedWord> ruleIteratorByWord(String word,
                                                  int loc,
                                                  String featureSpec)
Description copied from interface: Lexicon
Same as ruleIteratorByWord(int, int, String), but takes the word as a String that is translated by the lexicon's word index.

Specified by:
ruleIteratorByWord in interface Lexicon

initRulesWithWord

protected void initRulesWithWord()

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

listToEvents

protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)

addAll

public void addAll(List<TaggedWord> tagWords)
Not yet implemented.


addAll

public void addAll(List<TaggedWord> taggedWords,
                   double weight)
Not yet implemented.


trainWithExpansion

public void trainWithExpansion(Collection<TaggedWord> taggedWords)
Not yet implemented.


initializeTraining

public void initializeTraining(double numTrees)
Description copied from interface: Lexicon
Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)

Specified by:
initializeTraining in interface Lexicon

train

public void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon
Parameters:
trees - Trees to train on

train

public void train(Collection<Tree> trees,
                  double weight)
Trains this lexicon on the Collection of trees. Also trains the unknown word model pointed to by this lexicon.

Specified by:
train in interface Lexicon

train

public void train(Tree tree,
                  double weight)
Specified by:
train in interface Lexicon

train

public final void train(List<TaggedWord> sentence,
                        double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method. Those that don't will fail when it is called.

Specified by:
train in interface Lexicon

trainUnannotated

public final void trainUnannotated(List<TaggedWord> sentence,
                                   double weight)
Description copied from interface: Lexicon
Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.

Specified by:
trainUnannotated in interface Lexicon

train

public void train(TaggedWord tw,
                  int loc,
                  double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method. Those that don't will fail when it is called.

Specified by:
train in interface Lexicon

finishTraining

public void finishTraining()
Description copied from interface: Lexicon
Done collecting statistics for the lexicon.

Specified by:
finishTraining in interface Lexicon

addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.


score

public float score(IntTaggedWord iTW,
                   int loc,
                   String word,
                   String featureSpec)
Get the score of this word with this tag (as an IntTaggedWord) at this location. (Presumably an estimate of P(word | tag).)

Implementation documentation:

Seen:
  c_W = count(W)
  c_TW = count(T,W)
  c_T = count(T)
  c_Tunseen = count(T) among new words in 2nd half
  total = count(seen words)
  totalUnseen = count("unseen" words)
  p_T_U = Pmle(T|"unseen")
  pb_T_W = P(T|W).
    If (c_W > smoothInUnknownsThreshold), pb_T_W = c_TW/c_W
    Else (if not smart mutation), pb_T_W = Bayes prior smooth[1] with p_T_U
  p_T = Pmle(T)
  p_W = Pmle(W)
  pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
  Note that this doesn't really properly reserve mass to unknowns.

Unseen:
  c_TS = count(T,Sig|Unseen)
  c_S = count(Sig)
  c_T = count(T|Unseen)
  c_U = totalUnseen above
  p_T_U = Pmle(T|Unseen)
  pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
  pb_W_T = log(P(W|T)) inverted

Specified by:
score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial
word - The word itself; useful so we don't have to look it up in an index
featureSpec - TODO
Returns:
A float score, usually log P(word|tag)
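
The seen-word branch of the computation described above can be sketched as follows. The sketch assumes that "Bayes prior smooth" means additive smoothing of the form (c_TW + s * p_T_U) / (c_W + s), with s standing in for smooth[1]. That form, the SeenWordScoreSketch class, and its parameter names are assumptions for illustration, and the smart mutation and unseen-word branches are left out.

```java
final class SeenWordScoreSketch {
    /**
     * Returns an estimate of log P(word | tag) for a word seen in training.
     *
     * cTW, cW, cT, total are the counts c_TW, c_W, c_T and total from the notes above;
     * pTU is p_T_U = Pmle(T | "unseen"); smoothWeight plays the role of smooth[1].
     */
    static double logProbWordGivenTag(double cTW, double cW, double cT, double total,
                                      double pTU, int smoothInUnknownsThreshold,
                                      double smoothWeight) {
        double pbTW; // P(T | W)
        if (cW > smoothInUnknownsThreshold) {
            // Frequent word: use the relative frequency of the tag directly.
            pbTW = cTW / cW;
        } else {
            // Rare word: smooth toward the tag distribution of unseen words
            // (assumed additive form, see the lead-in).
            pbTW = (cTW + smoothWeight * pTU) / (cW + smoothWeight);
        }
        double pT = cT / total; // Pmle(T)
        double pW = cW / total; // Pmle(W)
        // Bayes rule: P(W | T) = P(T | W) * P(W) / P(T)
        return Math.log(pbTW * pW / pT);
    }
}
```

For example, with c_TW = 8, c_W = 10, c_T = 100, total = 1000 and a threshold of 5, the word counts as frequent, so P(T|W) = 0.8 and the result is log(0.8 * 0.01 / 0.1).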

tune

public final void tune()
TODO: this used to actually score things based on the original trees


readData

public void readData(BufferedReader in)
              throws IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in. TODO: this doesn't appear to correctly read in the UnknownWordModel in the case of a model more complicated than the unSeenCounter.

Specified by:
readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
IOException - If any I/O problem

writeData

public void writeData(Writer w)
               throws IOException
Writes out data from this Object to the Writer w. Rules are separated by newline, and rule elements are delimited by \t.

Specified by:
writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
IOException - If any I/O problem
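
A newline-separated, tab-delimited rule format of the kind described here can be sketched as a write/read round trip. The three-field layout (tag, word, count) and the TabDelimitedRulesSketch class are hypothetical; the actual fields written by writeData are not specified on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

final class TabDelimitedRulesSketch {
    /** Hypothetical rule record; the field layout is an assumption. */
    static final class Rule {
        final String tag;
        final String word;
        final double count;

        Rule(String tag, String word, double count) {
            this.tag = tag;
            this.word = word;
            this.count = count;
        }
    }

    /** Writes one rule per line, with fields separated by tabs. */
    static void writeRules(List<Rule> rules, Writer w) throws IOException {
        for (Rule r : rules) {
            w.write(r.tag + "\t" + r.word + "\t" + r.count + "\n");
        }
        w.flush();
    }

    /** Reads rules back, rejecting lines that do not have exactly three fields. */
    static List<Rule> readRules(BufferedReader in) throws IOException {
        List<Rule> rules = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 3) {
                throw new IOException("Malformed rule line: " + line);
            }
            rules.add(new Rule(parts[0], parts[1], Double.parseDouble(parts[2])));
        }
        return rules;
    }

    /** Convenience helper: writes to an in-memory buffer and reads the result back. */
    static List<Rule> roundTrip(List<Rule> rules) {
        try {
            StringWriter buffer = new StringWriter();
            writeRules(rules, buffer);
            return readRules(new BufferedReader(new StringReader(buffer.toString())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A tab delimiter only works if the fields themselves never contain tabs or newlines; this sketch does no escaping.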

numRules

public int numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.

Specified by:
numRules in interface Lexicon
Returns:
The number of rules (tag rewrites as word) in the Lexicon.

examineIntersection

protected void examineIntersection(Set<String> s1,
                                   Set<String> s2)

printLexStats

public void printLexStats()
Print some statistics about this lexicon.


evaluateCoverage

public double evaluateCoverage(Collection<Tree> trees,
                               Set<String> missingWords,
                               Set<String> missingTags,
                               Set<IntTaggedWord> missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. The first argument is the collection of trees; the second through fourth arguments receive the results. Currently unused; this probably only works if training and testing happen in the same run, so that the tags and words variables are initialized.
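
The pattern of returning a coverage figure while filling caller-supplied "missing" sets can be sketched as below. The sketch assumes the returned double is the fraction of terminals whose word and tag are both known, which this page does not confirm. It uses {word, tag} String pairs instead of Trees and omits the missingTW set; the CoverageSketch class is hypothetical.

```java
import java.util.List;
import java.util.Set;

final class CoverageSketch {
    /**
     * Returns the fraction of tagged terminals whose word and tag are both known.
     * Unknown words and tags are added to the caller-supplied output sets.
     * Each terminal is a {word, tag} pair.
     */
    static double coverage(List<String[]> taggedTerminals,
                           Set<String> knownWords, Set<String> knownTags,
                           Set<String> missingWords, Set<String> missingTags) {
        if (taggedTerminals.isEmpty()) {
            return 0.0; // nothing to cover; avoids dividing by zero
        }
        int covered = 0;
        for (String[] wordTag : taggedTerminals) {
            boolean wordKnown = knownWords.contains(wordTag[0]);
            boolean tagKnown = knownTags.contains(wordTag[1]);
            if (!wordKnown) {
                missingWords.add(wordTag[0]);
            }
            if (!tagKnown) {
                missingTags.add(wordTag[1]);
            }
            if (wordKnown && tagKnown) {
                covered++;
            }
        }
        return (double) covered / taggedTerminals.size();
    }
}
```

Passing empty sets for the output arguments lets the caller inspect afterwards exactly which words and tags were not covered.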


getBaseTag

public int getBaseTag(int tag,
                      TreebankLanguagePack tlp)

main

public static void main(String[] args)
Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. What's here currently probably only works for the English Penn Treebank, as it uses default constructors. Of the words given to test on, the first is treated as sentence initial, and the rest as not sentence initial.

Parameters:
args - The command line arguments: java BaseLexicon treebankPath fileRange unknownWordModel words*

getUnknownWordModel

public UnknownWordModel getUnknownWordModel()
Specified by:
getUnknownWordModel in interface Lexicon

setUnknownWordModel

public final void setUnknownWordModel(UnknownWordModel uwm)
Specified by:
setUnknownWordModel in interface Lexicon

train

public void train(Collection<Tree> trees,
                  Collection<Tree> rawTrees)
Specified by:
train in interface Lexicon


Stanford NLP Group