BaseLexicon (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.BaseLexicon

All Implemented Interfaces:

Lexicon, java.io.Serializable

Direct Known Subclasses:

ChineseLexicon, FactoredLexicon
```
public class BaseLexicon
extends java.lang.Object
implements Lexicon
```
This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.

Author:

Dan Klein, Galen Andrew, Christopher Manning

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected static boolean`	`DEBUG_LEXICON`
`protected static boolean`	`DEBUG_LEXICON_SCORE`
`protected boolean`	`flexiTag`
`protected static IntTaggedWord`	`NULL_ITW`
`protected static short`	`nullTag`
`protected static int`	`nullWord`
`protected Options`	`op`
`java.util.List<IntTaggedWord>[]`	`rulesWithWord` An array of Lists of rules (IntTaggedWord), indexed by word.
`ClassicCounter<IntTaggedWord>`	`seenCounter` Records the number of times word/tag pair was seen in training data.
`protected boolean`	`smartMutation` Have tags changeable based on statistics on word types having various taggings.
`protected int`	`smoothInUnknownsThreshold` If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.
`protected Index<java.lang.String>`	`tagIndex`
`protected java.util.Set<IntTaggedWord>`	`tags` Set of all tags as IntTaggedWord.
`protected TestOptions`	`testOptions`
`protected TrainOptions`	`trainOptions`
`protected boolean`	`useSignatureForKnownSmoothing`
`protected UnknownWordModel`	`uwModel`
`protected UnknownWordModelTrainer`	`uwModelTrainer`
`protected java.lang.String`	`uwModelTrainerClass`
`protected Index<java.lang.String>`	`wordIndex`
`protected java.util.Set<IntTaggedWord>`	`words`

Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD

Constructor Summary

Constructors
Constructor and Description
`BaseLexicon(Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)`
`BaseLexicon(Options op, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)`

Method Summary

Modifier and Type	Method and Description
`void`	`addAll(java.util.List<TaggedWord> tagWords)` Not yet implemented.
`void`	`addAll(java.util.List<TaggedWord> taggedWords, double weight)` Not yet implemented.
`protected void`	`addTagging(boolean seen, IntTaggedWord itw, double count)` Adds the tagging with count to the data structures in this Lexicon.
`double`	`evaluateCoverage(java.util.Collection<Tree> trees, java.util.Set<java.lang.String> missingWords, java.util.Set<java.lang.String> missingTags, java.util.Set<IntTaggedWord> missingTW)` Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
`protected static void`	`examineIntersection(java.util.Set<java.lang.String> s1, java.util.Set<java.lang.String> s2)`
`void`	`finishTraining()` Done collecting statistics for the lexicon.
`int`	`getBaseTag(int tag, TreebankLanguagePack tlp)`
`UnknownWordModel`	`getUnknownWordModel()`
`void`	`incrementTreesRead(double weight)` If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
`void`	`initializeTraining(double numTrees)` Start training this lexicon on the expected number of trees.
`protected void`	`initRulesWithWord()`
`boolean`	`isKnown(int word)` Checks whether a word is in the lexicon.
`boolean`	`isKnown(java.lang.String word)` Checks whether a word is in the lexicon.
`protected java.util.List<IntTaggedWord>`	`listToEvents(java.util.List<TaggedWord> taggedWords)`
`static void`	`main(java.lang.String[] args)` Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon.
`int`	`numRules()` Returns the number of rules (tag rewrites as word) in the Lexicon.
`void`	`printLexStats()` Print some statistics about this lexicon.
`void`	`readData(java.io.BufferedReader in)` Populates data in this Lexicon from the character stream given by the BufferedReader in.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)` Generate the possible taggings for a word at a sentence position.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(java.lang.String word, int loc)` Returns the possible POS taggings for a word.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)` Same thing, but with a string that needs to be translated by the lexicon's word index.
`float`	`score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)` Get the score of this word with this tag (as an IntTaggedWord) at this location.
`void`	`setUnknownWordModel(UnknownWordModel uwm)`
`java.util.Set<java.lang.String>`	`tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)` Return the Set of tags used by this tagger (available after training the tagger).
`void`	`train(java.util.Collection<Tree> trees)` Trains this lexicon on the Collection of trees.
`void`	`train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees)`
`void`	`train(java.util.Collection<Tree> trees, double weight)` Trains this lexicon on the Collection of trees.
`void`	`train(java.util.List<TaggedWord> sentence, double weight)` Not all subclasses support this particular method.
`void`	`train(TaggedWord tw, int loc, double weight)` Not all subclasses support this particular method.
`void`	`train(Tree tree, double weight)`
`void`	`trainUnannotated(java.util.List<TaggedWord> sentence, double weight)` Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
`void`	`trainWithExpansion(java.util.Collection<TaggedWord> taggedWords)` Not yet implemented.
`protected java.util.List<IntTaggedWord>`	`treeToEvents(Tree tree)`
`void`	`tune()` TODO: this used to actually score things based on the original trees
`void`	`writeData(java.io.Writer w)` Writes out data from this Object to the Writer w.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - uwModel
```
protected UnknownWordModel uwModel
```
  - uwModelTrainerClass
```
protected final java.lang.String uwModelTrainerClass
```
  - uwModelTrainer
```
protected transient UnknownWordModelTrainer uwModelTrainer
```
  - DEBUG_LEXICON
```
protected static final boolean DEBUG_LEXICON
```
    See Also:
    
    Constant Field Values
  - DEBUG_LEXICON_SCORE
```
protected static final boolean DEBUG_LEXICON_SCORE
```
    See Also:
    
    Constant Field Values
  - nullWord
```
protected static final int nullWord
```
    See Also:
    
    Constant Field Values
  - nullTag
```
protected static final short nullTag
```
    See Also:
    
    Constant Field Values
  - NULL_ITW
```
protected static final IntTaggedWord NULL_ITW
```
  - trainOptions
```
protected final TrainOptions trainOptions
```
  - testOptions
```
protected final TestOptions testOptions
```
  - op
```
protected final Options op
```
  - smoothInUnknownsThreshold
```
protected int smoothInUnknownsThreshold
```
    If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors.
  - smartMutation
```
protected boolean smartMutation
```
    Have tags changeable based on statistics on word types having various taggings.
  - wordIndex
```
protected final Index<java.lang.String> wordIndex
```
  - tagIndex
```
protected final Index<java.lang.String> tagIndex
```
  - rulesWithWord
```
public transient java.util.List<IntTaggedWord>[] rulesWithWord
```
    An array of Lists of rules (IntTaggedWord), indexed by word.
  - tags
```
protected transient java.util.Set<IntTaggedWord> tags
```
    Set of all tags as IntTaggedWord. Alive in both train and runtime phases, but transient.
  - words
```
protected transient java.util.Set<IntTaggedWord> words
```
  - seenCounter
```
public ClassicCounter<IntTaggedWord> seenCounter
```
    Records the number of times word/tag pair was seen in training data. Includes word/tag pairs where one is a wildcard not a real word/tag.
  - flexiTag
```
protected boolean flexiTag
```
  - useSignatureForKnownSmoothing
```
protected boolean useSignatureForKnownSmoothing
```
- Constructor Detail
  - BaseLexicon
```
public BaseLexicon(Index<java.lang.String> wordIndex,
                   Index<java.lang.String> tagIndex)
```
  - BaseLexicon
```
public BaseLexicon(Options op,
                   Index<java.lang.String> wordIndex,
                   Index<java.lang.String> tagIndex)
```
- Method Detail
  - isKnown
```
public boolean isKnown(int word)
```
    Checks whether a word is in the lexicon. This version will compile the lexicon into the rulesWithWord array, if that hasn't already happened.
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as an int index to an Index
    
    Returns:
    
    Whether the word is in the lexicon
  - isKnown
```
public boolean isKnown(java.lang.String word)
```
    Checks whether a word is in the lexicon. This version works even while compiling lexicon with current counters (rather than using the compiled rulesWithWord array). TODO: The previous version would insert rules into the wordNumberer. Is that the desired behavior? Why not test in some way that doesn't affect the index? For example, start by testing wordIndex.contains(word).
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as a String
    
    Returns:
    
    Whether the word is in the lexicon
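    The TODO above asks for a membership test that does not insert the queried word into the index. A minimal sketch of that guard follows. `SimpleIndex` and `IsKnownSketch` are hypothetical stand-ins written for this example, not the library's `Index` or `BaseLexicon`, and a real implementation would presumably also check the lexicon's counts after this guard.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a string-to-int index; not the library's Index class.
final class SimpleIndex {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the id of the word, assigning a new id if the word is absent.
    int addToIndex(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size();
            ids.put(word, id);
        }
        return id;
    }

    // Read-only membership test: never changes the index.
    boolean contains(String word) {
        return ids.containsKey(word);
    }

    int size() {
        return ids.size();
    }
}

final class IsKnownSketch {
    // Checks membership without inserting the queried word as a side effect.
    static boolean isKnown(SimpleIndex wordIndex, String word) {
        return wordIndex.contains(word);
    }
}
```

    Querying an unseen word through `isKnown` leaves the size of the index unchanged, which is the property the TODO is after.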
  - tagSet
```
public java.util.Set<java.lang.String> tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)
```
    Return the Set of tags used by this tagger (available after training the tagger).
    
    Specified by:
    
    tagSet in interface Lexicon
    
    Returns:
    
    The Set of tags used by this tagger
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word,
                                                            int loc)
```
    Returns the possible POS taggings for a word.
    
    Parameters:
    
    word - The word, as a String (it is translated to an integer via wordIndex)
    
    loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
    
    Returns:
    
    An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Generate the possible taggings for a word at a sentence position. This may either be based on a strict lexicon or an expanded generous set of possible taggings.
    Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
    
    Parameters:
    
    word - The word (as an int)
    
    loc - Its index in the sentence (usually only relevant for unknown words)
    
    featureSpec - Additional word features like morphosyntactic information.
    
    Returns:
    
    A list of possible taggings
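    To illustrate the "rules indexed by word" layout that this lookup relies on, here is a minimal sketch covering only the strict-lexicon case. `WordTag` and `RulesByWordSketch` are hypothetical names for this example. `WordTag` stands in for `IntTaggedWord`, and a list of lists replaces the array of lists to avoid generic array creation. The dynamically expanded tag sets described above are not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical (word, tag) pair of int ids; a stand-in for IntTaggedWord.
final class WordTag {
    final int word;
    final int tag;

    WordTag(int word, int tag) {
        this.word = word;
        this.tag = tag;
    }
}

final class RulesByWordSketch {
    // rules.get(w) holds the taggings recorded for word id w.
    private final List<List<WordTag>> rules = new ArrayList<>();

    RulesByWordSketch(int numWords) {
        for (int w = 0; w < numWords; w++) {
            rules.add(new ArrayList<>());
        }
    }

    void addRule(int word, int tag) {
        rules.get(word).add(new WordTag(word, tag));
    }

    // Iterates over the stored taggings; out-of-range ids yield an empty iterator.
    Iterator<WordTag> ruleIteratorByWord(int word) {
        if (word < 0 || word >= rules.size()) {
            return Collections.emptyIterator();
        }
        return rules.get(word).iterator();
    }
}
```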
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Same thing, but with a string that needs to be translated by the lexicon's word index.
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
  - initRulesWithWord
```
protected void initRulesWithWord()
```
  - treeToEvents
```
protected java.util.List<IntTaggedWord> treeToEvents(Tree tree)
```
  - listToEvents
```
protected java.util.List<IntTaggedWord> listToEvents(java.util.List<TaggedWord> taggedWords)
```
  - addAll
```
public void addAll(java.util.List<TaggedWord> tagWords)
```
    Not yet implemented.
  - addAll
```
public void addAll(java.util.List<TaggedWord> taggedWords,
                   double weight)
```
    Not yet implemented.
  - trainWithExpansion
```
public void trainWithExpansion(java.util.Collection<TaggedWord> taggedWords)
```
    Not yet implemented.
  - initializeTraining
```
public void initializeTraining(double numTrees)
```
    Description copied from interface: Lexicon
    
    Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)
    
    Specified by:
    
    initializeTraining in interface Lexicon
  - train
```
public void train(java.util.Collection<Tree> trees)
```
    Trains this lexicon on the Collection of trees.
    
    Specified by:
    
    train in interface Lexicon
    
    Parameters:
    
    trees - Trees to train on
  - train
```
public void train(java.util.Collection<Tree> trees,
                  double weight)
```
    Trains this lexicon on the Collection of trees. Also trains the unknown word model pointed to by this lexicon.
    
    Specified by:
    
    train in interface Lexicon
  - train
```
public void train(Tree tree,
                  double weight)
```
    Specified by:
    
    train in interface Lexicon
  - train
```
public final void train(java.util.List<TaggedWord> sentence,
                        double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail at runtime.
    
    Specified by:
    
    train in interface Lexicon
  - incrementTreesRead
```
public final void incrementTreesRead(double weight)
```
    Description copied from interface: Lexicon
    
    If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
    
    Specified by:
    
    incrementTreesRead in interface Lexicon
  - trainUnannotated
```
public final void trainUnannotated(java.util.List<TaggedWord> sentence,
                                   double weight)
```
    Description copied from interface: Lexicon
    
    Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
    
    Specified by:
    
    trainUnannotated in interface Lexicon
  - train
```
public void train(TaggedWord tw,
                  int loc,
                  double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail at runtime.
    
    Specified by:
    
    train in interface Lexicon
  - finishTraining
```
public void finishTraining()
```
    Description copied from interface: Lexicon
    
    Done collecting statistics for the lexicon.
    
    Specified by:
    
    finishTraining in interface Lexicon
  - addTagging
```
protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
```
    Adds the tagging with count to the data structures in this Lexicon.
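    The seenCounter field is documented as also holding word/tag pairs in which one side is a wildcard. One way such a counter could be maintained is sketched below. That a single tagging updates the exact pair, both wildcard marginals, and a grand total is an assumption of this example, as are the class name `SeenCounterSketch` and the wildcard id `-1`; none of these are taken from the library source.

```java
import java.util.HashMap;
import java.util.Map;

final class SeenCounterSketch {
    // Assumed wildcard id meaning "any word" or "any tag"; chosen for this example.
    static final int ANY = -1;

    private final Map<Long, Double> counts = new HashMap<>();

    // Packs a (word, tag) pair of ints into a single map key.
    private static long key(int word, int tag) {
        return ((long) word << 32) | (tag & 0xFFFFFFFFL);
    }

    // Adds count to the exact pair and to each wildcard marginal (assumed behaviour).
    void addTagging(int word, int tag, double count) {
        counts.merge(key(word, tag), count, Double::sum);
        counts.merge(key(word, ANY), count, Double::sum);
        counts.merge(key(ANY, tag), count, Double::sum);
        counts.merge(key(ANY, ANY), count, Double::sum);
    }

    double getCount(int word, int tag) {
        return counts.getOrDefault(key(word, tag), 0.0);
    }
}
```

    With this layout, count(W), count(T) and the total used by a scoring function are plain lookups rather than sums over the whole counter.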
  - score
```
public float score(IntTaggedWord iTW,
                   int loc,
                   java.lang.String word,
                   java.lang.String featureSpec)
```
    Get the score of this word with this tag (as an IntTaggedWord) at this location. (Presumably an estimate of P(word | tag).)
    Implementation documentation:

    Seen:
    - c_W = count(W)
    - c_TW = count(T,W)
    - c_T = count(T)
    - c_Tunseen = count(T) among new words in 2nd half
    - total = count(seen words)
    - totalUnseen = count("unseen" words)
    - p_T_U = Pmle(T|"unseen")
    - pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold), pb_T_W = c_TW/c_W; else (if not smart mutation) pb_T_W = Bayes prior smooth[1] with p_T_U
    - p_T = Pmle(T)
    - p_W = Pmle(W)
    - pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]

    Note that this doesn't really properly reserve mass to unknowns.

    Unseen:
    - c_TS = count(T,Sig|Unseen)
    - c_S = count(Sig)
    - c_T = count(T|Unseen)
    - c_U = totalUnseen above
    - p_T_U = Pmle(T|Unseen)
    - pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
    - pb_W_T = log(P(W|T)) inverted
    
    Specified by:
    
    score in interface Lexicon
    
    Parameters:
    
    iTW - An IntTaggedWord pairing a word and POS tag
    
    loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial
    
    word - The word itself; useful so we don't have to look it up in an index
    
    featureSpec - TODO
    
    Returns:
    
    A float score, usually, log P(word|tag)
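    The seen-word branch of the implementation documentation above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's code: the name `SeenWordScoreSketch` is hypothetical, the smart-mutation case is omitted, Pmle(T) and Pmle(W) are taken as count/total, and "Bayes prior smooth[1] with p_T_U" is assumed to mean the additive form (c_TW + smooth * p_T_U) / (c_W + smooth).

```java
final class SeenWordScoreSketch {
    /**
     * Computes log P(W|T) for a seen word from raw counts.
     * smooth is the prior strength (smooth[1] in the description above).
     */
    static double logProbWordGivenTag(double cTW, double cW, double cT,
                                      double total, double pTU,
                                      double threshold, double smooth) {
        double pbTW;
        if (cW > threshold) {
            // Frequent word: use the relative frequency of the tag directly.
            pbTW = cTW / cW;
        } else {
            // Rare word: assumed additive smoothing towards the unseen-word tag prior.
            pbTW = (cTW + smooth * pTU) / (cW + smooth);
        }
        double pT = cT / total; // maximum-likelihood P(T)
        double pW = cW / total; // maximum-likelihood P(W)
        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T)
        return Math.log(pbTW * pW / pT);
    }
}
```

    For a word above the threshold the result reduces to log(c_TW / c_T), because the c_W and total terms cancel.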
  - tune
```
public final void tune()
```
    TODO: this used to actually score things based on the original trees
  - readData
```
public void readData(java.io.BufferedReader in)
              throws java.io.IOException
```
    Populates data in this Lexicon from the character stream given by the BufferedReader in. TODO: this doesn't appear to correctly read in the UnknownWordModel in the case of a model more complicated than the unSeenCounter.
    
    Specified by:
    
    readData in interface Lexicon
    
    Parameters:
    
    in - The BufferedReader to read from
    
    Throws:
    
    java.io.IOException - If any I/O problem
  - writeData
```
public void writeData(java.io.Writer w)
               throws java.io.IOException
```
    Writes out data from this Object to the Writer w. Rules are separated by newline, and rule elements are delimited by \t.
    
    Specified by:
    
    writeData in interface Lexicon
    
    Parameters:
    
    w - The writer to output to
    
    Throws:
    
    java.io.IOException - If any I/O problem
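    The line-per-rule, tab-delimited layout described for writeData can be illustrated with a small round trip. This sketch shows only the delimiting scheme. The class `RuleFileSketch` is hypothetical, and which elements make up a rule is left to the caller, because the description above does not specify them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

final class RuleFileSketch {
    // Writes one rule per line, with rule elements separated by tabs.
    static void writeRules(List<String[]> rules, Writer w) throws IOException {
        for (String[] rule : rules) {
            w.write(String.join("\t", rule));
            w.write("\n");
        }
        w.flush();
    }

    // Reads rules back: one rule per non-empty line, elements split on tabs.
    static List<String[]> readRules(BufferedReader in) throws IOException {
        List<String[]> rules = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.isEmpty()) {
                rules.add(line.split("\t"));
            }
        }
        return rules;
    }

    // Convenience helper: writes to an in-memory buffer and reads the result back.
    static List<String[]> roundTrip(List<String[]> rules) {
        try {
            StringWriter buffer = new StringWriter();
            writeRules(rules, buffer);
            return readRules(new BufferedReader(new StringReader(buffer.toString())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    The scheme only round-trips cleanly if no rule element itself contains a tab or a newline.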
  - numRules
```
public int numRules()
```
    Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.
    
    Specified by:
    
    numRules in interface Lexicon
    
    Returns:
    
    The number of rules (tag rewrites as word) in the Lexicon.
  - examineIntersection
```
protected static void examineIntersection(java.util.Set<java.lang.String> s1,
                                          java.util.Set<java.lang.String> s2)
```
  - printLexStats
```
public void printLexStats()
```
    Print some statistics about this lexicon.
  - evaluateCoverage
```
public double evaluateCoverage(java.util.Collection<Tree> trees,
                               java.util.Set<java.lang.String> missingWords,
                               java.util.Set<java.lang.String> missingTags,
                               java.util.Set<IntTaggedWord> missingTW)
```
    Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. First arg is the collection of trees; second through fourth args get the results. Currently unused; this probably only works if train and test at same time so tags and words variables are initialized.
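    The calling convention described above, one input collection plus three sets that receive the results, can be sketched as follows. This is a simplified illustration. It works on flat (word, tag) string pairs instead of trees, the class `CoverageSketch` and its extra "known" parameters are hypothetical, and defining coverage as the fraction of tokens whose word/tag pair is known is an assumption of the example.

```java
import java.util.List;
import java.util.Set;

final class CoverageSketch {
    /**
     * Returns the fraction of (word, tag) tokens whose pair is in knownPairs.
     * Unknown words, tags and pairs are added to the three result sets.
     */
    static double evaluateCoverage(List<String[]> tokens,
                                   Set<String> knownWords,
                                   Set<String> knownTags,
                                   Set<String> knownPairs,
                                   Set<String> missingWords,
                                   Set<String> missingTags,
                                   Set<String> missingPairs) {
        int covered = 0;
        for (String[] token : tokens) {
            String word = token[0];
            String tag = token[1];
            String pair = word + "/" + tag;
            if (!knownWords.contains(word)) {
                missingWords.add(word);
            }
            if (!knownTags.contains(tag)) {
                missingTags.add(tag);
            }
            if (knownPairs.contains(pair)) {
                covered++;
            } else {
                missingPairs.add(pair);
            }
        }
        // An empty input is reported as zero coverage rather than dividing by zero.
        return tokens.isEmpty() ? 0.0 : (double) covered / tokens.size();
    }
}
```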
  - getBaseTag
```
public int getBaseTag(int tag,
                      TreebankLanguagePack tlp)
```
  - main
```
public static void main(java.lang.String[] args)
```
    Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. What's here currently probably only works for the English Penn Treebank, as it uses default constructors. Of the words given to test on, the first is treated as sentence initial, and the rest as not sentence initial.
    
    Parameters:
    
    args - The command line arguments: java BaseLexicon treebankPath fileRange unknownWordModel words*
  - getUnknownWordModel
```
public UnknownWordModel getUnknownWordModel()
```
    Specified by:
    
    getUnknownWordModel in interface Lexicon
  - setUnknownWordModel
```
public final void setUnknownWordModel(UnknownWordModel uwm)
```
    Specified by:
    
    setUnknownWordModel in interface Lexicon
  - train
```
public void train(java.util.Collection<Tree> trees,
                  java.util.Collection<Tree> rawTrees)
```
    Specified by:
    
    train in interface Lexicon
