java.lang.Object
edu.stanford.nlp.parser.lexparser.BaseLexicon
public class BaseLexicon
This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.
Field Summary | |
---|---|
protected static boolean |
DEBUG_LEXICON
|
protected static boolean |
DEBUG_LEXICON_SCORE
|
protected int |
lastSentencePosition
|
protected int |
lastSignatureIndex
The last signature looked up is cached, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, one unknown word, if seen both sentence-initially and non-initially, will be parsed with two different signatures.) |
protected int |
lastWordToSignaturize
|
protected static short |
nullTag
|
protected static int |
nullWord
|
List<IntTaggedWord>[] |
rulesWithWord
An array of Lists of rules (IntTaggedWord), indexed by word. |
ClassicCounter<IntTaggedWord> |
seenCounter
Records the number of times word/tag pair was seen in training data. |
protected boolean |
smartMutation
Have tags changeable based on statistics on word types having various taggings. |
protected int |
smoothInUnknownsThreshold
If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors. |
protected Set<IntTaggedWord> |
tags
Set of all tags as IntTaggedWord. |
protected ClassicCounter<IntTaggedWord> |
unSeenCounter
Has counts for taggings in terms of unseen signatures. |
protected UnknownWordModel |
uwModel
|
protected Set<IntTaggedWord> |
words
|
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon |
---|
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD |
Constructor Summary | |
---|---|
BaseLexicon()
|
|
BaseLexicon(Options.LexOptions op)
|
Method Summary | |
---|---|
void |
addAll(List<TaggedWord> tagWords)
|
void |
addAll(List<TaggedWord> taggedWords,
double weight)
|
protected void |
addTagging(boolean seen,
IntTaggedWord itw,
double count)
Adds the tagging with count to the data structures in this Lexicon. |
double |
evaluateCoverage(Collection<Tree> trees,
Set missingWords,
Set missingTags,
Set<IntTaggedWord> missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. |
int |
getBaseTag(int tag,
TreebankLanguagePack tlp)
|
UnknownWordModel |
getUnknownWordModel()
|
protected void |
initRulesWithWord()
|
boolean |
isKnown(int word)
Checks whether a word is in the lexicon. |
boolean |
isKnown(String word)
Checks whether a word is in the lexicon. |
protected List<IntTaggedWord> |
listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
|
protected List<IntTaggedWord> |
listToEvents(List<TaggedWord> taggedWords)
|
static void |
main(String[] args)
Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. |
int |
numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. |
void |
printLexStats()
Print some statistics about this lexicon. |
void |
readData(BufferedReader in)
Populates data in this Lexicon from the character stream given by the BufferedReader in. |
Iterator<IntTaggedWord> |
ruleIteratorByWord(int word,
int loc)
Generate the possible taggings for a word at a sentence position. |
Iterator<IntTaggedWord> |
ruleIteratorByWord(String word,
int loc)
Returns the possible POS taggings for a word. |
float |
score(IntTaggedWord iTW,
int loc)
Get the score of this word with this tag (as an IntTaggedWord) at this location. |
void |
setTagNumberer(Numberer tagNumberer)
|
void |
setUnknownWordModel(UnknownWordModel uwm)
|
void |
setWordNumberer(Numberer wordNumberer)
|
void |
train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees. |
void |
train(Collection<Tree> trees,
boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees. |
void |
train(Collection<Tree> trees,
double weight)
|
void |
train(Collection<Tree> trees,
double weight,
boolean keepTagsAsLabels)
Trains this lexicon on the Collection of trees. |
void |
trainWithExpansion(Collection<TaggedWord> taggedWords)
|
protected List<IntTaggedWord> |
treeToEvents(Tree tree)
|
protected List<IntTaggedWord> |
treeToEvents(Tree tree,
boolean keepTagsAsLabels)
|
void |
tune(Collection<Tree> trees)
|
void |
writeData(Writer w)
Writes out data from this Object to the Writer w. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected UnknownWordModel uwModel
protected static final boolean DEBUG_LEXICON
protected static final boolean DEBUG_LEXICON_SCORE
protected static final int nullWord
protected static final short nullTag
protected int smoothInUnknownsThreshold
protected boolean smartMutation
public transient List<IntTaggedWord>[] rulesWithWord
protected transient Set<IntTaggedWord> tags
protected transient Set<IntTaggedWord> words
public ClassicCounter<IntTaggedWord> seenCounter
protected ClassicCounter<IntTaggedWord> unSeenCounter
protected transient int lastSignatureIndex
protected transient int lastSentencePosition
protected transient int lastWordToSignaturize
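The three transient fields lastSignatureIndex, lastSentencePosition, and lastWordToSignaturize support the single-entry signature cache described in the field summary. The following is a minimal sketch of such a cache, not the BaseLexicon implementation: the class SignatureCacheSketch, the signatureFunction parameter, the hasCachedEntry flag, and the computations counter are all hypothetical, and keying the cache on both word and sentence position is an assumption drawn from the note that one word can receive two different signatures.

```java
import java.util.function.IntBinaryOperator;

/** Hypothetical single-entry cache; a sketch, not the BaseLexicon implementation. */
final class SignatureCacheSketch {
    // These three names mirror the documented fields; everything else is illustrative.
    private int lastWordToSignaturize;
    private int lastSentencePosition;
    private int lastSignatureIndex;

    private boolean hasCachedEntry = false; // hypothetical: avoids relying on sentinel values
    private int computations = 0;           // hypothetical: counts cache misses, for demonstration

    // Hypothetical stand-in for the signature computation: (word, loc) -> signature index.
    private final IntBinaryOperator signatureFunction;

    SignatureCacheSketch(IntBinaryOperator signatureFunction) {
        this.signatureFunction = signatureFunction;
    }

    /** Returns the signature index, recomputing only when the word or position changed. */
    int signatureIndex(int word, int loc) {
        // ASSUMPTION: the cache key is the (word, position) pair.
        if (hasCachedEntry && word == lastWordToSignaturize && loc == lastSentencePosition) {
            return lastSignatureIndex;
        }
        lastSignatureIndex = signatureFunction.applyAsInt(word, loc);
        lastWordToSignaturize = word;
        lastSentencePosition = loc;
        hasCachedEntry = true;
        computations++;
        return lastSignatureIndex;
    }

    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        // Toy signature function: differs for sentence-initial (loc == 0) words.
        SignatureCacheSketch cache =
            new SignatureCacheSketch((word, loc) -> word * 2 + (loc == 0 ? 1 : 0));
        cache.signatureIndex(7, 0);
        cache.signatureIndex(7, 0); // served from the cache
        cache.signatureIndex(7, 3); // same word, new position: recomputed
        System.out.println("signature computations: " + cache.computations());
    }
}
```

A one-entry cache fits the access pattern described above, where the same unknown word is looked up repeatedly in a row; a larger map would only pay off if lookups for different unknown words were interleaved.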
Constructor Detail |
---|
public BaseLexicon()
public BaseLexicon(Options.LexOptions op)
Method Detail |
---|
public boolean isKnown(int word)
isKnown
in interface Lexicon
word
- The word as an int index to a Numberer
public boolean isKnown(String word)
isKnown
in interface Lexicon
word
- The word as a String
public Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc)
word
- The word as a String
loc
- The position of the word in the sentence (counting from 0).
Implementation note: The BaseLexicon class doesn't actually
make use of this position information.
tag -> word rule.)
public Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc)
Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).
ruleIteratorByWord
in interface Lexicon
word
- The word (as an int)
loc
- Its index in the sentence (usually only relevant for unknown words)
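The rulesWithWord field is documented as an array of lists of rules indexed by word, and ruleIteratorByWord returns an iterator over the taggings for one word. A minimal sketch of that indexing idea follows. It is an illustration under assumptions, not the library code: Tagging is a hypothetical stand-in for IntTaggedWord, a list of lists replaces the array, and the dynamic expansion of taggings mentioned in the implementation note is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a word-indexed rule table; not the BaseLexicon implementation. */
final class RulesByWordSketch {

    /** Hypothetical stand-in for IntTaggedWord: a (word index, tag index) pair. */
    static final class Tagging {
        final int word;
        final int tag;

        Tagging(int word, int tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // One list of taggings per word index (plays the role of rulesWithWord).
    private final List<List<Tagging>> rulesWithWord;

    RulesByWordSketch(int numWords, List<Tagging> seenTaggings) {
        rulesWithWord = new ArrayList<>(numWords);
        for (int w = 0; w < numWords; w++) {
            rulesWithWord.add(new ArrayList<>());
        }
        for (Tagging t : seenTaggings) {
            rulesWithWord.get(t.word).add(t);
        }
    }

    /** Returns the stored taggings for a word; loc is accepted but unused in this sketch. */
    Iterator<Tagging> ruleIteratorByWord(int word, int loc) {
        if (word < 0 || word >= rulesWithWord.size()) {
            return Collections.emptyIterator();
        }
        return rulesWithWord.get(word).iterator();
    }

    /** Total number of stored rules (each tagging is one tag -> word rule). */
    int numRules() {
        int total = 0;
        for (List<Tagging> rules : rulesWithWord) {
            total += rules.size();
        }
        return total;
    }
}
```

Indexing by word makes the per-word lookup a constant-time list access, at the cost of one list per word index.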
protected void initRulesWithWord()
protected List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
protected List<IntTaggedWord> treeToEvents(Tree tree)
protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
public void addAll(List<TaggedWord> tagWords)
public void addAll(List<TaggedWord> taggedWords, double weight)
public void trainWithExpansion(Collection<TaggedWord> taggedWords)
public void train(Collection<Tree> trees)
train
in interface Lexicon
public void train(Collection<Tree> trees, boolean keepTagsAsLabels)
public void train(Collection<Tree> trees, double weight)
public void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
protected void addTagging(boolean seen, IntTaggedWord itw, double count)
public float score(IntTaggedWord iTW, int loc)
Implementation documentation:
Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold), pb_T_W = c_TW/c_W
Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
Note that this doesn't really properly reserve mass to unknowns.
Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
pb_W_T = log(P(W|T)) inverted
score
in interface Lexicon
iTW
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation this is used only for unknown words, to change their probability distribution when sentence-initial
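The "Seen" branch of the implementation documentation for score can be traced numerically with a small sketch. This is an illustration under stated assumptions, not the library code: the class and method names are hypothetical, smart mutation is ignored, and the exact form of the "bayes prior smooth" step is assumed to be the additive form (c_TW + smooth * p_T_U) / (c_W + smooth), which the documentation does not spell out.

```java
/** Hypothetical sketch of the seen-word scoring scheme; not the BaseLexicon implementation. */
final class SeenWordScoreSketch {

    /**
     * Computes pb_W_T = log(pb_T_W * p_W / p_T) for a seen word.
     *
     * @param cTW         count(T,W)
     * @param cW          count(W)
     * @param cT          count(T)
     * @param total       count(seen words)
     * @param cTunseen    count(T) among "unseen" words
     * @param totalUnseen count("unseen" words); must be positive
     * @param threshold   plays the role of smoothInUnknownsThreshold
     * @param smooth      plays the role of smooth[1]
     */
    static double logProbWordGivenTag(double cTW, double cW, double cT, double total,
                                      double cTunseen, double totalUnseen,
                                      int threshold, double smooth) {
        double pTU = cTunseen / totalUnseen; // p_T_U = Pmle(T|"unseen")

        double pbTW;
        if (cW > threshold) {
            // Frequent word: use the relative frequency of the tag directly.
            pbTW = cTW / cW;
        } else {
            // ASSUMPTION: additive smoothing toward the unseen-word tag prior.
            pbTW = (cTW + smooth * pTU) / (cW + smooth);
        }

        double pT = cT / total; // p_T = Pmle(T)
        double pW = cW / total; // p_W = Pmle(W)

        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned as a log probability.
        return Math.log(pbTW * pW / pT);
    }

    public static void main(String[] args) {
        // Toy counts, chosen only for illustration.
        double frequent = logProbWordGivenTag(8, 10, 40, 100, 5, 20, 5, 1.0);
        double rare = logProbWordGivenTag(8, 10, 40, 100, 5, 20, 100, 1.0);
        System.out.println("above threshold: " + frequent);
        System.out.println("below threshold: " + rare);
    }
}
```

With the toy counts in main, the word is above the threshold in the first call, so pb_T_W = 8/10; in the second call the threshold is raised, so the estimate is pulled toward p_T_U = 5/20 before Bayes rule is applied.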
public void tune(Collection<Tree> trees)
public void readData(BufferedReader in) throws IOException
readData
in interface Lexicon
in
- The BufferedReader to read from
IOException
public void writeData(Writer w) throws IOException
writeData
in interface Lexicon
w
- The writer to output to
IOException
public int numRules()
numRules
in interface Lexicon
public void printLexStats()
public double evaluateCoverage(Collection<Tree> trees, Set missingWords, Set missingTags, Set<IntTaggedWord> missingTW)
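As a rough illustration of what a coverage evaluation over terminals can look like, the sketch below counts how many word/tag tokens are found in a lexicon and collects the misses into caller-supplied sets, mirroring the three "missing" parameters of evaluateCoverage. Everything here is hypothetical: the Token class replaces tree terminals, strings replace numbered words and tags, and returning the fraction of tokens whose word/tag pair is known is an assumption, since the documentation does not define the return value.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical coverage sketch over flat tokens; not the BaseLexicon implementation. */
final class CoverageSketch {

    /** Hypothetical stand-in for a tagged terminal. */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * ASSUMPTION: coverage is the fraction of tokens whose word/tag pair is in the lexicon.
     * Misses are added to the three caller-supplied sets. Returns 0.0 for empty input.
     */
    static double evaluateCoverage(List<Token> tokens, List<Token> lexicon,
                                   Set<String> missingWords, Set<String> missingTags,
                                   Set<String> missingPairs) {
        Set<String> knownWords = new HashSet<>();
        Set<String> knownTags = new HashSet<>();
        Set<String> knownPairs = new HashSet<>();
        for (Token entry : lexicon) {
            knownWords.add(entry.word);
            knownTags.add(entry.tag);
            knownPairs.add(entry.word + "/" + entry.tag);
        }

        int covered = 0;
        for (Token t : tokens) {
            if (!knownWords.contains(t.word)) {
                missingWords.add(t.word);
            }
            if (!knownTags.contains(t.tag)) {
                missingTags.add(t.tag);
            }
            String pair = t.word + "/" + t.tag;
            if (knownPairs.contains(pair)) {
                covered++;
            } else {
                missingPairs.add(pair);
            }
        }
        return tokens.isEmpty() ? 0.0 : (double) covered / tokens.size();
    }
}
```

Passing the sets in as output parameters lets one call report both the summary figure and the concrete gaps, which is the shape the documented signature suggests.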
public int getBaseTag(int tag, TreebankLanguagePack tlp)
public static void main(String[] args)
args
- The command line arguments:
java BaseLexicon treebankPath fileRange unknownWordModel words*
public void setWordNumberer(Numberer wordNumberer)
public void setTagNumberer(Numberer tagNumberer)
public UnknownWordModel getUnknownWordModel()
getUnknownWordModel
in interface Lexicon
public void setUnknownWordModel(UnknownWordModel uwm)
setUnknownWordModel
in interface Lexicon