java.lang.Object
  edu.stanford.nlp.parser.lexparser.BaseLexicon
public class BaseLexicon
This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.
Field Summary | | |
---|---|---|
protected static boolean | DEBUG_LEXICON | |
protected static boolean | DEBUG_LEXICON_SCORE | |
protected boolean | flexiTag | |
protected static IntTaggedWord | NULL_ITW | |
protected static short | nullTag | |
protected static int | nullWord | |
List<IntTaggedWord>[] | rulesWithWord | An array of Lists of rules (IntTaggedWord), indexed by word. |
ClassicCounter<IntTaggedWord> | seenCounter | Records the number of times a word/tag pair was seen in training data. |
protected boolean | smartMutation | Have tags changeable based on statistics on word types having various taggings. |
protected int | smoothInUnknownsThreshold | If a word has been seen more than this many times, then relative frequencies of tags are used for POS assignment; if not, they are smoothed with tag priors. |
protected Index<String> | tagIndex | |
protected Set<IntTaggedWord> | tags | Set of all tags as IntTaggedWord. |
protected TestOptions | testOptions | |
protected TrainOptions | trainOptions | |
protected UnknownWordModel | uwModel | |
protected Index<String> | wordIndex | |
protected Set<IntTaggedWord> | words | |
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon |
---|
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD |
Constructor Summary |
---|
BaseLexicon(Index<String> wordIndex, Index<String> tagIndex) |
BaseLexicon(Options op, Index<String> wordIndex, Index<String> tagIndex) |
Method Summary | | |
---|---|---|
void | addAll(List<TaggedWord> tagWords) | Not yet implemented. |
void | addAll(List<TaggedWord> taggedWords, double weight) | Not yet implemented. |
protected void | addTagging(boolean seen, IntTaggedWord itw, double count) | Adds the tagging with count to the data structures in this Lexicon. |
double | evaluateCoverage(Collection<Tree> trees, Set<String> missingWords, Set<String> missingTags, Set<IntTaggedWord> missingTW) | Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. |
int | getBaseTag(int tag, TreebankLanguagePack tlp) | |
UnknownWordModel | getUnknownWordModel() | |
protected void | initRulesWithWord() | |
boolean | isKnown(int word) | Checks whether a word is in the lexicon. |
boolean | isKnown(String word) | Checks whether a word is in the lexicon. |
protected List<IntTaggedWord> | listOfLabeledWordsToEvents(List<LabeledWord> taggedWords) | |
protected List<IntTaggedWord> | listToEvents(List<TaggedWord> taggedWords) | |
static void | main(String[] args) | Provides some testing and opportunities for exploration of the probabilities of a BaseLexicon. |
int | numRules() | Returns the number of rules (tag rewrites as word) in the Lexicon. |
void | printLexStats() | Prints some statistics about this lexicon. |
void | readData(BufferedReader in) | Populates data in this Lexicon from the character stream given by the BufferedReader in. |
Iterator<IntTaggedWord> | ruleIteratorByWord(int word, int loc, String featureSpec) | Generates the possible taggings for a word at a sentence position. |
Iterator<IntTaggedWord> | ruleIteratorByWord(String word, int loc) | Returns the possible POS taggings for a word. |
Iterator<IntTaggedWord> | ruleIteratorByWord(String word, int loc, String featureSpec) | Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which is translated through the lexicon's word index. |
float | score(IntTaggedWord iTW, int loc, String word) | Gets the score of this word with this tag (as an IntTaggedWord) at this location. |
void | setUnknownWordModel(UnknownWordModel uwm) | |
void | train(Collection<Tree> trees) | Trains this lexicon on the Collection of trees. |
void | train(Collection<Tree> trees, boolean keepTagsAsLabels) | Trains this lexicon on the Collection of trees. |
void | train(Collection<Tree> trees, double weight) | |
void | train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels) | Trains this lexicon on the Collection of trees. |
void | trainWithExpansion(Collection<TaggedWord> taggedWords) | Not yet implemented. |
protected List<IntTaggedWord> | treeToEvents(Tree tree) | |
protected List<IntTaggedWord> | treeToEvents(Tree tree, boolean keepTagsAsLabels) | |
void | tune(Collection<Tree> trees) | |
void | writeData(Writer w) | Writes out data from this Object to the Writer w. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected UnknownWordModel uwModel
protected static final boolean DEBUG_LEXICON
protected static final boolean DEBUG_LEXICON_SCORE
protected static final int nullWord
protected static final short nullTag
protected static final IntTaggedWord NULL_ITW
protected final TrainOptions trainOptions
protected final TestOptions testOptions
protected int smoothInUnknownsThreshold
protected boolean smartMutation
protected final Index<String> wordIndex
protected final Index<String> tagIndex
public transient List<IntTaggedWord>[] rulesWithWord
protected transient Set<IntTaggedWord> tags
protected transient Set<IntTaggedWord> words
public ClassicCounter<IntTaggedWord> seenCounter
protected boolean flexiTag
Constructor Detail |
---|
public BaseLexicon(Index<String> wordIndex, Index<String> tagIndex)
public BaseLexicon(Options op, Index<String> wordIndex, Index<String> tagIndex)
Method Detail |
---|
public boolean isKnown(int word)
isKnown
in interface Lexicon
word
- The word as an int index to an Index
public boolean isKnown(String word)
isKnown
in interface Lexicon
word
- The word as a String
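The two isKnown overloads differ only in how the word is named: as a String, or as its int position in an index. The sketch below illustrates that relationship under stated assumptions: SimpleIndex and KnownWordSketch are hypothetical stand-ins (not the library's Index<String> or BaseLexicon), and a word is assumed to count as known once it has been observed. The actual BaseLexicon logic may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a String index; not the library's Index<String>. */
final class SimpleIndex {
    private final List<String> items = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    /** Returns the id of the item, adding it first if it is new. */
    int addToIndex(String item) {
        Integer id = positions.get(item);
        if (id == null) {
            id = items.size();
            items.add(item);
            positions.put(item, id);
        }
        return id;
    }

    /** Returns the id of the item, or -1 if it is absent. */
    int indexOf(String item) {
        return positions.getOrDefault(item, -1);
    }
}

/** Hypothetical lexicon fragment showing the two isKnown overloads. */
final class KnownWordSketch {
    private final SimpleIndex wordIndex = new SimpleIndex();
    private final Set<Integer> seenWords = new HashSet<>();

    /** Assumption: observing a word during training makes it known. */
    void observe(String word) {
        seenWords.add(wordIndex.addToIndex(word));
    }

    boolean isKnown(int word) {
        return seenWords.contains(word);
    }

    boolean isKnown(String word) {
        // Translate through the index, then reuse the int overload.
        return isKnown(wordIndex.indexOf(word));
    }
}
```

Routing the String overload through the int overload keeps a single definition of "known" in this sketch.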
public Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc)
word
- The word as a String (translated to an integer through wordIndex)
loc
- The position of the word in the sentence (counting from 0).
Implementation note: The BaseLexicon class doesn't actually
make use of this position information.
tag -> word rule.)
public Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
Implementation note: Expanded sets of possible taggings are calculated dynamically at runtime, so as to reduce the memory used by the lexicon (a space/time tradeoff).
ruleIteratorByWord
in interface Lexicon
word
- The word (as an int)
loc
- Its index in the sentence (usually only relevant for unknown words)
featureSpec
- Additional word features like morphosyntactic information.
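The implementation note above describes a space/time tradeoff: taggings that can be derived are computed when requested rather than stored. A minimal sketch of that idea follows. RuleIteratorSketch and Tagging are hypothetical names (Tagging stands in for IntTaggedWord), and the rule "a word without stored taggings may take any tag" is an assumption made for illustration, not the library's actual expansion logic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: stored taggings for known words, on-demand ones otherwise. */
final class RuleIteratorSketch {
    /** Hypothetical word/tag pair; a stand-in for IntTaggedWord. */
    static final class Tagging {
        final int word;
        final int tag;

        Tagging(int word, int tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Tag ids stored per word id, loosely analogous to rulesWithWord.
    private final List<List<Integer>> tagsSeenWithWord;
    private final int numTags;

    RuleIteratorSketch(List<List<Integer>> tagsSeenWithWord, int numTags) {
        this.tagsSeenWithWord = tagsSeenWithWord;
        this.numTags = numTags;
    }

    /** The loc argument is accepted but ignored in this sketch. */
    Iterator<Tagging> ruleIteratorByWord(int word, int loc) {
        List<Tagging> rules = new ArrayList<>();
        boolean hasStored = word >= 0
                && word < tagsSeenWithWord.size()
                && !tagsSeenWithWord.get(word).isEmpty();
        if (hasStored) {
            for (int tag : tagsSeenWithWord.get(word)) {
                rules.add(new Tagging(word, tag));
            }
        } else {
            // Assumption: nothing is stored for this word, so candidate
            // taggings are built at call time instead of being kept in memory.
            for (int tag = 0; tag < numTags; tag++) {
                rules.add(new Tagging(word, tag));
            }
        }
        return rules.iterator();
    }
}
```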
public Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc, String featureSpec)
ruleIteratorByWord
in interface Lexicon
protected void initRulesWithWord()
protected List<IntTaggedWord> treeToEvents(Tree tree, boolean keepTagsAsLabels)
protected List<IntTaggedWord> treeToEvents(Tree tree)
protected List<IntTaggedWord> listToEvents(List<TaggedWord> taggedWords)
protected List<IntTaggedWord> listOfLabeledWordsToEvents(List<LabeledWord> taggedWords)
public void addAll(List<TaggedWord> tagWords)
public void addAll(List<TaggedWord> taggedWords, double weight)
public void trainWithExpansion(Collection<TaggedWord> taggedWords)
public void train(Collection<Tree> trees)
train
in interface Lexicon
trees
- Trees to train on
public void train(Collection<Tree> trees, boolean keepTagsAsLabels)
public void train(Collection<Tree> trees, double weight)
public void train(Collection<Tree> trees, double weight, boolean keepTagsAsLabels)
protected void addTagging(boolean seen, IntTaggedWord itw, double count)
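The count argument lets a tagging contribute a fractional or weighted amount, which is what the weighted train overloads need. The sketch below shows one way such weighted counts could be accumulated so that count(T,W), count(W), count(T) and the overall total are all available later. TaggingCountsSketch and its String keys are hypothetical; the library stores IntTaggedWord keys in a ClassicCounter, and its bookkeeping may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of weighted word/tag counting. */
final class TaggingCountsSketch {
    private final Map<String, Double> pairCounts = new HashMap<>();
    private final Map<String, Double> wordCounts = new HashMap<>();
    private final Map<String, Double> tagCounts = new HashMap<>();
    private double total = 0.0;

    /** Adds one tagging with the given (possibly fractional) count. */
    void addTagging(String word, String tag, double count) {
        pairCounts.merge(key(word, tag), count, Double::sum); // count(T,W)
        wordCounts.merge(word, count, Double::sum);           // count(W)
        tagCounts.merge(tag, count, Double::sum);             // count(T)
        total += count;                                       // count of all taggings
    }

    double count(String word, String tag) {
        return pairCounts.getOrDefault(key(word, tag), 0.0);
    }

    double wordCount(String word) {
        return wordCounts.getOrDefault(word, 0.0);
    }

    double tagCount(String tag) {
        return tagCounts.getOrDefault(tag, 0.0);
    }

    double total() {
        return total;
    }

    // Assumption: words and tags never contain a tab character.
    private static String key(String word, String tag) {
        return word + "\t" + tag;
    }
}
```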
public float score(IntTaggedWord iTW, int loc, String word)
Implementation documentation:
Seen:
  c_W = count(W)
  c_TW = count(T,W)
  c_T = count(T)
  c_Tunseen = count(T) among new words in 2nd half
  total = count(seen words)
  totalUnseen = count("unseen" words)
  p_T_U = Pmle(T|"unseen")
  pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold) = c_TW/c_W
    Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
  p_T = Pmle(T)
  p_W = Pmle(W)
  pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
  Note that this doesn't really properly reserve mass to unknowns.
Unseen:
  c_TS = count(T,Sig|Unseen)
  c_S = count(Sig)
  c_T = count(T|Unseen)
  c_U = totalUnseen above
  p_T_U = Pmle(T|Unseen)
  pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
  pb_W_T = log(P(W|T)) inverted
score
in interface Lexicon
iTW
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation
this is used only for unknown words to change their probability
distribution when sentence initial
word
- The word itself; useful so we don't have to look it
up in an index
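The "Seen" branch of the implementation documentation for score can be read as a short computation. The sketch below follows it under stated assumptions: SeenWordScoreSketch and priorWeight (playing the role of smooth[1]) are hypothetical names, the Bayesian prior smoothing is assumed to take the common additive form (c_TW + priorWeight * p_T_U) / (c_W + priorWeight), and the smart-mutation and unseen-word branches are omitted.

```java
/** Hypothetical sketch of the seen-word score, log P(W|T) via Bayes rule. */
final class SeenWordScoreSketch {
    private SeenWordScoreSketch() {}

    /**
     * @param cTW count(T,W)
     * @param cW count(W)
     * @param cT count(T)
     * @param total count of seen words
     * @param pTU Pmle(T|"unseen"), used as the prior when smoothing
     * @param smoothInUnknownsThreshold above this word count, no smoothing is applied
     * @param priorWeight assumed strength of the prior (the role of smooth[1])
     */
    static double logProbWordGivenTag(double cTW, double cW, double cT, double total,
                                      double pTU, double smoothInUnknownsThreshold,
                                      double priorWeight) {
        double pbTW;
        if (cW > smoothInUnknownsThreshold) {
            // Frequent word: plain relative frequency P(T|W).
            pbTW = cTW / cW;
        } else {
            // Rare word: assumed additive smoothing toward the unseen-word tag prior.
            pbTW = (cTW + priorWeight * pTU) / (cW + priorWeight);
        }
        double pT = cT / total; // Pmle(T)
        double pW = cW / total; // Pmle(W)
        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned as a log.
        return Math.log(pbTW * pW / pT);
    }
}
```

For example, with c_TW = 8, c_W = 10, c_T = 20, total = 100 and a threshold of 5, the word is frequent enough that P(T|W) = 0.8, giving log(0.8 * 0.1 / 0.2) = log(0.4).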
public void tune(Collection<Tree> trees)
public void readData(BufferedReader in) throws IOException
readData
in interface Lexicon
in
- The BufferedReader to read from
IOException
- If any I/O problem
public void writeData(Writer w) throws IOException
writeData
in interface Lexicon
w
- The writer to output to
IOException
- If any I/O problem
public int numRules()
numRules
in interface Lexicon
public void printLexStats()
public double evaluateCoverage(Collection<Tree> trees, Set<String> missingWords, Set<String> missingTags, Set<IntTaggedWord> missingTW)
public int getBaseTag(int tag, TreebankLanguagePack tlp)
public static void main(String[] args)
args
- The command line arguments:
java BaseLexicon treebankPath fileRange unknownWordModel words*
public UnknownWordModel getUnknownWordModel()
getUnknownWordModel
in interface Lexicon
public final void setUnknownWordModel(UnknownWordModel uwm)
setUnknownWordModel
in interface Lexicon