edu.stanford.nlp.parser.lexparser
Interface Lexicon

All Superinterfaces:
Serializable
All Known Implementing Classes:
BaseLexicon

public interface Lexicon
extends Serializable

An interface for lexicons used by lexparser. Its primary responsibility is to provide the conditional probability P(word|tag), which is done by the score method. Inside the lexparser, Strings are interned, and tags and words are usually represented as integers.
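
The integer representation of words and tags mentioned above can be pictured with a small sketch. The ToyIndex class below is hypothetical and is not the parser's own index class; it only assumes that each distinct string is assigned a stable integer id the first time it is seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only (not part of lexparser):
// maps each distinct string to a stable int id.
class ToyIndex {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    // Returns the existing id for s, or assigns the next free id.
    int indexOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = strings.size();
            ids.put(s, id);
            strings.add(s);
        }
        return id;
    }

    // Returns the string for an id previously handed out by indexOf.
    String get(int id) {
        return strings.get(id);
    }

    // Number of distinct strings seen so far.
    int size() {
        return strings.size();
    }
}
```

With an index like this, a word or tag can be compared and hashed as an int rather than as a String.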

Author:
Galen Andrew

Field Summary
static String BOUNDARY
           
static String BOUNDARY_TAG
           
static String UNKNOWN_WORD
           
 
Method Summary
 void finishTraining()
          Done collecting statistics for the lexicon.
 UnknownWordModel getUnknownWordModel()
           
 void initializeTraining(double numTrees)
          Start training this lexicon on the expected number of trees.
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(String word)
          Checks whether a word is in the lexicon.
 int numRules()
          Returns the number of rules (tag rewrites as word) in the Lexicon.
 void readData(BufferedReader in)
          Read the lexicon from the BufferedReader in the format written by writeData.
 Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec)
          Get an iterator over all rules (pairs of (word, POS)) for this word.
 Iterator<IntTaggedWord> ruleIteratorByWord(String word, int loc, String featureSpec)
          Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which the lexicon's word index translates to an integer.
 float score(IntTaggedWord iTW, int loc, String word)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 void setUnknownWordModel(UnknownWordModel uwm)
           
 void train(Collection<Tree> trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection<Tree> trees, double weight)
           
 void train(List<TaggedWord> sentence, double weight)
           
 void train(Tree tree, double weight)
           
 void trainUnannotated(List<TaggedWord> sentence, double weight)
          Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
 void writeData(Writer w)
          Write the lexicon in human-readable format to the Writer.
 

Field Detail

UNKNOWN_WORD

static final String UNKNOWN_WORD
See Also:
Constant Field Values

BOUNDARY

static final String BOUNDARY
See Also:
Constant Field Values

BOUNDARY_TAG

static final String BOUNDARY_TAG
See Also:
Constant Field Values
Method Detail

isKnown

boolean isKnown(int word)
Checks whether a word is in the lexicon.

Parameters:
word - The word as an int
Returns:
Whether the word is in the lexicon

isKnown

boolean isKnown(String word)
Checks whether a word is in the lexicon.

Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                           int loc,
                                           String featureSpec)
Get an iterator over all rules (pairs of (word, POS)) for this word.

Parameters:
word - The word, represented as an integer in Index
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
featureSpec - Additional word features like morphosyntactic information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)
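
As a rough picture of what such an iterator returns, the sketch below keeps a list of (word, tag) integer pairs per word. ToyTaggedWord and ToyRuleTable are hypothetical names invented for this illustration; they are not the library's IntTaggedWord or any real implementing class. The sketch also assumes that loc and featureSpec are simply ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an (int word, int tag) pair.
final class ToyTaggedWord {
    final int word;
    final int tag;

    ToyTaggedWord(int word, int tag) {
        this.word = word;
        this.tag = tag;
    }
}

// Hypothetical rule table: for each word id, the taggings recorded for it.
class ToyRuleTable {
    private final Map<Integer, List<ToyTaggedWord>> rulesByWord = new HashMap<>();

    // Records one tag -> word rule.
    void addRule(int word, int tag) {
        rulesByWord.computeIfAbsent(word, k -> new ArrayList<>())
                   .add(new ToyTaggedWord(word, tag));
    }

    // Assumption of this sketch: loc and featureSpec are accepted but unused.
    Iterator<ToyTaggedWord> ruleIteratorByWord(int word, int loc, String featureSpec) {
        List<ToyTaggedWord> rules =
            rulesByWord.getOrDefault(word, Collections.<ToyTaggedWord>emptyList());
        return rules.iterator();
    }
}
```

A word with no recorded rules yields an empty iterator in this sketch; how a real lexicon treats unseen words is up to its unknown word model.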

ruleIteratorByWord

Iterator<IntTaggedWord> ruleIteratorByWord(String word,
                                           int loc,
                                           String featureSpec)
Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which the lexicon's word index translates to an integer.


numRules

int numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. This method assumes that the lexicon has been initialized.

Returns:
The number of rules (tag rewrites as word) in the Lexicon.

initializeTraining

void initializeTraining(double numTrees)
Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)


train

void train(Collection<Tree> trees)
Trains this lexicon on the Collection of trees. Can be called more than once with different collections of trees.

Parameters:
trees - Trees to train on

train

void train(Collection<Tree> trees,
           double weight)

train

void train(Tree tree,
           double weight)

train

void train(List<TaggedWord> sentence,
           double weight)

trainUnannotated

void trainUnannotated(List<TaggedWord> sentence,
                      double weight)
Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.


finishTraining

void finishTraining()
Done collecting statistics for the lexicon.
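
The training methods above suggest a calling order: initializeTraining, then one or more train calls, then finishTraining. The sketch below illustrates that order with a hypothetical ToyTrainableLexicon that accumulates weighted (tag, word) counts. It represents a tagged sentence as a List of String[]{word, tag}, which is an assumption made to keep the example self-contained; the real interface takes Tree and TaggedWord objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the training lifecycle; not an implementation of Lexicon.
class ToyTrainableLexicon {
    // Key is tag + "\t" + word; value is the accumulated weight.
    private final Map<String, Double> pairCounts = new HashMap<>();
    private final Map<String, Double> tagCounts = new HashMap<>();
    private boolean training = false;

    // Resets the statistics. numTrees is unused in this sketch.
    void initializeTraining(double numTrees) {
        pairCounts.clear();
        tagCounts.clear();
        training = true;
    }

    // Adds each (word, tag) pair of the sentence with the given weight.
    void train(List<String[]> sentence, double weight) {
        if (!training) {
            throw new IllegalStateException("call initializeTraining first");
        }
        for (String[] wordAndTag : sentence) {
            String word = wordAndTag[0];
            String tag = wordAndTag[1];
            pairCounts.merge(tag + "\t" + word, weight, Double::sum);
            tagCounts.merge(tag, weight, Double::sum);
        }
    }

    // Marks the end of statistics collection.
    void finishTraining() {
        training = false;
    }

    double count(String word, String tag) {
        return pairCounts.getOrDefault(tag + "\t" + word, 0.0);
    }

    double tagCount(String tag) {
        return tagCounts.getOrDefault(tag, 0.0);
    }
}
```

Because train can be called repeatedly, counts from several collections simply add up, each scaled by its weight.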


score

float score(IntTaggedWord iTW,
            int loc,
            String word)
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)

Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
word - The word itself; useful so we don't have to look it up in an index
Returns:
A score, usually log P(word|tag)
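
To make the returned quantity concrete, the sketch below computes log P(word|tag) as an unsmoothed relative frequency, log(count(tag, word) / count(tag)). This is an assumption chosen for illustration: the hypothetical ToyScore class does no smoothing and has no unknown word handling, both of which a real lexicon would normally provide.

```java
// Hypothetical sketch: unsmoothed relative-frequency estimate of log P(word|tag).
class ToyScore {
    // pairCount: how often the tag was seen with this word.
    // tagCount: how often the tag was seen with any word.
    static float score(double pairCount, double tagCount) {
        if (pairCount <= 0.0 || tagCount <= 0.0) {
            // An unseen pair has probability 0 under this estimate.
            return Float.NEGATIVE_INFINITY;
        }
        return (float) Math.log(pairCount / tagCount);
    }
}
```

Scores in log space are at most 0, and scores of independent events can be added rather than multiplied.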

writeData

void writeData(Writer w)
               throws IOException
Write the lexicon in human-readable format to the Writer. (An optional operation.)

Parameters:
w - The writer to output to
Throws:
IOException - If any I/O problem occurs

readData

void readData(BufferedReader in)
              throws IOException
Read the lexicon from the BufferedReader in the format written by writeData. (An optional operation.)

Parameters:
in - The BufferedReader to read from
Throws:
IOException - If any I/O problem occurs
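
The contract between writeData and readData is that one reads what the other wrote. The sketch below shows such a round trip with a hypothetical ToyLexiconIO class. The line format used here (key, a tab, then the count) is invented for this example and is not the format used by any real implementing class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a write/read round trip; the file format is made up.
class ToyLexiconIO {
    // Writes one "key<TAB>count" line per entry.
    static void writeData(Map<String, Double> counts, Writer w) throws IOException {
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }
        w.flush();
    }

    // Reads lines in the format produced by writeData above.
    static Map<String, Double> readData(BufferedReader in) throws IOException {
        Map<String, Double> counts = new TreeMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            // The count follows the last tab, so keys may themselves contain tabs.
            int cut = line.lastIndexOf('\t');
            counts.put(line.substring(0, cut), Double.parseDouble(line.substring(cut + 1)));
        }
        return counts;
    }

    // Convenience helper: writes to memory and reads the result back.
    static Map<String, Double> roundTrip(Map<String, Double> counts) {
        try {
            StringWriter sw = new StringWriter();
            writeData(counts, sw);
            return readData(new BufferedReader(new StringReader(sw.toString())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since both operations are optional in the interface, callers should not assume that every implementation supports them.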

getUnknownWordModel

UnknownWordModel getUnknownWordModel()

setUnknownWordModel

void setUnknownWordModel(UnknownWordModel uwm)


Stanford NLP Group