edu.stanford.nlp.parser.lexparser
Class Lexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.Lexicon
All Implemented Interfaces:
Serializable

public class Lexicon
extends Object
implements Serializable

A class implementing a Lexicon.

See Also:
Serialized Form

Field Summary
static String BOUNDARY
           
static String BOUNDARY_TAG
           
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, one unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected static short nullTag
           
protected static int nullWord
           
protected  Set rules
           
protected  List[] rulesWithWord
           
protected  Counter seenCounter
           
protected static long serialVersionUID
           
protected  Set sigs
           
protected  Numberer tagNumberer
           
protected  Set tags
           
static String UNKNOWN_WORD
           
protected  Counter unSeenCounter
           
protected  Numberer wordNumberer
           
protected  Set words
           
 
Constructor Summary
Lexicon()
           
 
Method Summary
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 double evaluateCoverage(Collection trees, Set missingWords, Set missingTags, Set missingTW)
          Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
protected  String getSignature(String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
protected  int getSignatureIndex(int wordIndex, int sentencePosition)
          Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
protected  void initRulesWithWord()
           
 boolean isKnown(int word)
           
 boolean isKnown(String word)
          Checks whether a word is in the lexicon.
 void printLexStats()
           
 void readData(BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
protected  void readObject(ObjectInputStream stream)
           
 Iterator ruleIterator()
           
 Iterator ruleIteratorByWord(int word, int loc)
           
 double score(IntTaggedWord iTW, int loc)
           
protected  double scoreAll(List trees)
           
 String showTags()
           
 Numberer tagNumberer()
           
 void train(Collection trees)
          Trains this lexicon on the Collection of trees.
protected  List treeToEvents(Collection trees)
           
protected  List treeToEvents(Tree tree)
           
 void tune(List trees)
           
 Numberer wordNumberer()
           
 void writeData(Writer w)
          Writes out data from this Object to the Writer w.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNKNOWN_WORD

public static final String UNKNOWN_WORD
See Also:
Constant Field Values

BOUNDARY

public static final String BOUNDARY
See Also:
Constant Field Values

BOUNDARY_TAG

public static final String BOUNDARY_TAG
See Also:
Constant Field Values

rulesWithWord

protected transient List[] rulesWithWord

rules

protected transient Set rules

tags

protected transient Set tags

words

protected transient Set words

sigs

protected transient Set sigs

seenCounter

protected Counter seenCounter

unSeenCounter

protected Counter unSeenCounter

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

lastSignatureIndex

protected transient int lastSignatureIndex
Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, one unknown word seen both sentence-initially and non-initially will be parsed with two different signatures.)


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

tagNumberer

protected transient Numberer tagNumberer

wordNumberer

protected transient Numberer wordNumberer

serialVersionUID

protected static final long serialVersionUID
See Also:
Constant Field Values

Constructor Detail

Lexicon

public Lexicon()

Method Detail

tagNumberer

public Numberer tagNumberer()

wordNumberer

public Numberer wordNumberer()

ruleIterator

public Iterator ruleIterator()

isKnown

public boolean isKnown(int word)

isKnown

public boolean isKnown(String word)
Checks whether a word is in the lexicon. This version works even while the lexicon is still being compiled, because it consults the current counters rather than the compiled rulesWithWord array.

Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

public Iterator ruleIteratorByWord(int word,
                                   int loc)

initRulesWithWord

protected void initRulesWithWord()

treeToEvents

protected List treeToEvents(Tree tree)

treeToEvents

protected List treeToEvents(Collection trees)

train

public void train(Collection trees)
Trains this lexicon on the Collection of trees.


addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.


getSignatureIndex

protected int getSignatureIndex(int wordIndex,
                                int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. Caches the last signature index returned.
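
The caching described above can be pictured as a one-entry cache keyed on the (wordIndex, sentencePosition) pair. The following is a minimal sketch, not the actual Lexicon code: the field names mirror the fields documented on this page, while the class name SignatureCacheSketch, the -1 sentinel values, the slowLookup function, and the computations counter are assumptions made for illustration.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical one-entry cache. Field names mirror the Lexicon fields;
// the lookup function and the -1 sentinels are assumptions.
final class SignatureCacheSketch {
    private int lastWordToSignaturize = -1;
    private int lastSentencePosition = -1;
    private int lastSignatureIndex = -1;

    int computations = 0; // counts cache misses, for demonstration only

    private final IntBinaryOperator slowLookup; // stands in for the real computation

    SignatureCacheSketch(IntBinaryOperator slowLookup) {
        this.slowLookup = slowLookup;
    }

    int getSignatureIndex(int wordIndex, int sentencePosition) {
        if (wordIndex == lastWordToSignaturize
                && sentencePosition == lastSentencePosition) {
            return lastSignatureIndex; // repeated request: reuse the cached value
        }
        computations++;
        lastSignatureIndex = slowLookup.applyAsInt(wordIndex, sentencePosition);
        lastWordToSignaturize = wordIndex;
        lastSentencePosition = sentencePosition;
        return lastSignatureIndex;
    }

    public static void main(String[] args) {
        SignatureCacheSketch cache =
            new SignatureCacheSketch((w, pos) -> w * 31 + (pos == 0 ? 1 : 0));
        cache.getSignatureIndex(7, 0);
        cache.getSignatureIndex(7, 0); // served from the cache
        cache.getSignatureIndex(7, 4); // different position: recomputed
        System.out.println(cache.computations); // prints 2
    }
}
```

Because the sentence position is part of the key, the same unknown word looked up at a different position is recomputed, which is consistent with the note on lastSignatureIndex about one word receiving two signatures.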


getSignature

protected String getSignature(String word,
                              int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK-.*, which is assumed not to match any real word.

Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
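
As an illustration of the idea, here is a minimal sketch of a signature function. Only the UNK- prefix convention and the use of the sentence position come from the description above; the class name SignatureSketch, the particular features (digits, capitalization, a trailing -s), the suffix names, and the assumption that loc == 0 means sentence-initial are all hypothetical and may differ from what Lexicon actually does.

```java
// Hypothetical stand-in for Lexicon.getSignature; the feature set is assumed.
final class SignatureSketch {

    /** Maps an unknown word to an equivalence-class string of the form UNK-... */
    static String getSignature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");

        boolean hasDigit = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) {
                hasDigit = true;
                break;
            }
        }

        if (hasDigit) {
            sb.append("-NUM");
        } else if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            // A capital at the start of a sentence is less informative than one
            // elsewhere, so the two cases get distinct classes.
            // Assumption: loc == 0 marks the sentence-initial position.
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else {
            sb.append("-LC");
        }

        if (word.length() > 2 && word.endsWith("s")) {
            sb.append("-s");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(getSignature("Zorbs", 0)); // prints UNK-INITC-s
        System.out.println(getSignature("Zorbs", 3)); // prints UNK-CAPS-s
        System.out.println(getSignature("1984", 2));  // prints UNK-NUM
    }
}
```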

score

public double score(IntTaggedWord iTW,
                    int loc)

scoreAll

protected double scoreAll(List trees)

tune

public void tune(List trees)

printLexStats

public void printLexStats()

evaluateCoverage

public double evaluateCoverage(Collection trees,
                               Set missingWords,
                               Set missingTags,
                               Set missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. The first argument is the collection of trees; the second through fourth arguments are sets that receive the results.
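
The calling pattern, where the caller supplies empty sets that the method fills in, might look like the sketch below. It is a self-contained illustration, not the Lexicon implementation: it works on flat {word, tag} pairs rather than Tree objects, the class CoverageSketch and its add method are hypothetical, and the assumption that the return value is the fraction of covered word/tag pairs is not confirmed by this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the out-parameter pattern; it works on
// flat {word, tag} pairs instead of Tree objects.
final class CoverageSketch {
    private final Set<String> knownWords = new HashSet<>();
    private final Set<String> knownTags = new HashSet<>();
    private final Set<String> knownTaggings = new HashSet<>(); // "word/tag"

    void add(String word, String tag) {
        knownWords.add(word);
        knownTags.add(tag);
        knownTaggings.add(word + "/" + tag);
    }

    /** Returns the fraction of pairs whose word/tag combination is known (assumed meaning). */
    double evaluateCoverage(List<String[]> pairs, Set<String> missingWords,
                            Set<String> missingTags, Set<String> missingTW) {
        int covered = 0;
        for (String[] pair : pairs) {
            String word = pair[0];
            String tag = pair[1];
            if (!knownWords.contains(word)) missingWords.add(word);
            if (!knownTags.contains(tag)) missingTags.add(tag);
            if (knownTaggings.contains(word + "/" + tag)) {
                covered++;
            } else {
                missingTW.add(word + "/" + tag);
            }
        }
        return pairs.isEmpty() ? 0.0 : (double) covered / pairs.size();
    }

    public static void main(String[] args) {
        CoverageSketch lex = new CoverageSketch();
        lex.add("dog", "NN");
        lex.add("runs", "VBZ");
        List<String[]> test = Arrays.asList(
            new String[] {"dog", "NN"},
            new String[] {"runs", "NNS"},
            new String[] {"zorb", "NN"});
        Set<String> missingWords = new HashSet<>();
        Set<String> missingTags = new HashSet<>();
        Set<String> missingTW = new HashSet<>();
        double coverage = lex.evaluateCoverage(test, missingWords, missingTags, missingTW);
        System.out.println(coverage + " " + missingWords + " " + missingTags);
    }
}
```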


showTags

public String showTags()

readObject

protected void readObject(ObjectInputStream stream)
                   throws IOException,
                          ClassNotFoundException
Throws:
IOException
ClassNotFoundException

readData

public void readData(BufferedReader in)
              throws IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in.

Throws:
IOException

writeData

public void writeData(Writer w)
               throws IOException
Writes out data from this Object to the Writer w. Rules are separated by newline, and rule elements are delimited by \t.

Throws:
IOException
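
The layout described for writeData (one rule per line, elements separated by tabs) can be sketched as follows. This is an illustration under assumptions rather than the Lexicon code: the class RuleFormatSketch is hypothetical, the rule elements shown (a tag, a word, a flag, and a count) are made up for the example, and the reader shown here simply assumes that readData accepts the same layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a newline-separated, tab-delimited rule format.
final class RuleFormatSketch {

    /** Writes one rule per line, joining the rule's elements with tabs. */
    static void writeData(List<String[]> rules, Writer w) throws IOException {
        for (String[] rule : rules) {
            w.write(String.join("\t", rule));
            w.write("\n");
        }
        w.flush();
    }

    /** Reads the same layout back; assumes readData mirrors writeData. */
    static List<String[]> readData(BufferedReader in) throws IOException {
        List<String[]> rules = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.isEmpty()) {
                rules.add(line.split("\t", -1)); // -1 keeps trailing empty elements
            }
        }
        return rules;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rules = new ArrayList<>();
        rules.add(new String[] {"NN", "dog", "SEEN", "12.0"}); // made-up elements
        rules.add(new String[] {"VB", "run", "SEEN", "3.0"});

        StringWriter out = new StringWriter();
        writeData(rules, out);

        List<String[]> back =
            readData(new BufferedReader(new StringReader(out.toString())));
        System.out.println(back.size() + " rules, first tag " + back.get(0)[0]);
    }
}
```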


Stanford NLP Group