edu.stanford.nlp.parser.lexparser

Class EnglishUnknownWordModel

    • Field Detail

      • smartMutation

        protected final boolean smartMutation
      • unknownSuffixSize

        protected final int unknownSuffixSize
      • unknownPrefixSize

        protected final int unknownPrefixSize
      • wordClassesFile

        protected final String wordClassesFile
    • Method Detail

      • score

        public float score(IntTaggedWord iTW,
                           int loc,
                           double c_Tseen,
                           double total,
                           double smooth,
                           String word)
        Description copied from class: BaseUnknownWordModel
        Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
        Specified by:
        score in interface UnknownWordModel
        Overrides:
        score in class BaseUnknownWordModel
        Parameters:
        iTW - An IntTaggedWord pairing a word and POS tag
        loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
        c_Tseen - Total count of this tag (on seen words) in training
        total - Total count of word tokens in training
        smooth - Weighting on prior P(T|U) in estimate
        word - The word itself; useful so we don't look it up in the index
        Returns:
        A float-valued score, usually - log P(word|tag)
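
To make the roles of c_Tseen, total, and smooth more concrete, here is a minimal sketch of one way such quantities could be combined. It is an illustration under stated assumptions, not the code of EnglishUnknownWordModel: the class name, the method names, and the count arguments cTagSig, cSig, and pSig are hypothetical. The sketch smooths the tag distribution of a signature toward the prior P(T|U), then inverts it with Bayes' rule to get a log P(signature|tag) estimate.

```java
// Hypothetical sketch; names and formulas are assumptions, not the Stanford implementation.
class SmoothedScoreSketch {

    // Additive smoothing toward a prior: with smooth = 0 this is the raw
    // relative frequency; as smooth grows, the estimate approaches the prior.
    static double smoothedTagGivenSignature(double cTagSig, double cSig,
                                            double priorTagGivenUnknown, double smooth) {
        return (cTagSig + smooth * priorTagGivenUnknown) / (cSig + smooth);
    }

    // Bayes inversion: P(sig|tag) = P(tag|sig) * P(sig) / P(tag),
    // where P(tag) is estimated as c_Tseen / total. Returned as a log, cast to float.
    static float logSignatureGivenTag(double pTagGivenSig, double pSig,
                                      double cTseen, double total) {
        double pTag = cTseen / total;
        return (float) Math.log(pTagGivenSig * pSig / pTag);
    }

    public static void main(String[] args) {
        double pTagGivenSig = smoothedTagGivenSignature(3.0, 10.0, 0.5, 2.0);
        System.out.println(pTagGivenSig);
        System.out.println(logSignatureGivenTag(pTagGivenSig, 0.1, 50.0, 100.0));
    }
}
```

In this sketch, a signature that was never observed (cSig = 0) falls back entirely to the prior, which is the usual motivation for weighting a prior into the estimate.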
      • getSignature

        public String getSignature(String word,
                                   int loc)
        This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK(-.+)?, which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.
        Specified by:
        getSignature in interface UnknownWordModel
        Overrides:
        getSignature in class BaseUnknownWordModel
        Parameters:
        word - The word to make a signature for
        loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
        Returns:
        A String that is its signature (equivalence class)
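
As a rough illustration of the level-1 idea described above (last two characters plus a simple capitalization class), the following sketch builds a string that follows the UNK(-.+)? convention. It is not the library's implementation: the class name, the method names, and the capitalization labels (INITC, CAPS, ALLC, LC, NONE) are hypothetical choices made for this example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; labels and logic are assumptions, not the Stanford implementation.
class SignatureSketch {

    // The convention stated in the documentation: signatures match UNK(-.+)?
    static final Pattern SIGNATURE = Pattern.compile("UNK(-.+)?");

    // Simple capitalization class. loc == 0 is treated as sentence-initial so that
    // capitalized words at the start of a sentence get their own label.
    static String capitalizationClass(String word, int loc) {
        if (word.isEmpty()) {
            return "NONE";
        }
        boolean hasLower = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i))) {
                hasLower = true;
            }
        }
        if (Character.isUpperCase(word.charAt(0))) {
            if (!hasLower) {
                return "ALLC";
            }
            return loc == 0 ? "INITC" : "CAPS";
        }
        return hasLower ? "LC" : "NONE";
    }

    // Combine the capitalization class with the lowercased last two characters.
    static String sketchSignature(String word, int loc) {
        String lower = word.toLowerCase(Locale.ROOT);
        String suffix = lower.length() > 2 ? lower.substring(lower.length() - 2) : lower;
        return "UNK-" + capitalizationClass(word, loc) + "-" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(sketchSignature("Dogs", 0));
        System.out.println(sketchSignature("Dogs", 3));
        System.out.println(SIGNATURE.matcher(sketchSignature("running", 2)).matches());
    }
}
```

With these hypothetical labels, "Dogs" at position 0 maps to UNK-INITC-gs, while the same word later in the sentence maps to UNK-CAPS-gs, which shows why loc is part of the method signature.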

Stanford NLP Group