Class EnglishUnknownWordModel

    • Field Detail

      • smartMutation

        protected final boolean smartMutation
      • unknownSuffixSize

        protected final int unknownSuffixSize
      • unknownPrefixSize

        protected final int unknownPrefixSize
      • wordClassesFile

        protected final java.lang.String wordClassesFile
    • Constructor Detail

      • EnglishUnknownWordModel

        public EnglishUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<java.lang.String> wordIndex,
                               Index<java.lang.String> tagIndex)
        This constructor creates a UWM (unknown word model) with empty data structures. Use it only when loading the data separately, such as by reading in text lines containing the data.
    • Method Detail

      • score

        public float score(IntTaggedWord iTW,
                  int loc,
                  double c_Tseen,
                  double total,
                  double smooth,
                  java.lang.String word)
        Description copied from class: BaseUnknownWordModel
        Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
        Specified by:
        score in interface UnknownWordModel
        score in class BaseUnknownWordModel
        Parameters:
        iTW - An IntTaggedWord pairing a word and POS tag
        loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
        c_Tseen - Total count of this tag (on seen words) in training
        total - Total count of word tokens in training
        smooth - Weighting on prior P(T|U) in estimate
        word - The word itself; useful so we don't look it up in the index
        Returns:
        A float-valued score, usually -log P(word|tag)
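The parameter descriptions above suggest an estimate in which observed counts are smoothed toward a prior P(T|U), with smooth as the weight on that prior. The following is a minimal sketch of one plausible reading, not the library's actual implementation: the class name, method names, and the signature-level counts (cTagSig, cSig) are hypothetical, and the use of Bayes' rule to turn P(tag|signature) into P(signature|tag) is an assumption. It returns a log probability; negate the result if a cost is wanted.

```java
final class UnknownWordScoreSketch {

    /**
     * Hypothetical smoothed estimate of P(tag | signature):
     * observed counts plus 'smooth' pseudo-counts distributed by the prior P(tag | unknown).
     */
    static double smoothedTagGivenSignature(double cTagSig, double cSig,
                                            double priorTagGivenUnknown, double smooth) {
        return (cTagSig + smooth * priorTagGivenUnknown) / (cSig + smooth);
    }

    /**
     * Assumed Bayes inversion: P(sig | tag) = P(tag | sig) * P(sig) / P(tag),
     * with P(sig) = cSig / total and P(tag) = cTag / total. Returns the natural log.
     */
    static float logSignatureGivenTag(double pTagGivenSig, double cSig,
                                      double cTag, double total) {
        double pSig = cSig / total;
        double pTag = cTag / total;
        return (float) Math.log(pTagGivenSig * pSig / pTag);
    }

    public static void main(String[] args) {
        // Toy counts, invented for illustration only.
        double pTagGivenSig = smoothedTagGivenSignature(3.0, 9.0, 0.5, 1.0);
        float logP = logSignatureGivenTag(pTagGivenSig, 10.0, 20.0, 100.0);
        System.out.println("P(tag|sig) = " + pTagGivenSig + ", log P(sig|tag) = " + logP);
    }
}
```

With no observations for a signature (both counts zero), the smoothed estimate in this sketch falls back to the prior, which is the usual motivation for weighting a prior this way.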
      • getSignatureIndex

        public int getSignatureIndex(int index,
                            int sentencePosition,
                            java.lang.String word)
        Returns the index of the signature of the word numbered index, where the signature is the String representation of the word's unknown-word features.
        Specified by:
        getSignatureIndex in interface UnknownWordModel
        getSignatureIndex in class BaseUnknownWordModel
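As a rough illustration of the idea behind getSignatureIndex, the sketch below maps a word to a signature string and then interns that string in a small index, so that words sharing a signature share an integer id. Everything here is hypothetical: the class, the toySignature rule (digits versus everything else), and the index itself are stand-ins, not the library's Index or its real signature logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SignatureIndexSketch {
    private final List<String> items = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    /** Returns the id of s, adding it to the index if it is not present yet. */
    int addToIndex(String s) {
        Integer existing = positions.get(s);
        if (existing != null) {
            return existing;
        }
        int id = items.size();
        items.add(s);
        positions.put(s, id);
        return id;
    }

    /** Hypothetical stand-in signature: words containing a digit get their own class. */
    static String toySignature(String word, int sentencePosition) {
        boolean hasDigit = word.chars().anyMatch(Character::isDigit);
        return hasDigit ? "UNK-NUM" : "UNK";
    }

    /** Computes the signature of the word and returns that signature's id in the index. */
    int signatureIndex(String word, int sentencePosition) {
        return addToIndex(toySignature(word, sentencePosition));
    }

    public static void main(String[] args) {
        SignatureIndexSketch index = new SignatureIndexSketch();
        System.out.println(index.signatureIndex("1984", 0));
        System.out.println(index.signatureIndex("cat", 1));
    }
}
```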
      • getSignature

        public java.lang.String getSignature(java.lang.String word,
                                    int loc)
        This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK(-.+)? , which is simply assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.
        Specified by:
        getSignature in interface UnknownWordModel
        getSignature in class BaseUnknownWordModel
        Parameters:
        word - The word to make a signature for
        loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
        Returns:
        A String that is its signature (equivalence class)
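To make the level 1 description more concrete (last two characters plus a simple capitalization class), here is a minimal sketch. It is not the library's code: the class name and the feature labels (-INITC, -CAPS, -LC) are chosen for illustration, and the real implementation may order or name its features differently. The only property taken from the documentation is that results follow the UNK(-.+)? convention.

```java
import java.util.Locale;

final class SignatureSketch {

    /** Builds an illustrative signature: "UNK", a capitalization class, then a two-character suffix. */
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        if (word.isEmpty()) {
            return sb.toString();
        }

        boolean hasLower = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i))) {
                hasLower = true;
                break;
            }
        }

        // Capitalization class; sentence-initial capitals (loc == 0) are kept separate
        // because capitalization there carries less information.
        if (Character.isUpperCase(word.charAt(0))) {
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else if (hasLower) {
            sb.append("-LC");
        }

        // Last two characters (or the whole word if it is shorter), lowercased.
        String suffix = word.length() > 2 ? word.substring(word.length() - 2) : word;
        sb.append('-').append(suffix.toLowerCase(Locale.ROOT));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Running", 0));
        System.out.println(signature("Running", 3));
        System.out.println(signature("dogs", 1));
    }
}
```

Separating the sentence-initial case mirrors the stated purpose of the loc parameter: a capitalized word at position 0 is weaker evidence of a proper noun than the same word mid-sentence.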

Stanford NLP Group