Class ArabicUnknownWordModel

    • Field Detail

      • smartMutation

        protected final boolean smartMutation
      • unknownSuffixSize

        protected final int unknownSuffixSize
      • unknownPrefixSize

        protected final int unknownPrefixSize
    • Constructor Detail

      • ArabicUnknownWordModel

        public ArabicUnknownWordModel(Options op,
                              Lexicon lex,
                              Index<java.lang.String> wordIndex,
                              Index<java.lang.String> tagIndex)
        This constructor creates a UWM (unknown word model) with empty data structures. Use it only if the data is loaded separately, for example by reading in text lines containing the data.
    • Method Detail

      • score

        public float score(IntTaggedWord iTW,
                  int loc,
                  double c_Tseen,
                  double total,
                  double smooth,
                  java.lang.String word)
        Description copied from class: BaseUnknownWordModel
        Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
        Specified by:
        score in interface UnknownWordModel
        score in class BaseUnknownWordModel
        Parameters:
        iTW - An IntTaggedWord pairing a word and POS tag
        loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
        c_Tseen - Total count of this tag (on seen words) in training
        total - Total count of word tokens in training
        smooth - Weighting on prior P(T|U) in estimate
        word - The word itself; useful so we don't look it up in the index
        Returns:
        A float-valued score, usually - log P(word|tag)
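To make the roles of c_Tseen, total, and smooth more concrete, the sketch below shows one conventional way such quantities could be combined: a smoothed estimate of P(tag|signature) that backs off to the prior P(T|U), inverted with Bayes' rule into log P(word|tag). The class name, the method, the signature counts, and the formula itself are assumptions made for illustration; nothing here is taken from this class's implementation.

```java
final class UnknownWordScoreSketch {

    /**
     * Hypothetical smoothed score for an unknown word.
     *
     * @param cTagSig          count of this tag with the word's signature (assumed input)
     * @param cSig             count of the word's signature (assumed input)
     * @param pTagGivenUnknown prior P(T|U) for this tag (assumed input)
     * @param cTseen           total count of this tag on seen words
     * @param total            total count of word tokens
     * @param smooth           weighting on the prior P(T|U)
     * @return an estimate of log P(word|tag)
     */
    static double logProbWordGivenTag(double cTagSig, double cSig,
                                      double pTagGivenUnknown,
                                      double cTseen, double total,
                                      double smooth) {
        // Smoothed P(tag | signature), backing off to the prior P(T|U).
        double pTagGivenSig = (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
        // Relative frequency of the tag among seen words.
        double pTag = cTseen / total;
        // Treat the unknown word as if it had been seen once (an assumption).
        double pWord = 1.0 / total;
        // Bayes' rule: P(word | tag) = P(tag | word) * P(word) / P(tag).
        return Math.log(pTagGivenSig * pWord / pTag);
    }
}
```

A larger smooth pulls the estimate toward the prior P(T|U), which matters most for signatures that were rarely observed in training.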
      • getSignatureIndex

        public int getSignatureIndex(int index,
                            int sentencePosition,
                            java.lang.String word)
        Returns the index of the signature of the word numbered index, where the signature is the String representation of unknown word features.
        Specified by:
        getSignatureIndex in interface UnknownWordModel
        getSignatureIndex in class BaseUnknownWordModel
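The idea of mapping a signature String to a stable integer index can be sketched as follows. This is a minimal stand-in, not the library's Index class: SignatureIndexSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SignatureIndexSketch {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> signatures = new ArrayList<>();

    /** Returns the existing index of the signature, or assigns the next free one. */
    int indexFor(String signature) {
        Integer existing = indexOf.get(signature);
        if (existing != null) {
            return existing;
        }
        int next = signatures.size();
        signatures.add(signature);
        indexOf.put(signature, next);
        return next;
    }

    /** Returns the signature stored at the given index. */
    String get(int index) {
        return signatures.get(index);
    }
}
```

Because every word with the same signature maps to the same index, counts gathered for one unknown word are shared with all words in its equivalence class.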
      • getSignature

        public java.lang.String getSignature(java.lang.String word,
                                    int loc)
        Unknown-word levels 6-9 were added for Arabic. Level 6 looks for the prefix Al- (and knows that Buckwalter uses various symbols as letters), while level 7 just looks for numbers and the last letter. Level 8 looks for Al-, looks for several useful suffixes, and tracks the first letter of the word. (Note that the first letter seems a bit more informative than the last letter, overall.) Level 9 tries to build on level 8 while avoiding some of its perceived flaws: in effect, level 8 was using both the first AND the last letter.
        Specified by:
        getSignature in interface UnknownWordModel
        getSignature in class BaseUnknownWordModel
        Parameters:
        word - The word to make a signature for
        loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
        Returns:
        A String that is its signature (equivalence class)
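As a rough illustration of the kind of features described for level 8 (an Al- prefix check, a suffix check, and the first letter), plus the number check mentioned for level 7, here is a minimal sketch. It is not the library's implementation: the class name, the feature tags, and the suffix list are hypothetical, the input is assumed to be Buckwalter-transliterated, and the sentence position is ignored.

```java
final class ArabicSignatureSketch {
    // Hypothetical suffix list, chosen only for illustration.
    private static final String[] SUFFIXES = {"wn", "yn", "At", "p"};

    static String signature(String word) {
        StringBuilder sb = new StringBuilder("UNK");
        // Al- prefix, assuming the Buckwalter spelling "Al".
        if (word.length() > 2 && word.startsWith("Al")) {
            sb.append("-Al");
        }
        // Flag words that contain at least one digit.
        if (word.chars().anyMatch(Character::isDigit)) {
            sb.append("-NUM");
        }
        // Record the first matching suffix, if any.
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                sb.append('-').append(suffix);
                break;
            }
        }
        // Track the first character of the word.
        if (!word.isEmpty()) {
            sb.append('-').append(word.charAt(0));
        }
        return sb.toString();
    }
}
```

Words that share all of these features receive the same signature String and therefore fall into the same equivalence class.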
      • getUnknownLevel

        public int getUnknownLevel()
        Description copied from interface: UnknownWordModel
        Get the level of equivalence classing for the model. One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level can be queried with this method.
        Specified by:
        getUnknownLevel in interface UnknownWordModel
        getUnknownLevel in class BaseUnknownWordModel
        Returns:
        The current level of unknown word equivalence classing

Stanford NLP Group