Stanford Arabic Parser IAQ

Questions

  1. What tokenization of Arabic does the parser assume?
  2. What character encoding do you assume?
  3. What characters are encoded?
  4. What POS tag set does the parser use?
  5. What phrasal category set does the parser use?
  6. What's not in the box?
  7. What data are the parsers trained on?
  8. How well do the parsers work?
  9. Can you give me some examples of how to use the parser for Arabic?
  10. Can you get dependencies output from the Arabic parser?
  11. Where does the Arabic-specific source code live?

Questions with answers

  1. What tokenization of Arabic does the parser assume?

    The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). You must provide input to the parser that is tokenized in this way or the resulting parses will be terrible. We do not currently provide a component that tokenizes Arabic and splits off clitics; the Arabic parser simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it isn't written down anywhere. Segmentation is done based on the morphological analyses generated by the Buckwalter analyzer. The segmentation can be characterized thus:

    There are some tools available from elsewhere that can do the necessary clitic segmentation:

  2. What character encoding do you assume?

    Grammars are provided both for real Arabic (arabicFactored.ser.gz) and for the Buckwalter encoding of Arabic in ASCII (arabicFactoredBuckwalter.ser.gz and atb3FactoredBuckwalter.ser.gz). For real Arabic, the default encoding is UTF-8, but another encoding can be specified on the command line with the -encoding charset flag.

  3. What characters are encoded?

    The parsers are trained on unvocalized Arabic. One grammar (atb3FactoredBuckwalter.ser.gz) is trained on input represented exactly as it is found in the Penn Arabic Treebank. The other two grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on a more normalized form of Arabic. This form deletes the tatweel character and other diacritics beyond the short vowel markers which are sometimes not written (Alef with hamza or madda becomes simply Alef, and Alef maksura becomes Yaa), and prefers ASCII characters (Arabic punctuation and number characters are mapped to corresponding ASCII characters). Your accuracy will suffer unless you normalize text in this way, because words are recognized simply based on string identity. [GALE ROSETTA: This is precisely the mapping that the IBM ar_normalize_v5.pl script does for you.]
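
    The normalization described above can be sketched roughly as follows. This is an illustration under assumptions, not the script used to prepare the training data: the function name normalize_arabic is hypothetical, and the exact sets of diacritics and punctuation characters are guesses based on the description in this answer.

```python
import re

# Assumed character classes; the exact set used for training is not spelled
# out in this FAQ, so treat every range below as a guess to verify.
_DELETE = re.compile("[\u0640\u064b-\u0652\u0670]")  # tatweel, vowel/shadda/sukun marks, dagger alef
_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064a",  # alef maksura -> yaa
    "\u060c": ",",       # Arabic comma -> ASCII comma
    "\u061b": ";",       # Arabic semicolon -> ASCII semicolon
    "\u061f": "?",       # Arabic question mark -> ASCII question mark
    **{chr(0x0660 + d): str(d) for d in range(10)},  # Arabic-Indic digits -> 0-9
})


def normalize_arabic(text):
    """Delete tatweel and diacritics, then map variant letters, punctuation, and digits."""
    return _DELETE.sub("", text).translate(_MAP)
```

    Because words are matched by string identity, the same function would need to be applied to every input sentence before parsing with the normalized grammars.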

  4. What POS tag set does the parser use?

    The parser uses an "augmented Bies" tag set. The so-called Bies mapping maps down the full morphological analyses from the Buckwalter analyzer to a subset of the POS tags used in the Penn English Treebank (but some with different meanings). We augment this set to represent which words have the determiner "Al" (ال) cliticized to them. These extra tags start with "DT", and appear for all parts of speech that can be preceded by "Al", so we have DTNN, DTCD, etc.
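
    If downstream code expects plain Bies tags, the augmentation can be undone with a few lines. The sketch below assumes that every tag longer than two characters that starts with "DT" is an augmented tag, and that the bare tag "DT" is an ordinary tag; the function name split_determiner_tag is hypothetical.

```python
def split_determiner_tag(tag):
    """Return (has_al, base_tag) for an augmented Bies POS tag.

    Assumption: only tags of the form "DT" + base tag mark a cliticized "Al";
    the bare tag "DT" is left untouched.
    """
    if tag.startswith("DT") and len(tag) > 2:
        return True, tag[2:]
    return False, tag
```

    For example, split_determiner_tag("DTNN") gives (True, "NN"), while split_determiner_tag("NN") gives (False, "NN").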

  5. What phrasal category set does the parser use?

    The set used in the Penn Arabic Treebank. See the Penn Arabic Treebank Guidelines.

  6. What's not in the box?

    We at present provide no components for normalizing or segmenting Arabic text. You might look at segmentation tools from CADIM, such as the one available on Mona Diab's homepage, but note that if they also separate off the "Al" (ال) clitic, then you will need to glue it back on in a postprocessing step. [GALE ROSETTA: IBM has an ATB segmenter and a Perl script that does the appropriate normalization. Their segmenter marks proclitics and enclitics with '#' and '+'. These need to be removed for parsing, but we do provide an escaper which does this.]
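
    As a rough illustration of what removing the IBM markers involves, here is a small Python sketch. It is not a reimplementation of the supplied escaper, whose exact behavior is not described here. It assumes that a proclitic token ends in '#' and an enclitic token begins with '+', and that a token consisting of a single marker character is real text; the function name strip_ibm_markers is hypothetical.

```python
def strip_ibm_markers(line):
    """Remove segmentation markers from whitespace-separated tokens.

    Assumptions: proclitics carry a trailing '#', enclitics a leading '+',
    and a lone '#' or '+' token is ordinary text that must be kept.
    """
    cleaned = []
    for token in line.split():
        if len(token) > 1 and token.endswith("#"):
            token = token[:-1]
        if len(token) > 1 and token.startswith("+"):
            token = token[1:]
        cleaned.append(token)
    return " ".join(cleaned)
```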

  7. What data are the parsers trained on?

    Two of the 3 grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on the training data of the "Mona Diab" (a.k.a. "Johns Hopkins 2005 Workshop") data splits of parts 1-3 of the Penn Arabic Treebank. We apply automatic transformations to map ATBp2 to use (mostly) the same POS tag set as ATBp1v3 and ATBp3v2. The other grammar (atb3FactoredBuckwalter.ser.gz) is trained on a decimation of the ATBp3v2 treebank data. (That is, heading sentence-by-sentence through the trees, you put 8 sentences in training, 1 in development, and then 1 in test, and then repeat.) This is the data split that has been used at UPenn (see S. Kulick et al., TLT 2006).
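
    The decimation split is easy to reproduce. The following Python sketch shows the 8/1/1 cycle described above; the function name decimation_split is hypothetical, and the sketch assumes the sentences are already in treebank order.

```python
def decimation_split(sentences):
    """Split sentences into train/dev/test in repeating blocks of 8, 1, and 1."""
    train, dev, test = [], [], []
    for i, sentence in enumerate(sentences):
        position = i % 10  # position within the current block of ten
        if position < 8:
            train.append(sentence)
        elif position == 8:
            dev.append(sentence)
        else:
            test.append(sentence)
    return train, dev, test
```

    Because neighboring sentences from the same article end up on both sides of the split, test sentences closely resemble training sentences, which is one way to see why such a split favors the parser.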

  8. How well do the parsers work?

    The table below shows the parser's performance on the development test data sets, as defined above. Here, "factF1" is the Parseval F1 of Labeled Precision and Recall, and "factDA" is the dependency accuracy of the factored parser (based on untyped dependencies imputed from "head rules"). These figures are for sentences of 40 words or fewer, discarding "sentences" (the bylines at the start of articles) that are just an "X" constituent. The performance of the UTF-8 and Buckwalter grammars is basically identical, because only the character encoding is different (and so it is not shown separately). Note that we do get value from the extra data in parts 1 and 2 of the ATB (more value than it first appears, because a decimation data split is always advantageous to a parser), and that dependency accuracy is relatively better than constituency accuracy (we regard this as evidence of inconsistent constituency annotation in the ATB).

                                   factF1   factDA  factEx  pcfgF1  depDA   factTA   num
    arabicFactored.ser.gz          77.44    84.05   13.27   69.49   80.07   96.09   1567
    atb3FactoredBuckwalter.ser.gz  75.76    83.08   14.41   68.09   77.75   95.91    951
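
    For readers unfamiliar with the metric, the sketch below shows how a Parseval-style labeled F1 can be computed from two sets of labeled brackets. It illustrates the harmonic mean of labeled precision and recall only; the parser's own evaluation may differ in details such as which constituents are counted. The function name parseval_f1 and the (label, start, end) bracket representation are assumptions made for this example.

```python
from collections import Counter


def parseval_f1(gold_brackets, guess_brackets):
    """F1 of labeled precision and recall over (label, start, end) brackets."""
    gold, guess = Counter(gold_brackets), Counter(guess_brackets)
    matched = sum((gold & guess).values())  # brackets present in both analyses
    precision = matched / sum(guess.values()) if guess else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

    With four gold brackets and three guessed brackets of which two match, precision is 2/3, recall is 1/2, and F1 is 4/7.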
    
  9. Can you give me some examples of how to use the parser for Arabic?

    Sure! These parsing examples are for the 3 test files supplied with the parser. They assume you are sitting in the root directory of the parser distribution. [GALE ROSETTA: The last illustrates the removal of the IBM "+" and "#" marks mentioned earlier.]

    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactored.ser.gz arabic-onesent-utf8.txt
    Loading parser from serialized file arabicFactored.ser.gz ... done [14.3 sec].
    Parsing file: arabic-onesent-utf8.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
    (ROOT
      (S (CC و)
        (VP (VBD نشر)
          (NP (DTNN العدل))
          (PP (IN من)
            (NP (NN خلال)
              (NP (NN قضاء) (JJ مستقل)))))
        (PUNC .)))
    
    Parsed file: arabic-onesent-utf8.txt [1 sentences].
    Parsed 8 words in 1 sentences (5.15 wds/sec; 0.64 sents/sec).
    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactoredBuckwalter.ser.gz arabic-onesent-buck.txt
    Loading parser from serialized file arabicFactoredBuckwalter.ser.gz ... done [9.4 sec].
    Parsing file: arabic-onesent-buck.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: w n$r AlEdl mn xlAl qDA' mstql .
    (ROOT
      (S (CC w)
        (VP (VBD n$r)
          (NP (DTNN AlEdl))
          (PP (IN mn)
            (NP (NN xlAl)
              (NP (NN qDA') (JJ mstql)))))
        (PUNC .)))
    
    Parsed file: arabic-onesent-buck.txt [1 sentences].
    Parsed 8 words in 1 sentences (7.92 wds/sec; 0.99 sents/sec).
    $ cat arabic-onesent-ibm-utf8.txt 
    و# نشر العدل من خلال قضاء مستقل .
    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper arabicFactored.ser.gz arabic-onesent-ibm-utf8.txt
    Loading parser from serialized file arabicFactored.ser.gz ... done [9.3 sec].
    Parsing file: arabic-onesent-ibm-utf8.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
    (ROOT
      (S (CC و)
        (VP (VBD نشر)
          (NP (DTNN العدل))
          (PP (IN من)
            (NP (NN خلال)
              (NP (NN قضاء) (JJ مستقل)))))
        (PUNC .)))
    
    Parsed file: arabic-onesent-ibm-utf8.txt [1 sentences].
    Parsed 8 words in 1 sentences (5.87 wds/sec; 0.73 sents/sec).
    
  10. Can you get dependencies output from the Arabic parser?

    You can ask for dependencies output, with the -outputFormat dependencies option. At present, there is no typed dependencies (grammatical relations) analysis available for Arabic, and so asking for typedDependencies will throw an UnsupportedOperationException. (Caution: With UTF-8 Arabic, the dependencies output may appear to be reversed, because dependencies are being displayed right-to-left (depending on the bidi support of your terminal program). But they are correct, really.)
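
    When driving the parser from a script, it can help to assemble the command line programmatically. The Python sketch below only builds the argument list; it uses the jar name, memory setting, class name, and flags that appear in this FAQ, and assumes that options are placed before the grammar and input file, as in the -escaper example. The function name build_parser_command is hypothetical.

```python
def build_parser_command(grammar, input_file, output_format=None,
                         encoding=None, escaper=None):
    """Build the java command line for the parser as a list of arguments.

    Assumption: options go before the grammar and the input file.
    """
    command = ["java", "-cp", "stanford-parser.jar", "-mx500m",
               "edu.stanford.nlp.parser.lexparser.LexicalizedParser"]
    if encoding is not None:
        command += ["-encoding", encoding]
    if escaper is not None:
        command += ["-escaper", escaper]
    if output_format is not None:
        command += ["-outputFormat", output_format]
    return command + [grammar, input_file]
```

    The resulting list can be passed to subprocess.run from the root directory of the parser distribution, for example with output_format="dependencies" to request dependencies output.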

  11. Where does the Arabic-specific source code live?

    Much of the Arabic-specific code, including the ArabicHeadFinder and the ArabicTreebankLanguagePack, is defined inside the edu.stanford.nlp.trees.international.arabic package. But parser-specific code and the top-level entry to Arabic language resources are found in the edu.stanford.nlp.parser.lexparser package. There, you will find the classes ArabicTreebankParserParams and ArabicUnknownWordSignatures.

For general questions, see also the Parser FAQ. Please send any other questions or feedback, or extensions and bugfixes, to parser-support@lists.stanford.edu.