The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). You must provide input to the parser that is tokenized in this way or the resulting parses will be terrible. We do not currently provide a component that tokenizes Arabic and splits off clitics; the Arabic parser simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it isn't written down anywhere. Segmentation is done based on the morphological analyses generated by the Buckwalter analyzer. The segmentation can be characterized thus:
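Clitics other than the determiner "Al" (ال) are split off as separate tokens. [GALE ROSETTA: The '#' and '+' marks that the IBM segmenter puts on clitics can be stripped with the option -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper]
Parentheses are rendered as -LRB- and -RRB-.
Quotes are rendered as straight ASCII quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank).
There are some tools available from elsewhere that can do the necessary clitic segmentation; see the notes on segmentation tools below.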
Grammars are provided both for real Arabic
(arabicFactored.ser.gz), for which the default encoding is
UTF-8, but another encoding can be specified on the command line with
the -encoding charset flag,
and for the
Buckwalter encoding
of Arabic in ASCII
(arabicFactoredBuckwalter.ser.gz and
atb3FactoredBuckwalter.ser.gz).
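To make the relation between the two encodings concrete, here is a minimal sketch of a UTF-8 to Buckwalter transliterator. It assumes the standard Buckwalter letter table and covers only the letters (no diacritics, since the grammars are trained on unvocalized text). The class BuckwalterSketch is hypothetical and is not part of the parser distribution; we have not checked it against the exact Buckwalter variant used to train the grammars.

```java
// Hypothetical helper for illustration; not part of the parser distribution.
class BuckwalterSketch {
    // Buckwalter symbols for U+0621..U+063A (hamza through ghain), in code point order.
    private static final String LOW = "'|>&<}AbptvjHxd*rzs$SDTZEg";
    // Buckwalter symbols for U+0641..U+064A (feh through yeh), in code point order.
    private static final String HIGH = "fqklmnhwYy";

    static String toBuckwalter(String arabic) {
        StringBuilder out = new StringBuilder(arabic.length());
        for (int i = 0; i < arabic.length(); i++) {
            char c = arabic.charAt(i);
            if (c >= '\u0621' && c <= '\u063A') {
                out.append(LOW.charAt(c - '\u0621'));
            } else if (c >= '\u0641' && c <= '\u064A') {
                out.append(HIGH.charAt(c - '\u0641'));
            } else {
                out.append(c); // spaces, ASCII punctuation, and anything unmapped pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sentence from arabic-onesent-utf8.txt, written with Unicode escapes.
        String utf8 = "\u0648 \u0646\u0634\u0631 \u0627\u0644\u0639\u062F\u0644 \u0645\u0646 "
                    + "\u062E\u0644\u0627\u0644 \u0642\u0636\u0627\u0621 \u0645\u0633\u062A\u0642\u0644 .";
        System.out.println(toBuckwalter(utf8));
    }
}
```

Run on the UTF-8 test sentence, this prints w n$r AlEdl mn xlAl qDA' mstql . which is the same string as in the Buckwalter test file.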
The parsers are trained on unvocalized Arabic. One grammar
(atb3FactoredBuckwalter.ser.gz) is
trained on input represented exactly as it is found in the Penn Arabic
Treebank.
The other two grammars
(arabicFactored.ser.gz and
arabicFactoredBuckwalter.ser.gz) are
trained on a more normalized form of Arabic. This form deletes the
tatweel character and other diacritics beyond the short vowel markers which
are sometimes not written
(Alef with hamza or madda becomes simply Alef, and Alef maksura becomes
Yaa), and prefers ASCII characters (Arabic punctuation and number
characters are mapped to corresponding ASCII characters). Your accuracy
will suffer unless you normalize text in this way, because words are
recognized simply based on string identity.
[GALE ROSETTA: This is precisely the mapping that the IBM
ar_normalize_v5.pl script does for you.]
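If you need to do this normalization yourself, the following is a rough sketch of the mapping described above. It is our reading of that description, not the IBM script and not the code used to prepare the training data: the class name is hypothetical, and the exact character sets (which diacritics, which punctuation marks) are assumptions.

```java
// Hypothetical normalizer for illustration; the character sets are assumptions.
class ArabicNormalizerSketch {
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0640') continue;                    // tatweel: delete
            if (c >= '\u064B' && c <= '\u0652') continue;   // tanween, short vowels, shadda, sukun: delete
            switch (c) {
                case '\u0622':                              // Alef with madda
                case '\u0623':                              // Alef with hamza above
                case '\u0625':                              // Alef with hamza below
                    out.append('\u0627');                   // becomes bare Alef
                    break;
                case '\u0649':                              // Alef maksura
                    out.append('\u064A');                   // becomes Yaa
                    break;
                case '\u060C': out.append(','); break;      // Arabic comma
                case '\u061B': out.append(';'); break;      // Arabic semicolon
                case '\u061F': out.append('?'); break;      // Arabic question mark
                default:
                    if (c >= '\u0660' && c <= '\u0669') {
                        out.append((char) ('0' + (c - '\u0660'))); // Arabic-Indic digit to ASCII digit
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

Because words are matched by string identity, it is worth spot-checking the output of any such normalizer against tokens from the test files shipped with the parser.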
The parser uses an "augmented Bies" tag set. The so-called Bies mapping maps down the full morphological analyses from the Buckwalter analyzer to a subset of the POS tags used in the Penn English Treebank (but some with different meanings). We augment this set to represent which words have the determiner "Al" (ال) cliticized to them. These extra tags start with "DT", and appear for all parts of speech that can be preceded by "Al", so we have DTNN, DTCD, etc.
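If downstream code wants the plain Bies tag, the "DT" prefix can be peeled off again. A small sketch, assuming that a bare "DT" tag is an ordinary determiner tag and that every longer tag starting with "DT" is one of the augmented tags (the class and method names are hypothetical):

```java
// Hypothetical helper for illustration; not part of the parser distribution.
class AlTagSketch {
    // True for augmented tags such as DTNN or DTCD, false for plain tags (including DT itself).
    static boolean hasAlDeterminer(String tag) {
        return tag.length() > 2 && tag.startsWith("DT");
    }

    // Maps DTNN to NN, DTCD to CD, and leaves all other tags unchanged.
    static String baseTag(String tag) {
        return hasAlDeterminer(tag) ? tag.substring(2) : tag;
    }
}
```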
For phrasal categories, the parser uses the set used in the Penn Arabic Treebank. See the Penn Arabic Treebank Guidelines.
We at present provide no components for normalizing or segmenting Arabic text. You might look at segmentation tools from CADIM, such as the one available on Mona Diab's homepage, but note that if they also separate off the "Al" (ال) clitic, then you will need to glue it back on in a postprocessing step. [GALE ROSETTA: IBM has an ATB segmenter and a Perl script that does the appropriate normalization. Their segmenter marks proclitics and enclitics with '#' and '+'. These need to be removed for parsing, but we do provide an escaper which does this.]
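The "glue it back on" step might look like the sketch below. It assumes the segmenter emits the determiner as a standalone token consisting of exactly ال; real tools may mark it differently (for example with an attached '+'), so treat the token test as a placeholder. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical postprocessor for illustration; the standalone-token format is an assumption.
class AlRejoinSketch {
    static final String AL = "\u0627\u0644"; // Alef + Lam

    // Re-attaches a standalone "Al" token to the token that follows it.
    static List<String> rejoinAl(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (tok.equals(AL) && i + 1 < tokens.size()) {
                out.add(AL + tokens.get(i + 1));
                i++; // skip the token we just merged
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}
```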
Two of the three grammars
(arabicFactored.ser.gz and
arabicFactoredBuckwalter.ser.gz) are
trained on the training data of the "Mona Diab"
(a.k.a. "Johns Hopkins 2005 Workshop") data splits
of parts 1-3 of the Penn Arabic Treebank. We apply automatic
transformations to map ATBp2 to use (mostly) the same POS tag set as
ATBp1v3 and ATBp3v2.
The other grammar
(atb3FactoredBuckwalter.ser.gz) is
trained on a decimation of the ATBp3v2 treebank data. (That is,
heading sentence-by-sentence through the trees, you put 8 sentences in
training, 1 in development, and then 1 in test, and then repeat.) This
is the data split that has been used at UPenn (see S. Kulick et al., TLT 2006).
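The decimation split just described can be sketched as follows. This is only an illustration of the 8/1/1 cycle, not the script that was actually used to split the treebank; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an 8/1/1 decimation split.
class DecimationSplitSketch {
    // Returns three lists in the order: training, development, test.
    static <T> List<List<T>> split(List<T> sentences) {
        List<T> train = new ArrayList<>();
        List<T> dev = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            int slot = i % 10;          // position within the current block of 10 sentences
            if (slot < 8) {
                train.add(sentences.get(i));
            } else if (slot == 8) {
                dev.add(sentences.get(i));
            } else {
                test.add(sentences.get(i));
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(dev);
        parts.add(test);
        return parts;
    }
}
```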
The table below shows the parser's performance on the development test data sets, as defined above. Here, "factF1" is the Parseval F1 of Labeled Precision and Recall, and "factDA" is the dependency accuracy of the factored parser (based on untyped dependencies imputed from "head rules"). This is for sentences of 40 words or less, and discarding "sentences" (the bylines at the start of articles) that are just an "X" constituent. The performance of the UTF-8 and Buckwalter grammars is basically identical, because only the character encoding is different (and so it is not shown separately). Note that we do get value from the extra data in parts 1 and 2 of the ATB (more value than it first appears, because a decimation data split is always advantageous to a parser), and that dependency accuracy is relatively better than constituency accuracy (we regard this as evidence of inconsistent constituency annotation in the ATB).
Grammar                         factF1  factDA  factEx  pcfgF1  depDA  factTA   num
arabicFactored.ser.gz            77.44   84.05   13.27   69.49  80.07   96.09  1567
atb3FactoredBuckwalter.ser.gz    75.76   83.08   14.41   68.09  77.75   95.91   951
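For readers unfamiliar with the metric, here is a minimal sketch of how an F1 over labeled brackets is computed for a single sentence. It is not the parser's evaluation code. Brackets are represented as hypothetical strings of the form "LABEL:start-end", and details of the real Parseval procedure (such as which nodes are counted and how scores are aggregated over a corpus) are left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-sentence scorer for illustration only.
class LabeledF1Sketch {
    static double f1(List<String> gold, List<String> guess) {
        // Count gold brackets as a multiset so duplicates are matched at most once each.
        Map<String, Integer> remaining = new HashMap<>();
        for (String b : gold) {
            remaining.merge(b, 1, Integer::sum);
        }
        int matched = 0;
        for (String b : guess) {
            Integer n = remaining.get(b);
            if (n != null && n > 0) {
                matched++;
                remaining.put(b, n - 1);
            }
        }
        if (matched == 0) {
            return 0.0;
        }
        double precision = (double) matched / guess.size(); // correct brackets / guessed brackets
        double recall = (double) matched / gold.size();     // correct brackets / gold brackets
        return 2 * precision * recall / (precision + recall);
    }
}
```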
These parsing examples are for the 3 test files supplied with the parser. They assume you are sitting in the root directory of the parser distribution. [GALE ROSETTA: The last illustrates the removal of the IBM "+" and "#" marks mentioned earlier.]
$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactored.ser.gz arabic-onesent-utf8.txt
Loading parser from serialized file arabicFactored.ser.gz ... done [14.3 sec].
Parsing file: arabic-onesent-utf8.txt with 1 sentences.
Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
(ROOT
(S (CC و)
(VP (VBD نشر)
(NP (DTNN العدل))
(PP (IN من)
(NP (NN خلال)
(NP (NN قضاء) (JJ مستقل)))))
(PUNC .)))
Parsed file: arabic-onesent-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (5.15 wds/sec; 0.64 sents/sec).
$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactoredBuckwalter.ser.gz arabic-onesent-buck.txt
Loading parser from serialized file arabicFactoredBuckwalter.ser.gz ... done [9.4 sec].
Parsing file: arabic-onesent-buck.txt with 1 sentences.
Parsing [sent. 1 len. 8]: w n$r AlEdl mn xlAl qDA' mstql .
(ROOT
(S (CC w)
(VP (VBD n$r)
(NP (DTNN AlEdl))
(PP (IN mn)
(NP (NN xlAl)
(NP (NN qDA') (JJ mstql)))))
(PUNC .)))
Parsed file: arabic-onesent-buck.txt [1 sentences].
Parsed 8 words in 1 sentences (7.92 wds/sec; 0.99 sents/sec).
$ cat arabic-onesent-ibm-utf8.txt
و# نشر العدل من خلال قضاء مستقل .
$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper arabicFactored.ser.gz arabic-onesent-ibm-utf8.txt
Loading parser from serialized file arabicFactored.ser.gz ... done [9.3 sec].
Parsing file: arabic-onesent-ibm-utf8.txt with 1 sentences.
Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
(ROOT
(S (CC و)
(VP (VBD نشر)
(NP (DTNN العدل))
(PP (IN من)
(NP (NN خلال)
(NP (NN قضاء) (JJ مستقل)))))
(PUNC .)))
Parsed file: arabic-onesent-ibm-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (5.87 wds/sec; 0.73 sents/sec).
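If you would rather strip the IBM marks yourself before parsing, something like the following sketch could be used. It is not the implementation of IBMArabicEscaper. It assumes that '#' trails a proclitic (as in the test file above) and that '+' leads an enclitic; the latter is an assumption that the example does not show. The class name is hypothetical.

```java
// Hypothetical preprocessor for illustration; not the IBMArabicEscaper implementation.
class IbmMarkStripSketch {
    // Removes a trailing '#' and a leading '+' from each whitespace-separated token.
    static String strip(String line) {
        String[] tokens = line.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (String tok : tokens) {
            if (tok.length() > 1 && tok.endsWith("#")) {
                tok = tok.substring(0, tok.length() - 1);
            }
            if (tok.length() > 1 && tok.startsWith("+")) {
                tok = tok.substring(1);
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(tok);
        }
        return out.toString();
    }
}
```

Tokens that consist of a single '#' or '+' are left alone, on the assumption that they are real punctuation.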
You can ask for dependencies output, with the -outputFormat
dependencies option. At present, there is no typed dependencies
(grammatical relations) analysis available for Arabic, and so asking for
typedDependencies will throw an
UnsupportedOperationException.
(Caution: With UTF-8 Arabic, the
dependencies output may appear to be reversed, because dependencies are
being displayed right-to-left (depending on the bidi support of your
terminal program). But they are correct, really.)
Much of the Arabic-specific code, including the
ArabicHeadFinder and the
ArabicTreebankLanguagePack is defined inside the
edu.stanford.nlp.trees.international.arabic package. But
parser-specific code and the top-level entry point to Arabic language
resources are found in the edu.stanford.nlp.parser.lexparser
package. There, you will find the classes
ArabicTreebankParserParams and ArabicUnknownWordSignatures.
For general questions, see also the
Parser FAQ. Please send any other
questions or feedback, or extensions and bugfixes to
parser-support@lists.stanford.edu.