The grey "GALE ROSETTA" notes are only for people involved in that project; they don't apply to regular users.
Much of the information here is also applicable to the Arabic part of speech tagger, such as discussion of word segmentation and tag sets.
The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). You must provide input to the parser that is tokenized in this way, or the resulting parses will be terrible. We do now have a software component for segmenting Arabic, but you have to download and run it separately; it isn't included in the parser (see the end of this answer). The Arabic parser itself simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it isn't written down anywhere. Segmentation is done based on the morphological analyses generated by the Buckwalter analyzer. [GALE ROSETTA: For IBM-segmented input, use the flag -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper.] Note also that brackets are rendered as -LRB- and -RRB-, and quotation marks appear as straight ASCII quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank). There are some tools available that can do the necessary clitic segmentation; see the end of this answer.
The present release provides a grammar (arabicFactored.ser.gz) for real Arabic, for which the default encoding is UTF-8, but another encoding (such as a legacy Arabic encoding) can be specified on the command line with the -encoding charset flag. (Previous releases also provided grammars for the Buckwalter encoding of Arabic in ASCII: arabicFactoredBuckwalter.ser.gz, together with either atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz, depending on the parser release. They may return if there is interest.)
The parsers are trained on unvocalized Arabic. One grammar (atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on input represented exactly as it is found in the Penn Arabic Treebank. The other grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on a more normalized form of Arabic. This form deletes the tatweel character and diacritics beyond the short vowel markers, normalizes characters that are inconsistently written (Alef with hamza or madda becomes simply Alef, and Alef maksura becomes Yaa), and prefers ASCII characters (Arabic punctuation and number characters are mapped to the corresponding ASCII characters). Your accuracy will suffer unless you normalize text in this way, because words are recognized simply based on string identity.
[GALE ROSETTA: This is precisely the mapping that the IBM
ar_normalize_v5.pl
script does for you.]
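As a rough illustration, the normalization described above can be sketched in Python. This is an illustrative approximation, not the parser's actual normalizer (or the IBM script); the exact inventory of characters handled may differ:

```python
import re

TATWEEL = "\u0640"
# Short vowels, tanween, shadda, sukun (assumed diacritic range)
DIACRITICS = re.compile("[\u064B-\u0652]")
# Alef with madda or hamza (above/below) -> bare Alef
ALEF_VARIANTS = {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}
OTHER = {
    "\u0649": "\u064A",                      # Alef maksura -> Yaa
    "\u060C": ",", "\u061B": ";", "\u061F": "?",  # Arabic punctuation -> ASCII
}
# Arabic-Indic digits -> ASCII digits
ARABIC_DIGITS = {chr(0x0660 + d): str(d) for d in range(10)}

def normalize(text: str) -> str:
    """Approximate the normalization applied to the training data."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    for table in (ALEF_VARIANTS, OTHER, ARABIC_DIGITS):
        for src, dst in table.items():
            text = text.replace(src, dst)
    return text
```

Run your text through a mapping like this (or the IBM script, in the GALE setup) before parsing with the normalized grammars.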
The parser uses an "augmented Bies" tag set. The so-called
"Bies mapping" maps down the full morphological analyses from the
Buckwalter analyzer that appear in the LDC Arabic Treebanks to a subset
of the POS tags used in the Penn English Treebank (but some with different meanings).
We augment this set to represent which words have the determiner "Al" (ال)
cliticized to them. These extra tags start with "DT", and appear for
all parts of speech that can be preceded by "Al", so we have DTNN, DTCD, etc.
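As a toy illustration of this augmentation (hypothetical helper, not the tagger's real code, which decides determiner presence from the Buckwalter morphological analysis):

```python
def augment_bies_tag(bies_tag: str, has_al_determiner: bool) -> str:
    """Prefix 'DT' onto the Bies tag of a word bearing the cliticized 'Al'."""
    return ("DT" + bies_tag) if has_al_determiner else bies_tag
```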
This is an early definition of the Bies mapping. For something more up to date with recent revisions of the Arabic Treebank tag taxonomy, it is also useful to look at the recent documentation and articles. In particular, the Bies mapping is defined in the file http://catalog.ldc.upenn.edu/docs/LDC2010T13/atb1-v4.1-taglist-conversion-to-PennPOS-forrelease.lisp, included with recent ATB releases. This revised version includes a few tags that are not in the English PTB tag set (NOUN_QUANT, ADJ_NUM, and VN); we now include these as well.
The phrasal category set is the one used in the Penn Arabic Treebank. See the original Penn Arabic Treebank Guidelines or, better, the up-to-date Penn Arabic Treebank Guidelines.
The parser download does not include components for normalizing or segmenting Arabic text. You might look at the Stanford Word Segmenter download, or the segmentation tools from CADIM, such as the one available on Mona Diab's homepage (but note that if they also separate off the "Al" (ال) clitic, then you will need to glue it back on in a postprocessing step). [GALE ROSETTA: IBM has an ATB segmenter and a Perl script that does the appropriate normalization. Their segmenter marks proclitics and enclitics with '#' and '+'. These need to be removed for parsing, but we do provide an escaper which does this.]
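For the GALE ROSETTA setup, the effect of that mark removal can be sketched roughly as follows. This is an illustration only, not the actual IBMArabicEscaper code, and it assumes proclitics carry a trailing '#' and enclitics a leading '+', as in the example input later in this answer:

```python
def strip_ibm_marks(tokens):
    """Strip IBM segmenter clitic marks so tokens match plain ATB tokenization.

    Assumes proclitics end in '#' and enclitics start with '+'; the parser's
    -escaper flag with IBMArabicEscaper does this for you internally.
    """
    return [tok.rstrip("#").lstrip("+") for tok in tokens]
```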
Two of the 3 grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on the training data of the "Mona Diab" (a.k.a. "Johns Hopkins 2005 Workshop") data splits of parts 1-3 of the Penn Arabic Treebank. The other grammar (atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on a decimation of the ATBp3 treebank data. (That is, heading sentence-by-sentence through the trees, you put 8 sentences in training, 1 in development, and 1 in test, and then repeat.) This is the data split that has been used at UPenn (see S. Kulick et al., TLT 2006).
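The decimation scheme can be sketched as follows (illustrative Python; the function name is ours):

```python
def decimate(sentences):
    """8/1/1 decimation split: cycling through the corpus in order,
    sentences 0-7 of each block of 10 go to training, sentence 8 to
    development, and sentence 9 to test."""
    train, dev, test = [], [], []
    for i, sent in enumerate(sentences):
        r = i % 10
        (train if r < 8 else dev if r == 8 else test).append(sent)
    return train, dev, test
```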
The table below shows the parser's performance on the development test data sets, as defined above. Here, "factF1" is the Parseval F1 of Labeled Precision and Recall, and "factDA" is the dependency accuracy of the factored parser (based on untyped dependencies imputed from "head rules"). This is for sentences of 40 words or less, and discarding "sentences" (the bylines at the start of articles) that are just an "X" constituent. The performance of the UTF-8 and Buckwalter grammars is basically identical, because only the character encoding is different (and so it is not shown separately). Note that we do get value from the extra data in parts 1 and 2 of the ATB (more value than it first appears, because a decimation data split is always advantageous to a parser), and that dependency accuracy is relatively better than constituency accuracy (we regard this as evidence of inconsistent constituency annotation in the ATB).
grammar                         factF1  factDA  factEx  pcfgF1  depDA  factTA   num
arabicFactored.ser.gz            77.44   84.05   13.27   69.49  80.07   96.09  1567
atb3FactoredBuckwalter.ser.gz    75.76   83.08   14.41   68.09  77.75   95.91   951
Sure! These parsing examples are for the 3 test files supplied with the parser. They assume you are sitting in the root directory of the parser distribution. [GALE ROSETTA: The last illustrates the removal of the IBM "+" and "#" marks mentioned earlier.]
$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactored.ser.gz arabic-onesent-utf8.txt
Loading parser from serialized file arabicFactored.ser.gz ... done [14.3 sec].
Parsing file: arabic-onesent-utf8.txt with 1 sentences.
Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
(ROOT (S (CC و) (VP (VBD نشر) (NP (DTNN العدل)) (PP (IN من) (NP (NN خلال) (NP (NN قضاء) (JJ مستقل))))) (PUNC .)))
Parsed file: arabic-onesent-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (5.15 wds/sec; 0.64 sents/sec).

$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactoredBuckwalter.ser.gz arabic-onesent-buck.txt
Loading parser from serialized file arabicFactoredBuckwalter.ser.gz ... done [9.4 sec].
Parsing file: arabic-onesent-buck.txt with 1 sentences.
Parsing [sent. 1 len. 8]: w n$r AlEdl mn xlAl qDA' mstql .
(ROOT (S (CC w) (VP (VBD n$r) (NP (DTNN AlEdl)) (PP (IN mn) (NP (NN xlAl) (NP (NN qDA') (JJ mstql))))) (PUNC .)))
Parsed file: arabic-onesent-buck.txt [1 sentences].
Parsed 8 words in 1 sentences (7.92 wds/sec; 0.99 sents/sec).

$ cat arabic-onesent-ibm-utf8.txt
و# نشر العدل من خلال قضاء مستقل .

$ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper arabicFactored.ser.gz arabic-onesent-ibm-utf8.txt
Loading parser from serialized file arabicFactored.ser.gz ... done [9.3 sec].
Parsing file: arabic-onesent-ibm-utf8.txt with 1 sentences.
Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
(ROOT (S (CC و) (VP (VBD نشر) (NP (DTNN العدل)) (PP (IN من) (NP (NN خلال) (NP (NN قضاء) (JJ مستقل))))) (PUNC .)))
Parsed file: arabic-onesent-ibm-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (5.87 wds/sec; 0.73 sents/sec).
You can ask for dependencies output with the -outputFormat dependencies option. At present, there is no typed dependencies (grammatical relations) analysis available for Arabic, and so asking for typedDependencies will throw an UnsupportedOperationException.
(Caution: With UTF-8 Arabic, the
dependencies output may appear to be reversed, because dependencies are
being displayed right-to-left (depending on the bidi support of your
terminal program). But they are correct, really.)
Much of the Arabic-specific code, including the ArabicHeadFinder and the ArabicTreebankLanguagePack, is defined inside the edu.stanford.nlp.trees.international.arabic package. But parser-specific code and the top-level entry to Arabic language resources are found in the edu.stanford.nlp.parser.lexparser package; there you will find the classes ArabicTreebankParserParams and ArabicUnknownWordSignatures.
For general questions, see also the Parser FAQ. Please send any other questions, feedback, extensions, or bugfixes to parser-user@lists.stanford.edu or parser-support@lists.stanford.edu.