Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention in modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.
Stanford Arabic Word Segmenter - Applies Arabic Treebank (ATB) clitic segmentation and orthographic normalization to raw Arabic text. The segmenter is based on a conditional random field (CRF) sequence classifier, so decoding is very fast. This makes it appropriate for processing large amounts of text (such as machine translation corpora).
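To give a rough idea of what orthographic normalization of raw Arabic text can involve, here is a minimal Python sketch. It is not the segmenter's code, and the specific rules (stripping short-vowel diacritics, removing tatweel, collapsing alef variants to bare alef) are assumptions chosen for illustration; the segmenter's actual normalization may differ. The function name `normalize` is hypothetical.

```python
import re

# Illustrative rules only (assumed, not taken from the Stanford segmenter).
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
})
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"  # elongation character used for justification


def normalize(text: str) -> str:
    """Apply the assumed normalization rules to a string of Arabic text."""
    text = DIACRITICS.sub("", text)      # drop diacritics
    text = text.replace(TATWEEL, "")     # drop tatweel
    return text.translate(ALEF_VARIANTS)  # unify alef forms


if __name__ == "__main__":
    # A vocalized word with an initial alef-hamza, written with escapes.
    print(normalize("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
```

Normalizing in this way reduces the number of distinct surface forms, which is generally helpful before a sequence classifier decides where clitic boundaries fall.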
Below is a list of our publications that either deal with Arabic directly or use it as an experimental subject.
Spence Green, Sida Wang, Daniel Cer, and Christopher D. Manning. 2013. Fast and Adaptive Online Training of Feature-Rich Translation Models. In ACL. [pdf]
Spence Green, Marie-Catherine de Marneffe, and Christopher D. Manning. 2013. Parsing Models for Identifying Multiword Expressions. In Computational Linguistics. [pdf]
Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL. [pdf]
Spence Green and Christopher D. Manning. 2010. Better Arabic Parsing: Baselines, Evaluations, and Analysis. In COLING.
Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved Models of Distortion Cost for Statistical Machine Translation. In NAACL-HLT.
Spence Green, Conal Sathi, and Christopher D. Manning. 2009. NP Subject Detection in Verb-Initial Arabic Clauses. In Proceedings of the Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3).
Michel Galley, Spence Green, Daniel Cer, Pi-Chuan Chang, and Christopher D. Manning. 2009. Stanford University's Arabic-to-English Statistical Machine Translation System for the 2009 NIST Evaluation. In The 2009 NIST Open Machine Translation Evaluation Meeting, Ottawa, Canada. [pdf]
Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In ACL.
For more information, please contact Spence Green at