Arabic Natural Language Processing

Overview

Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention in modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.

Software

Stanford Arabic Parser - The full distribution includes a model trained on the most recent releases of the first three parts of the Penn Arabic Treebank (ATB). These corpora contain newswire text. Arabic-specific parsing instructions, a FAQ, and a recommended train/dev/test split of the ATB are also available. The parser expects segmented text as input. If you want to parse raw text, then you must pre-process it with the Stanford Arabic Word Segmenter.
Stanford Arabic Word Segmenter - Applies ATB clitic segmentation and orthographic normalization to raw Arabic text. The segmenter is based on a conditional random field (CRF) sequence classifier, so decoding is very fast. It is appropriate for processing large amounts of text (such as machine translation corpora).
Stanford Arabic Part of Speech Tagger - The full distribution comes with a model trained on the ATB.
Tregex/TregexGUI - A regular expression package for parse trees. Useful for browsing and searching the ATB. Supports Unicode (UTF-8) input and display.
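
To give a sense of what "orthographic normalization" and "clitic segmentation" mean for raw Arabic text, here is a toy Python sketch. It is not the Stanford Arabic Word Segmenter and does not reproduce its rules or output format. The function names (normalize, split_leading_wa) and the specific normalization choices (dropping diacritics and tatweel, collapsing alef variants) are assumptions made for illustration only.

```python
import re

# Short-vowel marks and related diacritics (U+064B..U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely typographic elongation character.
TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below; all mapped to bare alef here.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"
# The conjunction proclitic "wa" ("and"), written as the single letter waw.
WAW = "\u0648"


def normalize(text: str) -> str:
    """Toy orthographic normalization (illustrative choices, not the ATB standard)."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub(BARE_ALEF, text)


def split_leading_wa(token: str) -> list[str]:
    """Naively split a leading waw from tokens of four or more characters.

    This rule is deliberately simplistic: it also mis-splits words whose stem
    begins with waw, which is one reason real segmenters use a trained
    sequence model instead of string rules.
    """
    if token.startswith(WAW) and len(token) >= 4:
        return [WAW, token[1:]]
    return [token]


def toy_segment(text: str) -> list[str]:
    """Normalize, split on whitespace, then apply the naive clitic rule."""
    segments: list[str] = []
    for token in normalize(text).split():
        segments.extend(split_leading_wa(token))
    return segments
```

Calling toy_segment on a whitespace-delimited string returns a flat list of segments. Words such as "wazir" (minister), whose first letter is part of the stem, are split incorrectly by this rule, which illustrates why a statistical model like the CRF mentioned above is used in practice.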

People

Papers

Below is a list of our publications that either deal with Arabic directly or use it as an experimental subject.

Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word Segmentation of Informal Arabic with Domain Adaptation. In Association for Computational Linguistics (ACL). [pdf]
Spence Green, Sida Wang, Daniel Cer, and Christopher D. Manning. 2013. Fast and Adaptive Online Training of Feature-Rich Translation Models. In ACL. [pdf]
Spence Green, Marie-Catherine de Marneffe, and Christopher D. Manning. 2013. Parsing Models for Identifying Multiword Expressions. In Computational Linguistics. [pdf]
Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL. [pdf]
Spence Green and Christopher D. Manning. 2010. Better Arabic Parsing: Baselines, Evaluations, and Analysis. In COLING. [pdf]
Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved Models of Distortion Cost for Statistical Machine Translation. In NAACL-HLT. [pdf]
Spence Green, Conal Sathi, and Christopher D. Manning. 2009. NP Subject Detection in Verb-Initial Arabic Clauses. In Proceedings of the Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3). [pdf]
Michel Galley, Spence Green, Daniel Cer, Pi-Chuan Chang, and Christopher D. Manning. 2009. Stanford University's Arabic-to-English Statistical Machine Translation System for the 2009 NIST Evaluation. In The 2009 NIST Open Machine Translation Evaluation Meeting, Ottawa, Canada. [pdf]
Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In ACL. [pdf]
Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In NAACL-HLT. [pdf]

Contact Information

For more information, please contact Spence Green at