

Stanford Phrasal User Guide

This guide explains how to set up and train a phrase-based Statistical Machine Translation system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools.

Phrasal is designed for fast training of both traditional Moses-style machine translation (MT) models and large-scale, discriminative translation models with:

  • An intuitive feature engineering API
  • Large-scale learning with AdaGrad+FOBOS
  • Fast search with cube pruning
  • Unpruned language modeling with KenLM

This guide assumes that you are building an MT system on a Unix-like system, and assumes some familiarity with Unix-like command-line interpreters (or shells), such as bash for Linux and Mac OS X and Cygwin for Microsoft Windows. The commands in this tutorial are written for bash, but it is relatively easy to adapt them to other shells. While the core Phrasal MT decoder will run anywhere that Java will run, this is not true for many of the support tools and scripts used to train MT systems.

For background information on statistical machine translation, please see the StatMT site or Philipp Koehn's textbook.

System Requirements

Phrasal is written for Java 1.6+. If your JVM is older than 1.6, then you should upgrade. Many of the utility scripts included with Phrasal assume Python 2.7+ (but not Python3).


Questions about Phrasal should be directed to the JavaNLP mailing list.


If you use Stanford Phrasal for research purposes, then we ask that you cite these two papers:

 author = {Spence Green and Daniel Cer and Christopher D. Manning},
 title = {Phrasal: A Toolkit for New Directions in Statistical Machine Translation},
 booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
 year = {2014}

 author    = {Green, Spence  and  Wang, Sida  and  Cer, Daniel  and  Manning, Christopher D.},
 title     = {Fast and Adaptive Online Training of Feature-Rich Translation Models},
 booktitle = {Proceedings of ACL},
 year      = {2013}

The first paper describes the core decoder. The second covers the parameter learning algorithm.

If you use the hierarchical lexicalized re-ordering model (which is enabled in the system described in this tutorial), then please cite:

 author = {Michel Galley and Christopher D. Manning},
 title = {A Simple and Effective Hierarchical Phrase Reordering Model},
 booktitle = {Proceedings of EMNLP},
 year = {2008}


Download the latest version of Phrasal and the latest version of Stanford CoreNLP.

The Phrasal download file has the format phrasal.ver.tar.gz where ver is the current version number. Assume that you download to the directory $HOME=/home/me.

The CoreNLP download file has the format where date is the release date. You will need to substitute the correct values for the version you downloaded into the commands below.

Change to your download directory and unpack the archives:

cd $HOME
tar -xzf phrasal.ver.tar.gz
cd phrasal.ver

Add the Phrasal and CoreNLP jar files to your Java CLASSPATH in the following order:

export CLASSPATH=$CLASSPATH:$HOME/stanford-corenlp-full-date/stanford-corenlp-ver.jar:$HOME/phrasal.ver/phrasal.ver.jar

Add the scripts directory to your shell PATH:

export PATH="$PATH":$HOME/phrasal.ver/scripts
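
A quick sanity check can catch path mistakes before they surface later as Java errors. The sketch below is not part of Phrasal, and the function name check_classpath is hypothetical. It splits a colon-separated class path and reports every entry that does not exist on disk.

```shell
# Hypothetical helper (not part of Phrasal): print each entry of a
# colon-separated class path that does not exist on disk.
# Returns 0 if every entry exists, non-zero otherwise.
# Note: Java wildcard entries such as "lib/*" are reported as missing.
check_classpath() {
  printf '%s\n' "$1" | tr ':' '\n' | {
    missing=0
    while IFS= read -r entry; do
      # Skip empty entries, e.g. from a leading ':' when CLASSPATH was unset.
      [ -n "$entry" ] || continue
      if [ ! -e "$entry" ]; then
        echo "missing: $entry"
        missing=1
      fi
    done
    [ "$missing" -eq 0 ]
  }
}

# Example: check the class path of the current shell.
check_classpath "${CLASSPATH:-}" || echo "Fix the entries listed above."
```

If the check reports a missing jar, compare the version and date in your export line against the names of the archives you actually unpacked.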

Compiling Phrasal

If you change the Phrasal source code, then you'll need to recompile it. The distribution includes an ant script: build.xml. Windows users will need to edit that script to specify a path to Cygwin. Open the script and look for the "TODO" statement. Unix users can simply change to the distribution root and execute:


(Note: in OS X 10.9, ant is no longer available by default. You will need to install it, either directly from the Apache web site or with a package manager such as Homebrew or MacPorts.)

Language Modeling

Phrasal comes with a Java language model query implementation, but does not include a tool for estimating language models. To build language models, we recommend that you use KenLM. You can download the latest from the KenLM site, or there is a copy of KenLM in the src-cc folder of the Phrasal download. Both the KenLM site and the download package (or the src-cc/kenlm folder we include) contain installation instructions. (It's written in C++ and requires Boost.)

Later on in this tutorial we will use KenLM's lmplz tool to build language models.

The Phrasal Java-based language model loader can load the ARPA format files created by lmplz, but it is quite inefficient for large language models (in terms of both speed and memory use). Phrasal includes a JNI loader for the efficient C++-based KenLM that can be compiled separately. You must first install a JDK from Oracle. Next, make sure to set the JAVA_HOME environment variable. Place KenLM in $HOME/phrasal.ver/src-cc/kenlm. Finally, compile the loader:

cd $HOME/phrasal.ver/src-cc

In case you are using Mac OS X and you are getting a compilation error caused by a missing jni.h file, replace this line in

$CXX -I. -DKENLM_MAX_ORDER=7 -I$JAVA_HOME/include -Ikenlm/ \
    -I$JAVA_HOME/include/linux kenlm/lm/*.o kenlm/util/*.o \
    kenlm/util/double-conversion/*.o -shared -o libPhrasalKenLM$SUFFIX $CXXFLAGS $LDFLAGS \
    $extra_flags $RT

with the following line, which changes only the first include path:

$CXX -I. -DKENLM_MAX_ORDER=7 -I"/System/Library/Frameworks/JavaVM.framework/Headers" -Ikenlm/ \
    -I$JAVA_HOME/include/linux kenlm/lm/*.o kenlm/util/*.o \
    kenlm/util/double-conversion/*.o -shared -o libPhrasalKenLM$SUFFIX $CXXFLAGS $LDFLAGS \
    $extra_flags $RT

To activate this loader, add the "kenlm:" prefix to the language model path in the Phrasal ini file.

Word Alignments

For word alignment, we recommend the Berkeley Aligner. The Berkeley download contains installation and usage instructions. You can also use another compatible aligner, such as GIZA++.

To run symmetrization heuristics like grow-diag during phrase extraction, you'll need to configure the Berkeley Aligner to produce A3 files. Add the parameter writeGIZA to the Berkeley configuration file. Here is an example configuration file.

Data Download and Pre-processing

Download the training, development, and test data from the WMT 2013 shared task site:

Unpack the files in a convenient location. Assume that we will build a system for French (fr) to English (en) translation, so we'll need the following files:

Parallel data:


Monolingual data:

  • europarl-v7.en

Development data:

  • newstest2011.en

Test data:

  • newstest2012.en


  • tokenizer.perl
  • lowercase.perl

To prepare the data, first tokenize and lowercase the parallel, monolingual, development, and test data using the scripts, e.g.:

cat | tokenizer.perl -l en | lowercase.perl >

Each processed file should have the .tok extension.
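
Because the same pipeline runs on every file, it can help to wrap it in a small function. The following is a minimal sketch, not a script shipped with Phrasal: the function name preprocess and the variables TOKENIZER and LOWERCASER are assumptions made for illustration, and the output name is simply the input name with .tok appended.

```shell
# Hypothetical wrapper around the tokenize-and-lowercase pipeline.
# TOKENIZER and LOWERCASER are assumed variable names; they default to the
# two WMT scripts and can be overridden.
TOKENIZER="${TOKENIZER:-tokenizer.perl}"
LOWERCASER="${LOWERCASER:-lowercase.perl}"

# Usage: preprocess LANG FILE...
# Writes FILE.tok next to each input FILE.
preprocess() {
  lang="$1"
  shift
  for f in "$@"; do
    "$TOKENIZER" -l "$lang" < "$f" | "$LOWERCASER" > "$f.tok" || return 1
  done
}

# Example (commented out; requires the WMT scripts on your PATH):
# preprocess en europarl-v7.en newstest2011.en newstest2012.en
```

Run the function once per language so that each file is tokenized with the matching -l value.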

For the bitext, run the corpus cleaner included with Phrasal:

This command will produce two files:


Word Alignment

Align the bitext with the Berkeley aligner (or another aligner like GIZA++). The Berkeley aligner distribution contains an example configuration file.

In either case, you'll need the A3 files created by the aligner for phrase extraction later on. Assume that you have these files:

Language Model Estimation

Estimate a language model from both the monolingual data and the target-side of the parallel data:

lmplz -o 4 < europarl-v7.en.tok >
build_binary trie 4gm.bin

System Training

Create a new directory for the MT system and copy over the system tuning files:

mkdir $HOME/
cd $HOME/
cp $HOME/phrasal.ver/example/example.vars fr-en.vars
cp $HOME/phrasal.ver/example/example.ini fr-en.ini
cp $HOME/phrasal.ver/example/example.binwts fr-en.initial.binwts

You might also want to copy over the language model:

cp 4gm.bin $HOME/

Open the vars file and read the comments. If you've followed the instructions carefully until now, then all of the filenames and paths should match. However, you should verify the files and paths before proceeding.

Tuning and evaluation consist of six steps, which are configured for convenience in a script. To see the steps, run:

If your PATH is configured correctly, you should see the following output:

Usage: var_file steps ini_file sys_name
Use dashes and commas in the steps specification e.g. 1-3,6
Step definitions:
 1  Extract phrases from dev set
 2  Run tuning
 3  Extract phrases from test set
 4  Decode test set
 5  Output results file
 6  Generate a learning curve from an online run

To train and evaluate a system, which we will call "baseline," run this command: fr-en.vars 1-6 fr-en.ini baseline

An explanation of each step along with associated parameters in the vars file follows.

Extract phrases from dev set

Purpose: Extract translation rules from the parallel data for the development set.

Phrase extraction parameters:

EXTRACT_SET -- Specifies the bitext files and symmetrization heuristic (default heuristic is grow-diag).
THREADS_EXTRACT -- Number of threads to use for phrase extraction.
MAX_PHRASE_LEN -- Maximum source phrase length.
OTHER_EXTRACT_OPTS -- Other options that are described in
LO_ARGS -- Parameters for the lexicalized re-ordering model.

Dev set parameters

TUNE_SET_NAME -- The name of the tuning/dev set (e.g., newstest2011).
TUNE_SET -- The actual filename of the dev set (e.g.,
TUNE_REF -- The reference file of the dev set (e.g., newstest2011.en.tok).

Run tuning

Purpose: Estimate model weights from the dev set.


TUNE_MODE -- The tuning algorithm. Here we'll use online. The Advanced features section describes how to enable batch tuning with MERT.
INITIAL_WTS -- The initial weights file.
TUNE_NBEST -- The n-best list size.
ONLINE_OPTS -- Algorithm-specific options for the online tuner.

Extract phrases from test set

Purpose: Extract translation rules from the parallel data for the test set.


DECODE_SET_NAME -- Name of the test set (e.g., newstest2012).
DECODE_SET -- The actual filename of the test set (e.g.,
NBEST -- n-best list size to generate (optional)

Decode test set

Purpose: Decode (translate) the test set.

Output results file

Purpose: Evaluate translation quality using BLEU. This step outputs a .bleu file for the test set.


REFDIR -- The base path of reference directory.

The reference directory must have a particular format. For example, here is how to make a reference directory for newstest2012:

mkdir -p $HOME/refs/newstest2012
cp newstest2012.en.tok $HOME/refs/newstest2012/ref0

The reference filenames must have the prefix ref. Multiple references can be specified by naming the files e.g. ref0,ref1,ref2 etc.
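
When a test set has several references, the naming convention is easy to automate. The sketch below is one possible helper under the convention just described; the function name make_refdir is hypothetical and not part of Phrasal.

```shell
# Hypothetical helper: build the directory REFDIR/NAME containing
# ref0, ref1, ... from the given reference files, in the order given.
# Usage: make_refdir REFDIR NAME REF_FILE...
make_refdir() {
  refdir="$1"
  name="$2"
  shift 2
  mkdir -p "$refdir/$name" || return 1
  i=0
  for ref in "$@"; do
    cp "$ref" "$refdir/$name/ref$i" || return 1
    i=$((i + 1))
  done
}

# Example (commented out; assumes the tokenized reference file exists):
# make_refdir "$HOME/refs" newstest2012 newstest2012.en.tok
```

With this layout, REFDIR would be set to the base path ($HOME/refs in the example), not to the per-test-set subdirectory.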

Generate a learning curve from an online run

Purpose: Evaluate weight vectors produced during each training iteration on the test set.

This step outputs a .learn-curve file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.

Running the above steps correctly on our system results in the following learning curve on newstest2012:

iter    weight vector                           newstest2012 BLEU
0    23.87
1    24.18
2    24.16
3    24.09
4    24.14
5    24.16
6    24.18
7    24.09

The system is very stable on held-out data, achieving a maximum BLEU score of 24.18 at the end of iteration 1. This result compares fairly well to the WMT results given that we only used a fraction of the data. We ran with 16 threads, and each of the eight learning iterations lasted about four minutes:

28 min. -- newstest2011 rule extraction
26 min. -- newstest2012 rule extraction
32 min. -- tuning on newstest2011
 1 min. -- decoding newstest2012
87 min. total system train/tune time
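
To pick the best iteration from a learning curve without reading it by eye, a short awk filter is enough. The sketch below assumes whitespace-separated columns with the iteration number first and the BLEU score last, which matches the table above but is an assumption about the actual .learn-curve file. The function name best_iteration is hypothetical.

```shell
# Hypothetical helper: print the iteration with the highest BLEU score.
# Assumes the iteration number is the first column and the BLEU score the
# last column; non-numeric lines such as a header are skipped.
# On ties, the earliest iteration wins.
best_iteration() {
  awk '$1 ~ /^[0-9]+$/ && $NF ~ /^[0-9.]+$/ {
         if (!seen || $NF + 0 > best) { best = $NF + 0; iter = $1; seen = 1 }
       }
       END { if (seen) printf "%s\t%.2f\n", iter, best }' "$@"
}

# Example using the scores from the learning curve above.
best_iteration <<'EOF'
0 23.87
1 24.18
2 24.16
3 24.09
4 24.14
5 24.16
6 24.18
7 24.09
EOF
# prints iteration 1 with BLEU 24.18
```

Selecting a weight vector this way uses the test set for model selection, so report the result on a separate held-out set if you need an unbiased score.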


The script generates three log files:

  • $HOME/
  • $HOME/
  • $HOME/

The first log contains output from step 2. You can see the tuning objective function score by searching for "BLEU", e.g.:

grep BLEU

BLEU scores should increase from one epoch to the next.

The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.
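
A small wrapper around grep makes the exception check repeatable. This is a minimal sketch; the function name find_exceptions is hypothetical, and you need to pass it the paths of your own log files.

```shell
# Hypothetical helper: list lines that mention a Java exception in the given
# log files, prefixed with file name and line number.
# Returns 0 if at least one such line was found.
find_exceptions() {
  # The extra /dev/null argument makes grep print file names even when
  # only one log file is given.
  grep -n 'Exception' "$@" /dev/null
}

# Example (commented out; substitute the paths of your own logs):
# find_exceptions path/to/dev.stdout.log path/to/test.stdout.log
```

An empty result with a non-zero exit status means that no line in the given logs mentions an exception.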

You can also inspect the intermediate weight files generated by the learning algorithm:


Advanced/Additional Features

This section describes features for advanced users who wish to maximize translation quality.

Word Classes

Phrasal 3.4 comes with several featurizers that use word classes. To use these featurizers, you need a mapping from words to classes for the source and the target language. Phrasal includes an implementation of a very fast word clustering algorithm that allows you to train word classes on corpora containing billions of tokens within a few hours, which is up to three orders of magnitude faster than other popular tools:

Algorithm (implementation)   threads   time (min.sec)
Brown (wcluster)             1         1023.39
Clark (cluster_neyessen)     1          890.11
Och (mkcls)                  1          199.04
Our algorithm                8            2.42

Wallclock time to generate a mapping from a vocabulary of 63k English words (3.7M tokens) to 512 classes.

If you use our implementation for research purposes, then we ask that you cite this paper:

 author = {Spence Green and Daniel Cer and Christopher D. Manning},
 title = {An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation},
 booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
 year = {2014}

To train the word classes, run on all the data for each language: 10g 1 -name en_cls europarl-v7.en.tok > en.cls 10g 1 -name fr_cls > fr.cls

To use the word classes in any featurizer you have to add the mappings to your ini file:



Feature Engineering API

Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. Like cdec, feature functions can be loaded dynamically without recompiling the whole system. In Phrasal, this is accomplished via reflection.

The Feature API Tutorial describes the interfaces in the API, describes example features, and provides tips for writing feature templates. MT features are tricky. Bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.

Baseline "dense" features are located in

Examples of discriminative "sparse" features can be found in

Feature templates are loaded via reflection in the .ini file as follows:


More details on Phrasal discriminative features

Statistical Significance Testing

Phrasal includes an implementation of the permutation test described by Riezler and Maxwell. To obtain p-values for a pair of system outputs, run:

java [bleu|ter] reference_prefix system1 system2

where reference_prefix is a common filename prefix for multiple references.
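
To give a feel for how this kind of test works, the sketch below runs an approximate randomization test on per-sentence scores. It is a simplified stand-in and not Phrasal's implementation: it assumes a two-column input with one sentence-level score per system on each line, whereas the Phrasal tool works on corpus-level BLEU or TER. The function name approx_randomization and its arguments are hypothetical.

```shell
# Hypothetical sketch of an approximate randomization (permutation) test.
# Input: one line per sentence with two numeric columns, the score of
# system 1 and the score of system 2.
# Usage: approx_randomization [TRIALS] [SEED] < scores.txt
approx_randomization() {
  awk -v trials="${1:-10000}" -v seed="${2:-1}" '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN { srand(seed) }
    NF >= 2 { n++; diff[n] = $1 - $2; observed += diff[n] }
    END {
      observed = abs(observed)
      extreme = 0
      for (t = 0; t < trials; t++) {
        shuffled = 0
        # Swapping the two outputs for a sentence flips the sign of its
        # score difference; each sentence is swapped with probability 0.5.
        for (i = 1; i <= n; i++)
          shuffled += (rand() < 0.5) ? diff[i] : -diff[i]
        if (abs(shuffled) >= observed) extreme++
      }
      # Add-one smoothing keeps the estimated p-value above zero.
      printf "%.4f\n", (extreme + 1) / (trials + 1)
    }'
}
```

The printed value estimates how often a random reassignment of outputs produces a score difference at least as large as the observed one, so small values suggest that the difference between the two systems is unlikely to be due to chance.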

Advanced Parameters

Additional phrase extraction options are passed via the OTHER_EXTRACT_OPTS in the vars file to See the usage and javadocs in that package for a description of the options.

Learning options are passed via ONLINE_OPTS in the vars file to See the usage and javadocs in that package for a description of the options.

Phrasal decoder options are specified in the .ini file. The decoder supports popular functions such as force decoding, dropping unknown words, larger beam sizes, etc. For the full list of options, see the javadocs in

MERT (Batch) Training

Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:

 SEED -- Random seed for MERT
 N_STARTING_POINTS -- Number of random starting points.
 TUNE_NBEST -- n-best list size
 OBJECTIVE -- objective function (e.g., bleu)
 THREADS -- Number of MERT threads
 OPT_FLAGS -- MERT configuration parameters (see scripts/ for a full list of options)

More details on MERT options

Arabic-English Translation

Arabic-English is a common research language pair in the United States. The steps in this tutorial can be used to build an Ar-En system with the exception of Arabic pre-processing and segmentation. The Stanford NLP group provides a free tokenizer/segmenter for Arabic. Run it on the Arabic side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.

Chinese-English Translation

Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a free tokenizer/segmenter for Chinese. Run it on the Chinese side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.