Difference between revisions of "Software/Phrasal"

From NLPWiki
(Stanford Phrasal User Guide)
(31 intermediate revisions by 5 users not shown)
Line 1: Line 1:
= Phrasal User Guide =
+
= Stanford Phrasal User Guide =
  
 
This guide explains how to set up and train a phrase-based Statistical Machine
 
This guide explains how to set up and train a phrase-based Statistical Machine
Translation system using <b>Phrasal</b>. It offers step-by-step
+
Translation system using <b>[http://nlp.stanford.edu/phrasal/ Phrasal]</b>. It offers step-by-step
 
instructions to download, install, configure, and run the Phrasal decoder and
 
instructions to download, install, configure, and run the Phrasal decoder and
 
its related support tools.
 
its related support tools.
  
Phrasal is designed for fast training of large-scale, discriminative translation models with:
+
Phrasal is designed for fast training of both traditional Moses-style machine translation (MT) models and large-scale, discriminative translation models with:
 
* An intuitive feature engineering API
 
* An intuitive feature engineering API
 
* Large-scale learning with AdaGrad+FOBOS
 
* Large-scale learning with AdaGrad+FOBOS
Line 12: Line 12:
 
* Unpruned language modeling with KenLM
 
* Unpruned language modeling with KenLM
  
This guide assumes some familiarity with Unix-like [http://en.wikipedia.org/wiki/Unix_shell command-line interpreters] (or shells), such as [http://www.gnu.org/software/bash/ bash] for Linux and Mac OS X and [http://www.cygwin.com Cygwin] for Microsoft Windows. The commands in this tutorial are written for <tt>bash</tt>, but it is relatively easy to adapt
+
This guide assumes that you are building an MT system on a Unix-like system, and assumes some familiarity with Unix-like [http://en.wikipedia.org/wiki/Unix_shell command-line interpreters] (or shells), such as [http://www.gnu.org/software/bash/ bash] for Linux and Mac OS X and [http://www.cygwin.com Cygwin] for Microsoft Windows. The commands in this tutorial are written for <tt>bash</tt>, but it is relatively easy to adapt
them to other shells.
+
them to other shells. While the core Phrasal MT decoder will run anywhere that Java will run, this is not true for many of the support tools and scripts used to train MT systems.
  
 
For background information on statistical machine translation, please see [http://www.statmt.org/ the StatMT site] or [http://www.amazon.com/Statistical-Machine-Translation-Philipp-Koehn/dp/0521874157/ Philipp Koehn's textbook].
 
For background information on statistical machine translation, please see [http://www.statmt.org/ the StatMT site] or [http://www.amazon.com/Statistical-Machine-Translation-Philipp-Koehn/dp/0521874157/ Philipp Koehn's textbook].
  
 +
== System Requirements ==
 
Phrasal is written for Java 1.6+. If your JVM is older than 1.6, then you should [http://www.java.com upgrade]. Many of the utility scripts included with Phrasal assume Python 2.7+ (but not Python3).
 
Phrasal is written for Java 1.6+. If your JVM is older than 1.6, then you should [http://www.java.com upgrade]. Many of the utility scripts included with Phrasal assume Python 2.7+ (but not Python3).
  
 +
== Support ==
 
Questions about Phrasal should be directed to the [http://mailman.stanford.edu/mailman/listinfo/java-nlp-user JavaNLP mailing list].
 
Questions about Phrasal should be directed to the [http://mailman.stanford.edu/mailman/listinfo/java-nlp-user JavaNLP mailing list].
  
If you use Phrasal for research purposes, then we ask that you cite these two papers:
+
== Citations ==
 +
 
 +
If you use Stanford Phrasal for research purposes, then we ask that you cite these two papers:
 
  @inproceedings{Cer2010,
 
  @inproceedings{Cer2010,
 
   author = {Cer, Daniel and Galley, Michel and Jurafsky, Daniel and Manning, Christopher D.},
 
   author = {Cer, Daniel and Galley, Michel and Jurafsky, Daniel and Manning, Christopher D.},
Line 38: Line 42:
 
The first paper describes the core decoder. The second covers the parameter learning algorithm.
 
The first paper describes the core decoder. The second covers the parameter learning algorithm.
  
If you enable the lexicalized re-ordering model (which is enabled in the system described in this tutorial), then please cite:
+
If you use the lexicalized re-ordering model (which is enabled in the system described in this tutorial), then please cite:
 
  @inproceedings{Galley2008,
 
  @inproceedings{Galley2008,
 
   author = {Michel Galley and Christopher D. Manning},
 
   author = {Michel Galley and Christopher D. Manning},
Line 52: Line 56:
 
The Phrasal download file has the format <tt>phrasal.ver.tar.gz</tt> where <tt>ver</tt> is the current version number. Assume that you download to the directory <tt>$HOME=/home/me</tt>.
 
The Phrasal download file has the format <tt>phrasal.ver.tar.gz</tt> where <tt>ver</tt> is the current version number. Assume that you download to the directory <tt>$HOME=/home/me</tt>.
  
The CoreNLP download file has the format <tt>stanford-corenlp-full-date.zip</tt> where <tt>date</tt> is the release date.
+
The CoreNLP download file has the format <tt>stanford-corenlp-full-date.zip</tt> where <tt>date</tt> is the release date. You will need to substitute the correct values for the version you downloaded into the commands below.
  
 
Change to your download directory and unpack the archives:
 
Change to your download directory and unpack the archives:
Line 66: Line 70:
 
  export PATH="$PATH":$HOME/phrasal.ver/scripts
 
  export PATH="$PATH":$HOME/phrasal.ver/scripts
  
Now build the C++ code for KenLM:
+
=== Compiling Phrasal ===
 +
If you change the Phrasal source code, then you'll need to recompile it. The distribution includes an ant script:  <tt>build.xml</tt>. Windows users will need to edit that script to specify a path to Cygwin. Open the script and look for the "TODO" statement. Unix users can simply change to the distribution root and execute:
 +
ant
 +
(Note: in OS X 10.9, ant is no longer available by default. You will need to install it, either directly from the Apache [http://ant.apache.org/bindownload.cgi web site] or by using systems like HomeBrew or MacPorts.)
 +
 
 +
=== Language Modeling ===
 +
 
 +
Phrasal comes with a Java language model query implementation, but does not include a tool for estimating language models. To build language models,  we recommend that you use  [http://kheafield.com/code/kenlm/ KenLM]. You can download the latest from the KenLM site, or there is a copy of KenLM in the src-cc folder of the Phrasal download. Both the KenLM site and the download package (or the src-cc/kenlm folder we include) contain installation instructions. (It's written in C++ and requires Boost.)
 +
 
 +
Later on in this tutorial we will use KenLM's <tt>lmplz</tt> tool to build language models.
 +
 
 +
The Phrasal Java-based language model loader can load the ARPA format files created by <tt>lmplz</tt>, but it is quite inefficient for large language models (in terms of both speed and memory use). Phrasal includes a JNI loader for the efficient C++-based KenLM that can be compiled separately. You must first install a [http://www.oracle.com/technetwork/java/javase/downloads/‎ JDK from Oracle]. Next, make sure to set the  [http://docs.oracle.com/cd/E19182-01/820-7851/inst_cli_jdk_javahome_t/index.html <tt>JAVA_HOME</tt> environment variable]. Place KenLM in $HOME/phrasal.ver/src-cc/kenlm. Finally, compile the loader:
 
  cd $HOME/phrasal.ver/src-cc
 
  cd $HOME/phrasal.ver/src-cc
 
  ./compile_JNI.sh
 
  ./compile_JNI.sh
  
=== KenLM ===
+
To activate this loader, add the "kenlm:" prefix to the language model path in the Phrasal ini file.
 
+
To build language models, you'll need to [http://kheafield.com/code/kenlm/ download and install KenLM]. Both the KenLM site and download package contain detailed installation instructions.
+
  
We will use <tt>lmplz</tt> later on to build language models.
+
=== Word Alignments ===
  
=== Berkeley aligner ===
+
For word alignment, we recommend the [http://code.google.com/p/berkeleyaligner/ Berkeley Aligner]. The Berkeley download contains installation and usage instructions. But it is also okay to use another compatible aligner, such as [https://code.google.com/p/giza-pp/ Giza++].
  
For word alignment, we recommend [http://code.google.com/p/berkeleyaligner/ Berkeley Aligner]. The download contains installation instructions.
+
To run symmetrization heuristics like <tt>grow-diag</tt> during phrase extraction, you'll need to configure the Berkeley Aligner to produce A3 files. Add the parameter <tt>writeGIZA</tt> to the Berkeley configuration file. Here is an  [http://nlp.stanford.edu/software/phrasal/example_files/ucb-align.conf example configuration file].
  
 
== Data Download and Pre-processing ==
 
== Data Download and Pre-processing ==
Line 154: Line 167:
 
If your PATH is configured correctly, you should see the following output:
 
If your PATH is configured correctly, you should see the following output:
 
  Usage: phrasal.sh var_file steps ini_file sys_name
 
  Usage: phrasal.sh var_file steps ini_file sys_name
 
 
  Use dashes and commas in the steps specification e.g. 1-3,6
 
  Use dashes and commas in the steps specification e.g. 1-3,6
 
 
  Step definitions:
 
  Step definitions:
 
   1  Extract phrases from dev set
 
   1  Extract phrases from dev set
Line 220: Line 231:
 
Purpose: Evaluate weight vectors produced during each training iteration on the test set.
 
Purpose: Evaluate weight vectors produced during each training iteration on the test set.
  
This step outputs a <tt>.learn-curve</t> file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.
+
This step outputs a <tt>.learn-curve</tt> file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.
 +
 
 +
Running the above steps correctly on our system results in the following learning curve on newstest2012:
 +
 
 +
iter    weight vector                          newstest2012 BLEU
 +
-----------------------------------------------------------------
 +
0      newstest2011.example.online.0.binwts    23.87
 +
1      newstest2011.example.online.1.binwts    24.18
 +
2      newstest2011.example.online.2.binwts    24.16
 +
3      newstest2011.example.online.3.binwts    24.09
 +
4      newstest2011.example.online.4.binwts    24.14
 +
5      newstest2011.example.online.5.binwts    24.16
 +
6      newstest2011.example.online.6.binwts    24.18
 +
7      newstest2011.example.online.7.binwts    24.09
 +
 
 +
The system is very stable on held-out data, achieving a '''maximum BLEU score of 24.18''' at the end of iteration 1. This result compares fairly well to the [http://matrix.statmt.org WMT results] given that we only used a fraction of the data. We ran with 16 threads, and each of the eight learning iterations lasted about four minutes:
 +
 
 +
28 min. -- newstest2011 rule extraction
 +
26 min. -- newstest2012 rule extraction
 +
32 min. -- tuning on newstest2011
 +
  1 min. -- decoding newstest2012
 +
-------
 +
87 min. total system train/tune time
  
 
=== Troubleshooting ===
 
=== Troubleshooting ===
Line 226: Line 259:
 
The <tt>phrasal-train-tune.sh</tt> script generates three log files:
 
The <tt>phrasal-train-tune.sh</tt> script generates three log files:
 
* $HOME/phrasal.fr-en/newstest2011.baseline.online.log
 
* $HOME/phrasal.fr-en/newstest2011.baseline.online.log
* $HOME/phrasal.fr-en/logs/newstest2011.baseline.stdout.log
+
* $HOME/phrasal.fr-en/logs/newstest2011.baseline.stdout
* $HOME/phrasal.fr-en/logs/newstest2012.newstest2011.baseline.stdout.log
+
* $HOME/phrasal.fr-en/logs/newstest2012.newstest2011.baseline.stdout
  
The first log contains output from step 3. You can see the tuning objective function score by searching for "BLEU", e.g.:
+
The first log contains output from step 2. You can see the tuning objective function score by searching for "BLEU", e.g.:
 
  grep BLEU newstest2011.baseline.online.log
 
  grep BLEU newstest2011.baseline.online.log
  
 
BLEU scores should increase from one epoch to the next.
 
BLEU scores should increase from one epoch to the next.
  
The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.
+
The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java <tt>FileNotFoundException</tt> information in these logs.
 +
 
 +
You can also inspect the intermediate weight files generated by the learning algorithm:
 +
  java edu.stanford.nlp.mt.tools.PrintWeights newstest2011.baseline.online.2.binwts
  
 
== Advanced/Additional Features ==
 
== Advanced/Additional Features ==
Line 242: Line 278:
 
=== Feature Engineering API ===
 
=== Feature Engineering API ===
  
Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. Unlike other MT systems like cdec and Moses, Phrasal loads new feature templates via reflection, so you don't need to recompile the whole system every time you change a feature.
+
Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. [http://www.cdec-decoder.org/documentation/ext-ff.html Like cdec], feature functions can be loaded dynamically without recompiling the whole system. In Phrasal, this is accomplished via reflection. 
  
The Feature API Tutorial---see the <tt>/doc</tt> directory in the download package---describes the interfaces in the API, describes example features, and provides tips for writing feature templates. MT features are tricky. Bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.  
+
The [http://nlp.stanford.edu/software/phrasal/example_files/phrasal-featureapi.pdf Feature API Tutorial] describes the interfaces in the API, describes example features, and provides tips for writing feature templates. MT features are tricky. Bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.  
  
 
Baseline "dense" features are located in <tt>edu.stanford.nlp.mt.decoder.feat</tt>.
 
Baseline "dense" features are located in <tt>edu.stanford.nlp.mt.decoder.feat</tt>.
Line 254: Line 290:
 
  edu.stanford.nlp.mt.decoder.feat.sparse.MyFeature()
 
  edu.stanford.nlp.mt.decoder.feat.sparse.MyFeature()
  
=== Decoder Parameters ===
+
[[More details on Phrasal discriminative features]]
 +
 
 +
=== Statistical Significance Testing ===
 +
Phrasal includes an implementation of the permutation test described by [http://acl.ldc.upenn.edu/W/W05/W05-0908.pdf‎ Riezler and Maxwell]. To obtain p-values for a pair of system outputs, run:
 +
java edu.stanford.nlp.mt.tools.SignificanceTest [bleu|ter] reference_prefix system1 system2
 +
where <tt>reference_prefix</tt> is a common filename prefix for multiple references.
  
Phrasal decoder options are specified in the <tt>.ini</tt> file. The decoder supports popular functions such as force decoding, dropping unknown words, larger beam sizes, etc.
+
=== Advanced Parameters ===
  
For the full list of options, see the javadocs in <tt>edu.stanford.nlp.mt.Phrasal</tt>.
+
Additional phrase extraction options are passed via the <tt>OTHER_EXTRACT_OPTS</tt> in the vars file to <tt>edu.stanford.nlp.mt.train.PhraseExtract</tt>. See the usage and javadocs in that package for a description of the options.
  
=== Online Learning Parameters ===
+
Learning options are passed via <tt>ONLINE_OPTS</tt> in the vars file to <tt>edu.stanford.nlp.mt.tune.OnlineTuner</tt>. See the usage and javadocs in that package for a description of the options.
  
See the usage and javadocs in <tt>edu.stanford.nlp.mt.tune.OnlineTuner</tt> for a description of the parameters supported by the online tuning algorithm.
+
Phrasal decoder options are specified in the <tt>.ini</tt> file. The decoder supports popular functions such as force decoding, dropping unknown words, larger beam sizes, etc. For the full list of options, see the javadocs in <tt>edu.stanford.nlp.mt.Phrasal</tt>.
  
 
=== MERT (Batch) Training ===
 
=== MERT (Batch) Training ===
  
 
Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:
 
Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:
 +
  SEED -- Random seed for MERT
 +
  N_STARTING_POINTS -- Number of random starting points.
 +
  TUNE_NBEST -- n-best list size
 +
  OBJECTIVE -- objective function (e.g., bleu)
 +
  THREADS -- Number of MERT threads
 +
  OPT_FLAGS -- MERT configuration parameters (see scripts/phrasal-mert.pl for a full list of options)
  
SEED -- Random seed for MERT
+
[[More details on MERT options]]
N_STARTING_POINTS -- Number of random starting points.
+
TUNE_NBEST -- n-best list size
+
OBJECTIVE -- objective function (e.g., bleu)
+
THREADS -- Number of MERT threads
+
OPT_FLAGS -- MERT configuration parameters (see scripts/phrasal-mert.pl for a full list of options)
+
  
 
=== Arabic-English Translation ===
 
=== Arabic-English Translation ===
Line 283: Line 325:
 
=== Chinese-English Translation ===
 
=== Chinese-English Translation ===
  
Arabic-English is also common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a [http://nlp.stanford.edu/software/segmenter.shtml free tokenizer/segmenter for Chinese]. Run it on the Chinese side of the parallel data and you should be ready to go.
+
Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a [http://nlp.stanford.edu/software/segmenter.shtml free tokenizer/segmenter for Chinese]. Run it on the Chinese side of the parallel data and you should be ready to go.
  
 
A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in [http://nlp.stanford.edu/software/corenlp.shtml Stanford CoreNLP].
 
A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in [http://nlp.stanford.edu/software/corenlp.shtml Stanford CoreNLP].

Revision as of 13:39, 25 January 2014


Stanford Phrasal User Guide

This guide explains how to set up and train a phrase-based Statistical Machine Translation system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools.

Phrasal is designed for fast training of both traditional Moses-style machine translation (MT) models and large-scale, discriminative translation models with:

  • An intuitive feature engineering API
  • Large-scale learning with AdaGrad+FOBOS
  • Fast search with cube pruning
  • Unpruned language modeling with KenLM

This guide assumes that you are building an MT system on a Unix-like system, and assumes some familiarity with Unix-like command-line interpreters (or shells), such as bash for Linux and Mac OS X and Cygwin for Microsoft Windows. The commands in this tutorial are written for bash, but it is relatively easy to adapt them to other shells. While the core Phrasal MT decoder will run anywhere that Java will run, this is not true for many of the support tools and scripts used to train MT systems.

For background information on statistical machine translation, please see the StatMT site or Philipp Koehn's textbook.

System Requirements

Phrasal is written for Java 1.6+. If your JVM is older than 1.6, then you should upgrade. Many of the utility scripts included with Phrasal assume Python 2.7+ (but not Python3).

Support

Questions about Phrasal should be directed to the JavaNLP mailing list.

Citations

If you use Stanford Phrasal for research purposes, then we ask that you cite these two papers:

@inproceedings{Cer2010,
 author = {Cer, Daniel and Galley, Michel and Jurafsky, Daniel and Manning, Christopher D.},
 title = {Phrasal: a toolkit for statistical machine translation with facilities for extraction and incorporation of arbitrary model features},
 booktitle = {Proceedings of the NAACL HLT 2010 Demonstration Session},
 year = {2010}
}
@InProceedings{Green2013,
 author    = {Green, Spence  and  Wang, Sida  and  Cer, Daniel  and  Manning, Christopher D.},
 title     = {Fast and Adaptive Online Training of Feature-Rich Translation Models},
 booktitle = {Proceedings of ACL},
 year      = {2013}
}

The first paper describes the core decoder. The second covers the parameter learning algorithm.

If you use the lexicalized re-ordering model (which is enabled in the system described in this tutorial), then please cite:

@inproceedings{Galley2008,
 author = {Michel Galley and Christopher D. Manning},
 title = {A Simple and Effective Hierarchical Phrase Reordering Model},
 booktitle = {Proceedings of EMNLP},
 year = {2008}
}

Installation

Download the latest version of Phrasal and the latest version of Stanford CoreNLP.

The Phrasal download file has the format phrasal.ver.tar.gz where ver is the current version number. Assume that you download to the directory $HOME=/home/me.

The CoreNLP download file has the format stanford-corenlp-full-date.zip where date is the release date. You will need to substitute the correct values for the version you downloaded into the commands below.

Change to your download directory and unpack the archives:

cd $HOME
tar -xzf phrasal.ver.tar.gz
unzip stanford-corenlp-full-date.zip
cd phrasal.ver

Add the Phrasal and CoreNLP jar files to your Java CLASSPATH in the following order:

export CLASSPATH=$CLASSPATH:$HOME/stanford-corenlp-full-date/stanford-corenlp-ver.jar:$HOME/phrasal.ver/phrasal.ver.jar

Add the scripts directory to your shell PATH:

export PATH="$PATH":$HOME/phrasal.ver/scripts
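
A misspelled jar path in CLASSPATH usually surfaces only later as a Java class-loading error, so it can help to check the entries right away. The following is a minimal sketch, not part of the Phrasal distribution; the function name <tt>missing_classpath_entries</tt> is hypothetical.

```shell
# List CLASSPATH entries that do not exist on disk.
# Prints one missing entry per line; prints nothing when every entry exists.
# Hypothetical helper, not shipped with Phrasal.
missing_classpath_entries() (
    # Split the colon-separated list in $1 (defaults to $CLASSPATH).
    IFS=:
    # Disable globbing so entries such as "lib/*" are not expanded.
    set -f
    for entry in ${1-$CLASSPATH}; do
        # Skip empty entries, e.g. from a leading ":" when CLASSPATH was unset.
        [ -n "$entry" ] || continue
        [ -e "$entry" ] || printf '%s\n' "$entry"
    done
)
```

Calling <tt>missing_classpath_entries</tt> with no arguments checks the current CLASSPATH; any path it prints should be corrected before continuing.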

Compiling Phrasal

If you change the Phrasal source code, then you'll need to recompile it. The distribution includes an ant script: build.xml. Windows users will need to edit that script to specify a path to Cygwin. Open the script and look for the "TODO" statement. Unix users can simply change to the distribution root and execute:

ant

(Note: in OS X 10.9, ant is no longer available by default. You will need to install it, either directly from the Apache web site or by using systems like HomeBrew or MacPorts.)

Language Modeling

Phrasal comes with a Java language model query implementation, but does not include a tool for estimating language models. To build language models, we recommend that you use KenLM. You can download the latest version from the KenLM site, or use the copy of KenLM in the src-cc folder of the Phrasal download. Both the KenLM site and the download package (or the src-cc/kenlm folder we include) contain installation instructions. KenLM is written in C++ and requires Boost.

Later on in this tutorial we will use KenLM's lmplz tool to build language models.

The Phrasal Java-based language model loader can load the ARPA format files created by lmplz, but it is quite inefficient for large language models (in terms of both speed and memory use). Phrasal includes a JNI loader for the efficient C++-based KenLM that can be compiled separately. You must first install a JDK from Oracle. Next, make sure to set the JAVA_HOME environment variable. Place KenLM in $HOME/phrasal.ver/src-cc/kenlm. Finally, compile the loader:

cd $HOME/phrasal.ver/src-cc
./compile_JNI.sh

To activate this loader, add the "kenlm:" prefix to the language model path in the Phrasal ini file.

Word Alignments

For word alignment, we recommend the Berkeley Aligner. The Berkeley download contains installation and usage instructions. But it is also okay to use another compatible aligner, such as Giza++.

To run symmetrization heuristics like grow-diag during phrase extraction, you'll need to configure the Berkeley Aligner to produce A3 files. Add the parameter writeGIZA to the Berkeley configuration file. Here is an example configuration file.

Data Download and Pre-processing

Download the training, development, and test data from the WMT 2013 shared task site:

Unpack the files in a convenient location. Assume that we will build a system for French (fr) to English (en) translation, so we'll need the following files:

Parallel data:

  • europarl-v7.fr-en.en
  • europarl-v7.fr-en.fr

Monolingual data:

  • europarl-v7.en

Development data:

  • newstest2011.fr
  • newstest2011.en

Test data:

  • newstest2012.fr
  • newstest2012.en

Scripts:

  • tokenizer.perl
  • lowercase.perl

To prepare the data, first tokenize and lowercase the parallel, monolingual, development, and test data using the scripts, e.g.:

cat europarl-v7.fr-en.en | tokenizer.perl -l en | lowercase.perl > europarl-v7.fr-en.en.tok

Each processed file should have the .tok extension.
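
Since the same pipeline has to be applied to every file, a small wrapper can save typing. This is only a sketch under stated assumptions: the function names and the <tt>TOKENIZER</tt> and <tt>LOWERCASER</tt> override variables are hypothetical (they are not part of Phrasal or the WMT scripts), and the file list simply repeats the files named above.

```shell
# Tokenize and lowercase one file, writing FILE.tok next to it.
# TOKENIZER and LOWERCASER are hypothetical override variables; they default
# to the WMT script names used in this guide.
preprocess() (
    lang=$1
    file=$2
    ${TOKENIZER:-tokenizer.perl} -l "$lang" < "$file" \
        | ${LOWERCASER:-lowercase.perl} > "$file.tok"
)

# Process every file of the French-English setup described in this guide.
preprocess_all() (
    for f in europarl-v7.fr-en.en europarl-v7.en newstest2011.en newstest2012.en; do
        preprocess en "$f" || exit 1
    done
    for f in europarl-v7.fr-en.fr newstest2011.fr newstest2012.fr; do
        preprocess fr "$f" || exit 1
    done
)
```

Run <tt>preprocess_all</tt> from the directory that contains the unpacked data files.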

For the bitext, run the corpus cleaner included with Phrasal:

clean-corpus.py europarl-v7.fr-en.en.tok europarl-v7.fr-en.fr.tok

This command will produce two files:

  • europarl-v7.fr-en.en.tok.filt.gz
  • europarl-v7.fr-en.fr.tok.filt.gz
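
The two filtered files must stay sentence-aligned, so they should contain the same number of lines. A quick check along these lines can catch a broken bitext before alignment; the function names are hypothetical and not part of Phrasal.

```shell
# Print the number of lines in a gzip-compressed file.
gz_lines() {
    gzip -cd "$1" | wc -l | tr -d ' '
}

# Succeed only if both sides of a bitext have the same number of lines.
# Hypothetical helper, not shipped with Phrasal.
check_parallel() {
    src_n=$(gz_lines "$1")
    tgt_n=$(gz_lines "$2")
    if [ "$src_n" -ne "$tgt_n" ]; then
        echo "line count mismatch: $1 has $src_n, $2 has $tgt_n" >&2
        return 1
    fi
    echo "$src_n sentence pairs"
}
```

For example: <tt>check_parallel europarl-v7.fr-en.en.tok.filt.gz europarl-v7.fr-en.fr.tok.filt.gz</tt>.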

Word Alignment

Align the bitext with the Berkeley aligner (or another aligner like GIZA++). The Berkeley aligner distribution contains an example configuration file.

In either case, you'll need the A3 files created by the aligner for phrase extraction later on. Assume that you have these files:

training.fr-en.A3
training.en-fr.A3

Language Model Estimation

Estimate a language model from both the monolingual data and the target-side of the parallel data:

cat europarl-v7.fr-en.en.tok europarl-v7.en.tok | lmplz -o 4 > 4gm.arpa
build_binary trie 4gm.arpa 4gm.bin
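
Before copying the model anywhere, it can be worth confirming that the ARPA file has the expected order. The sketch below reads the standard ARPA header (the <tt>ngram N=count</tt> lines after <tt>\data\</tt>) and prints the highest order it finds; the function name <tt>arpa_order</tt> is hypothetical.

```shell
# Print the highest n-gram order declared in the header of an ARPA file.
# Hypothetical helper; relies only on the standard ARPA header layout.
arpa_order() {
    awk '
        /^\\data\\/ { in_header = 1; next }
        in_header && /^ngram [0-9]+=/ {
            # $2 looks like "3=12345": order, then the n-gram count.
            split($2, kv, "=")
            if (kv[1] + 0 > order) order = kv[1] + 0
            next
        }
        # The first section marker (e.g. \1-grams:) ends the header.
        in_header && /^\\/ { exit }
        END { print order + 0 }
    ' "$1"
}
```

For the model built above, <tt>arpa_order 4gm.arpa</tt> should report the order passed to <tt>lmplz</tt> with <tt>-o</tt>.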

System Training

Create a new directory for the MT system and copy over the system tuning files:

mkdir $HOME/phrasal.fr-en
cd $HOME/phrasal.fr-en
cp $HOME/phrasal.ver/example/example.vars fr-en.vars
cp $HOME/phrasal.ver/example/example.ini fr-en.ini
cp $HOME/phrasal.ver/example/example.binwts fr-en.initial.binwts

You might also want to copy over the language model:

cp 4gm.bin $HOME/phrasal.fr-en

Open the vars file and read the comments. If you've followed the instructions carefully until now, then all of the filenames and paths should match. However, you should verify the files and paths before proceeding.

Tuning and evaluation consist of six steps, which are configured for convenience in a script. To see the steps, run:

phrasal.sh

If your PATH is configured correctly, you should see the following output:

Usage: phrasal.sh var_file steps ini_file sys_name
Use dashes and commas in the steps specification e.g. 1-3,6
Step definitions:
 1  Extract phrases from dev set
 2  Run tuning
 3  Extract phrases from test set
 4  Decode test set
 5  Output results file
 6  Generate a learning curve from an online run

To train and evaluate a system, which we will call "baseline," run this command:

phrasal.sh fr-en.vars 1-6 fr-en.ini baseline
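
To make the steps syntax concrete, here is a small illustration of how a specification such as <tt>1-3,6</tt> expands into individual steps. It is a sketch written for this guide, not the parser that <tt>phrasal.sh</tt> actually uses, and the function name <tt>expand_steps</tt> is hypothetical.

```shell
# Expand a steps specification such as "1-3,6" into "1 2 3 6".
# Illustration only; not the parser used by phrasal.sh.
expand_steps() (
    IFS=,
    set -f
    out=
    for part in $1; do
        case $part in
            # A dash marks an inclusive range: first-last.
            *-*) first=${part%-*}; last=${part#*-} ;;
            # Otherwise the part is a single step number.
            *)   first=$part; last=$part ;;
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "$out"
)
```

So <tt>1-6</tt> runs every step, while <tt>1-3,6</tt> skips steps 4 and 5.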

An explanation of each step along with associated parameters in the vars file follows.

Extract phrases from dev set

Purpose: Extract translation rules from the parallel data for the development set.

Phrase extraction parameters:

EXTRACT_SET -- Specifies the bitext files and symmetrization heuristic (default heuristic is grow-diag).
THREADS_EXTRACT -- Number of threads to use for phrase extraction.
MAX_PHRASE_LEN -- Maximum source phrase length.
OTHER_EXTRACT_OPTS -- Other options that are described in edu.stanford.nlp.mt.train.PhraseExtract
LO_ARGS -- Parameters for the lexicalized re-ordering model.

Dev set parameters:

TUNE_SET_NAME -- The name of the tuning/dev set (e.g., newstest2011).
TUNE_SET -- The actual filename of the dev set (e.g., newstest2011.fr.tok).
TUNE_REF -- The reference file of the dev set (e.g., newstest2011.en.tok).

Run tuning

Purpose: Estimate model weights from the dev set.

Parameters:

TUNE_MODE -- The tuning algorithm. Here we'll use online. The Advanced features section describes how to enable batch tuning with MERT.
INITIAL_WTS -- The initial weights file.
TUNE_NBEST -- The n-best list size.
ONLINE_OPTS -- Algorithm-specific options for the online tuner.

Extract phrases from test set

Purpose: Extract translation rules from the parallel data for the test set.

Parameters:

DECODE_SET_NAME -- Name of the test set (e.g., newstest2012).
DECODE_SET -- The actual filename of the test set (e.g., newstest2012.fr.tok).
NBEST -- n-best list size to generate (optional)

Decode test set

Purpose: Decode (translate) the test set.

Output results file

Purpose: Evaluate translation quality using BLEU. This step outputs a .bleu file for the test set.

Parameters:

REFDIR -- The base path of reference directory.

The reference directory must have a particular format. For example, here is how to make a reference directory for newstest2012:

mkdir -p $HOME/refs/newstest2012
cp newstest2012.en.tok $HOME/refs/newstest2012/ref0

The reference filenames must have the prefix ref. Multiple references can be specified by numbering the files, e.g., ref0, ref1, ref2, etc.
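
When a test set has several references, the layout described above can be created with a small helper. This is a sketch; the function name <tt>make_refdir</tt> is hypothetical and not part of Phrasal.

```shell
# Create BASE/SETNAME and copy each reference file to ref0, ref1, ...
# Usage: make_refdir BASE SETNAME REF_FILE [REF_FILE ...]
# Hypothetical helper, not shipped with Phrasal.
make_refdir() (
    dir=$1/$2
    shift 2
    mkdir -p "$dir" || exit 1
    n=0
    for ref in "$@"; do
        # References are numbered in the order they are given.
        cp "$ref" "$dir/ref$n" || exit 1
        n=$((n + 1))
    done
)
```

For the single-reference case in this guide: <tt>make_refdir "$HOME/refs" newstest2012 newstest2012.en.tok</tt>.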

Generate a learning curve from an online run

Purpose: Evaluate weight vectors produced during each training iteration on the test set.

This step outputs a .learn-curve file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.

Running the above steps correctly on our system results in the following learning curve on newstest2012:

iter    weight vector                           newstest2012 BLEU
-----------------------------------------------------------------
0       newstest2011.example.online.0.binwts    23.87
1       newstest2011.example.online.1.binwts    24.18
2       newstest2011.example.online.2.binwts    24.16
3       newstest2011.example.online.3.binwts    24.09
4       newstest2011.example.online.4.binwts    24.14
5       newstest2011.example.online.5.binwts    24.16
6       newstest2011.example.online.6.binwts    24.18
7       newstest2011.example.online.7.binwts    24.09

The system is very stable on held-out data, achieving a maximum BLEU score of 24.18 at the end of iteration 1. This result compares fairly well to the WMT results given that we only used a fraction of the data. We ran with 16 threads, and each of the eight learning iterations lasted about four minutes:

28 min. -- newstest2011 rule extraction
26 min. -- newstest2012 rule extraction
32 min. -- tuning on newstest2011
 1 min. -- decoding newstest2012
-------
87 min. total system train/tune time
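
Picking the best weight vector from a learning curve can be automated. The sketch below assumes that the curve is stored as whitespace-separated columns in the order shown in the table above (iteration, weight file, BLEU); the actual layout of the <tt>.learn-curve</tt> file may differ, and the function name <tt>best_iteration</tt> is hypothetical.

```shell
# Print the row with the highest BLEU score from a learning-curve table.
# Assumes columns: iteration, weight file, BLEU (as in the table above).
# Header and separator lines are skipped; ties keep the earliest iteration.
best_iteration() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 3 {
            if (!seen || $3 + 0 > best) {
                best = $3 + 0
                line = $1 " " $2 " " $3
                seen = 1
            }
        }
        END { if (seen) print line }
    ' "$1"
}
```

Applied to the table above, this selects iteration 1, the first iteration to reach 24.18.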

Troubleshooting

The phrasal-train-tune.sh script generates three log files:

  • $HOME/phrasal.fr-en/newstest2011.baseline.online.log
  • $HOME/phrasal.fr-en/logs/newstest2011.baseline.stdout
  • $HOME/phrasal.fr-en/logs/newstest2012.newstest2011.baseline.stdout

The first log contains output from step 2. You can see the tuning objective function score by searching for "BLEU", e.g.:

grep BLEU newstest2011.baseline.online.log

BLEU scores should increase from one epoch to the next.

The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.

You can also inspect the intermediate weight files generated by the learning algorithm:

 java edu.stanford.nlp.mt.tools.PrintWeights newstest2011.baseline.online.2.binwts
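
To compare weights across iterations, it can help to dump every intermediate file at once. The sketch below wraps the PrintWeights command shown above in a loop. It assumes that PrintWeights writes its listing to standard output and that weight files follow the <prefix>.<iteration>.binwts naming seen in this guide; the function name dump_weights is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a text dump next to each intermediate weight file.
# Assumes PrintWeights prints to standard output (an assumption, not verified).
dump_weights() {
  local prefix="$1" f
  for f in "$prefix".*.binwts; do
    [ -e "$f" ] || continue   # the glob matched nothing
    java edu.stanford.nlp.mt.tools.PrintWeights "$f" > "${f%.binwts}.txt"
  done
}
```

For example, dump_weights newstest2011.baseline.online would produce newstest2011.baseline.online.0.txt, newstest2011.baseline.online.1.txt, and so on, which can then be compared with diff.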

Advanced/Additional Features

This section describes features for advanced users who wish to maximize translation quality.

Feature Engineering API

Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. As in cdec, feature functions can be loaded dynamically without recompiling the whole system. In Phrasal, this is accomplished via reflection.

The Feature API Tutorial describes the interfaces in the API, describes example features, and provides tips for writing feature templates. MT features are tricky. Bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.

Baseline "dense" features are located in edu.stanford.nlp.mt.decoder.feat.

Examples of discriminative "sparse" features can be found in edu.stanford.nlp.mt.decoder.feat.sparse.

Feature templates are loaded via reflection by listing them in the .ini file as follows:

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.sparse.MyFeature()

More details on Phrasal discriminative features

Statistical Significance Testing

Phrasal includes an implementation of the permutation test described by Riezler and Maxwell. To obtain p-values for a pair of system outputs, run:

java edu.stanford.nlp.mt.tools.SignificanceTest [bleu|ter] reference_prefix system1 system2

where reference_prefix is a common filename prefix for multiple references.

Advanced Parameters

Additional phrase extraction options are passed via OTHER_EXTRACT_OPTS in the vars file to edu.stanford.nlp.mt.train.PhraseExtract. See the usage and javadocs in that package for a description of the options.

Learning options are passed via ONLINE_OPTS in the vars file to edu.stanford.nlp.mt.tune.OnlineTuner. See the usage and javadocs in that package for a description of the options.

Phrasal decoder options are specified in the .ini file. The decoder supports popular functions such as force decoding, dropping unknown words, larger beam sizes, etc. For the full list of options, see the javadocs in edu.stanford.nlp.mt.Phrasal.

MERT (Batch) Training

Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:

 SEED -- Random seed for MERT
 N_STARTING_POINTS -- Number of random starting points.
 TUNE_NBEST -- n-best list size
 OBJECTIVE -- objective function (e.g., bleu)
 THREADS -- Number of MERT threads
 OPT_FLAGS -- MERT configuration parameters (see scripts/phrasal-mert.pl for a full list of options)
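
As a rough illustration, the batch block of a vars file might look like the fragment below. It assumes the vars file uses shell-style NAME=value assignments, and every value is a placeholder chosen for the example rather than a recommended default.

```shell
# Illustrative batch (MERT) settings for the vars file.
# All values are assumptions made for this example, not recommended defaults.
SEED=1                 # random seed for MERT
N_STARTING_POINTS=20   # number of random starting points
TUNE_NBEST=200         # n-best list size
OBJECTIVE=bleu         # objective function
THREADS=16             # number of MERT threads
OPT_FLAGS=""           # left empty here; see scripts/phrasal-mert.pl for valid options
```

More starting points and larger n-best lists generally make MERT slower, so these two values are the usual ones to adjust first.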

More details on MERT options

Arabic-English Translation

Arabic-English is a common research language pair in the United States. The steps in this tutorial can be used to build an Ar-En system with the exception of Arabic pre-processing and segmentation. The Stanford NLP group provides a free tokenizer/segmenter for Arabic. Run it on the Arabic side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.

Chinese-English Translation

Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a free tokenizer/segmenter for Chinese. Run it on the Chinese side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.