Stanford Phrasal User Guide
This guide explains how to set up and train a phrase-based Statistical Machine Translation system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools.
Phrasal is designed for fast training of both traditional Moses-style machine translation (MT) models and large-scale, discriminative translation models with:
- An intuitive feature engineering API
- Large-scale learning with AdaGrad+FOBOS
- Fast search with cube pruning
- Unpruned language modeling with KenLM
This guide assumes that you are building an MT system on a Unix-like system and that you have some familiarity with Unix-like command-line interpreters (or shells), such as bash for Linux and Mac OS X and Cygwin for Microsoft Windows. The commands in this tutorial are written for bash, but it is relatively easy to adapt them to other shells. While the core Phrasal MT decoder will run anywhere that Java will run, this is not true for many of the support tools and scripts used to train MT systems.
For background information on statistical machine translation, please see the StatMT site or Philipp Koehn's textbook.
System Requirements
Phrasal is written for Java 1.8+. If your JVM is older than 1.8, then you should upgrade. Many of the utility scripts included with Phrasal assume Python 2.7+ (but not Python 3).
To compile Phrasal and automatically fetch its dependencies, you need Gradle.
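A quick way to confirm your toolchain — a minimal sketch that assumes java, python, and gradle are already on your PATH:

java -version       # should report 1.8 or later
python --version    # should report 2.7.x
gradle --version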
Support
Questions about Phrasal should be directed to the JavaNLP mailing list.
Citations
If you use Stanford Phrasal for research purposes, then we ask that you cite these two papers:
@inproceedings{Green2014,
  author = {Spence Green and Daniel Cer and Christopher D. Manning},
  title = {Phrasal: A Toolkit for New Directions in Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  year = {2014}
}
@InProceedings{Green2013,
  author = {Green, Spence and Wang, Sida and Cer, Daniel and Manning, Christopher D.},
  title = {Fast and Adaptive Online Training of Feature-Rich Translation Models},
  booktitle = {Proceedings of ACL},
  year = {2013}
}
The first paper describes the core decoder. The second covers the parameter learning algorithm.
If you use the hierarchical lexicalized re-ordering model (which is enabled in the system described in this tutorial), then please cite:
@inproceedings{Galley2008,
  author = {Michel Galley and Christopher D. Manning},
  title = {A Simple and Effective Hierarchical Phrase Reordering Model},
  booktitle = {Proceedings of EMNLP},
  year = {2008}
}
Installation
Clone the latest version of Phrasal and Stanford CoreNLP:
git clone http://github.com/stanfordnlp/phrasal
git clone http://github.com/stanfordnlp/CoreNLP
Set the CORENLP_HOME environment variable to the path of your local CoreNLP git repo. Suppose that the repo is in $HOME. You would execute:
export CORENLP_HOME=$HOME/CoreNLP
You might want to add the Phrasal scripts directory to your shell PATH:
export PATH=$PATH:$HOME/phrasal/scripts
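If you want these settings in every new shell — assuming bash with a standard ~/.bashrc — you can append them once:

echo 'export CORENLP_HOME=$HOME/CoreNLP' >> ~/.bashrc
echo 'export PATH=$PATH:$HOME/phrasal/scripts' >> ~/.bashrc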
Compiling Phrasal
See the build instructions in the README.md file in the root directory of the Phrasal git repo.
Language Modeling
Phrasal comes with a Java language model query implementation, but does not include a tool for estimating language models. To build language models, we recommend that you use KenLM. You can download the latest version from the KenLM site, or use the copy of KenLM included in the src-cc folder of the Phrasal download. Both the KenLM site and the included src-cc/kenlm folder contain installation instructions. (KenLM is written in C++ and requires Boost.)
Later on in this tutorial we will use KenLM's lmplz tool to build language models.
The Phrasal Java-based language model loader can load the ARPA format files created by lmplz, but it is quite inefficient for large language models (in terms of both speed and memory use). Phrasal includes a JNI loader for the efficient C++-based KenLM that can be compiled separately. You must first install a JDK from Oracle. Next, make sure to set the JAVA_HOME environment variable. Finally, compile the loader:
gradle compileKenLM
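If JAVA_HOME is not already set, one common way to locate the JDK on Linux — a sketch that assumes javac is on your PATH and that your readlink supports -f — is:

export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))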
If you are using Mac OS X and you get a compilation error caused by a missing jni.h file, replace this line in compile_JNI.sh:
$CXX -I. -DKENLM_MAX_ORDER=7 -I$JAVA_HOME/include -Ikenlm/ \
  -I$JAVA_HOME/include/linux edu_stanford_nlp_mt_lm_KenLM.cc kenlm/lm/*.o kenlm/util/*.o \
  kenlm/util/double-conversion/*.o -shared -o libPhrasalKenLM$SUFFIX $CXXFLAGS $LDFLAGS \
  $extra_flags $RT
with the following line:
$CXX -I. -DKENLM_MAX_ORDER=7 -I$JAVA_HOME/include \
  -I"/System/Library/Frameworks/JavaVM.framework/Headers" -Ikenlm/ \
  -I$JAVA_HOME/include/linux edu_stanford_nlp_mt_lm_KenLM.cc kenlm/lm/*.o kenlm/util/*.o \
  kenlm/util/double-conversion/*.o -shared -o libPhrasalKenLM$SUFFIX $CXXFLAGS $LDFLAGS \
  $extra_flags $RT
To activate this loader, add the "kenlm:" prefix to the language model path in the Phrasal ini file.
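For example — assuming your ini file uses the Moses-style [lmodel-file] section found in the example ini, and with an illustrative absolute path — the entry would look like:

[lmodel-file]
kenlm:/home/you/phrasal.fr-en/4gm.bin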
compile_JNI.sh builds only the part needed for querying language models. If you also want to estimate language models, you must build lmplz:
cd $HOME/phrasal.ver/src-cc/kenlm
./bjam
Word Alignments
For word alignment, we recommend the Berkeley Aligner. The Berkeley download contains installation and usage instructions. You may also use another compatible aligner, such as Giza++.
To run symmetrization heuristics like grow-diag during phrase extraction, you'll need to configure the Berkeley Aligner to produce A3 files. Add the parameter writeGIZA to the Berkeley configuration file. Here is an example configuration file.
Data Download and Pre-processing
Download the training, development, and test data from the WMT 2013 shared task site.
Unpack the files in a convenient location. Assume that we will build a system for French (fr) to English (en) translation, so we'll need the following files:
Parallel data:
- europarl-v7.fr-en.en
- europarl-v7.fr-en.fr
Monolingual data:
- europarl-v7.en
Development data:
- newstest2011.fr
- newstest2011.en
Test data:
- newstest2012.fr
- newstest2012.en
Scripts:
- tokenizer.perl
- lowercase.perl
To prepare the data, first tokenize and lowercase the parallel, monolingual, development, and test data using the scripts, e.g.:
cat europarl-v7.fr-en.en | tokenizer.perl -l en | lowercase.perl > europarl-v7.fr-en.en.tok
Each processed file should have the .tok extension.
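Here is a sketch that tokenizes and lowercases every file listed above in one pass, assuming tokenizer.perl and lowercase.perl are executable and on your PATH:

# French-side files use the French tokenizer rules.
for f in europarl-v7.fr-en.fr newstest2011.fr newstest2012.fr; do
  cat $f | tokenizer.perl -l fr | lowercase.perl > $f.tok
done
# English-side files, including the monolingual data.
for f in europarl-v7.fr-en.en europarl-v7.en newstest2011.en newstest2012.en; do
  cat $f | tokenizer.perl -l en | lowercase.perl > $f.tok
done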
For the bitext, run the corpus cleaner included with Phrasal:
clean-corpus.py europarl-v7.fr-en.en.tok europarl-v7.fr-en.fr.tok
This command will produce two files:
- europarl-v7.fr-en.en.tok.filt.gz
- europarl-v7.fr-en.fr.tok.filt.gz
Word Alignment
Align the bitext with the Berkeley aligner (or another aligner like GIZA++). The Berkeley aligner distribution contains an example configuration file.
In either case, you'll need the A3 files created by the aligner for phrase extraction later on. Assume that you have these files:
training.fr-en.A3
training.en-fr.A3
Language Model Estimation
Estimate a language model from both the monolingual data and the target-side of the parallel data:
cat europarl-v7.fr-en.en.tok europarl-v7.en.tok | lmplz -o 4 > 4gm.arpa
build_binary trie 4gm.arpa 4gm.bin
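As a quick sanity check — this assumes you built the full KenLM toolkit with bjam as described above, which places a query binary in kenlm/bin — you can score a few tokenized sentences against the binarized model:

head -3 newstest2011.en.tok | $HOME/phrasal.ver/src-cc/kenlm/bin/query 4gm.bin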
System Training
Create a new directory for the MT system and copy over the system tuning files:
mkdir $HOME/phrasal.fr-en
cd $HOME/phrasal.fr-en
cp $HOME/phrasal.ver/example/example.vars fr-en.vars
cp $HOME/phrasal.ver/example/example.ini fr-en.ini
cp $HOME/phrasal.ver/example/example.binwts fr-en.initial.binwts
You might also want to copy over the language model:
cp 4gm.bin $HOME/phrasal.fr-en
Open the vars file and read the comments. If you've followed the instructions carefully until now, then all of the filenames and paths should match. However, you should verify the files and paths before proceeding.
Tuning and evaluation consist of six stages, which are configured for convenience in a script. To see the phases, run:
phrasal.sh
If your PATH is configured correctly, you should see the following output:
Usage: phrasal.sh var_file steps ini_file sys_name
Use dashes and commas in the steps specification e.g. 1-3,6
Step definitions:
 1  Extract phrases from dev set
 2  Run tuning
 3  Extract phrases from test set
 4  Decode test set
 5  Output results file
 6  Generate a learning curve from an online run
To train and evaluate a system, which we will call "baseline," run this command:
phrasal.sh fr-en.vars 1-6 fr-en.ini baseline
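The steps argument also accepts subsets, per the usage message above. For example, after tuning has completed once, you could re-run only the test-set extraction, decoding, and scoring:

phrasal.sh fr-en.vars 3-5 fr-en.ini baseline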
An explanation of each step along with associated parameters in the vars file follows.
Extract phrases from dev set
Purpose: Extract translation rules from the parallel data for the development set.
Phrase extraction parameters:
EXTRACT_SET -- Specifies the bitext files and symmetrization heuristic (the default heuristic is grow-diag).
THREADS_EXTRACT -- Number of threads to use for phrase extraction.
MAX_PHRASE_LEN -- Maximum source phrase length.
OTHER_EXTRACT_OPTS -- Other options that are described in edu.stanford.nlp.mt.train.PhraseExtract
LO_ARGS -- Parameters for the lexicalized re-ordering model.
Dev set parameters:

TUNE_SET_NAME -- The name of the tuning/dev set (e.g., newstest2011).
TUNE_SET -- The actual filename of the dev set (e.g., newstest2011.fr.tok).
TUNE_REF -- The reference file of the dev set (e.g., newstest2011.en.tok).
Run tuning
Purpose: Estimate model weights from the dev set.
Parameters:
TUNE_MODE -- The tuning algorithm. Here we'll use online. The Advanced features section describes how to enable batch tuning with MERT.
INITIAL_WTS -- The initial weights file.
TUNE_NBEST -- The n-best list size.
ONLINE_OPTS -- Algorithm-specific options for the online tuner.
Extract phrases from test set
Purpose: Extract translation rules from the parallel data for the test set.
Parameters:
DECODE_SET_NAME -- Name of the test set (e.g., newstest2012).
DECODE_SET -- The actual filename of the test set (e.g., newstest2012.fr.tok).
NBEST -- n-best list size to generate (optional)
Decode test set
Purpose: Decode (translate) the test set.
Output results file
Purpose: Evaluate translation quality using BLEU. This step outputs a .bleu file for the test set.
Parameters:
REFDIR -- The base path of the reference directory.
The reference directory must have a particular format. For example, here is how to make a reference directory for newstest2012:
mkdir -p $HOME/refs/newstest2012
cp newstest2012.en.tok $HOME/refs/newstest2012/ref0
The reference filenames must have the prefix ref. Multiple references can be specified by naming the files ref0, ref1, ref2, etc.
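For instance, if you had a second (hypothetical) reference file newstest2012.alt.en.tok, you would add it as:

cp newstest2012.alt.en.tok $HOME/refs/newstest2012/ref1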
Generate a learning curve from an online run
Purpose: Evaluate weight vectors produced during each training iteration on the test set.
This step outputs a .learn-curve file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.
Running the above steps correctly on our system results in the following learning curve on newstest2012:
iter  weight vector                          newstest2012 BLEU
-----------------------------------------------------------------
0     newstest2011.example.online.0.binwts   23.87
1     newstest2011.example.online.1.binwts   24.18
2     newstest2011.example.online.2.binwts   24.16
3     newstest2011.example.online.3.binwts   24.09
4     newstest2011.example.online.4.binwts   24.14
5     newstest2011.example.online.5.binwts   24.16
6     newstest2011.example.online.6.binwts   24.18
7     newstest2011.example.online.7.binwts   24.09
The system is very stable on held-out data, achieving a maximum BLEU score of 24.18 at the end of iteration 1. This result compares fairly well to the WMT results given that we used only a fraction of the data. We ran with 16 threads, and each of the eight learning iterations lasted about four minutes:
28 min. -- newstest2011 rule extraction
26 min. -- newstest2012 rule extraction
32 min. -- tuning on newstest2011
 1 min. -- decoding newstest2012
-------
87 min. total system train/tune time
Troubleshooting
The phrasal.sh script generates three log files:
- $HOME/phrasal.fr-en/newstest2011.baseline.online.log
- $HOME/phrasal.fr-en/logs/newstest2011.baseline.stdout
- $HOME/phrasal.fr-en/logs/newstest2012.newstest2011.baseline.stdout
The first log contains output from step 2. You can see the tuning objective function score by searching for "BLEU", e.g.:
grep BLEU newstest2011.baseline.online.log
BLEU scores should increase from one epoch to the next.
The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.
You can also inspect the intermediate weight files generated by the learning algorithm:
java edu.stanford.nlp.mt.tools.PrintWeights newstest2011.baseline.online.2.binwts
Advanced/Additional Features
This section describes features for advanced users who wish to maximize translation quality.
Word Classes
Phrasal 3.4 comes with several featurizers that use word classes. To use these featurizers, you need mappings from words to classes for the source and the target language. Phrasal includes an implementation of a very fast word clustering algorithm that allows you to train word classes on corpora containing billions of tokens within a few hours. As the following table shows, this is up to three orders of magnitude faster than other popular tools:
Algorithm (implementation) | threads | time (min.sec) |
---|---|---|
Brown (wcluster) | 1 | 1023.39 |
Clark (cluster_neyessen) | 1 | 890.11 |
Och (mkcls) | 1 | 199.04 |
Our algorithm | 8 | 2.42 |
If you use our implementation for research purposes, then we ask that you cite this paper:
@inproceedings{Green2014b,
  author = {Spence Green and Daniel Cer and Christopher D. Manning},
  title = {An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  year = {2014}
}
To train the word classes, run word2class.sh on all the data for each language:
word2class.sh 10g 1 -name en_cls europarl-v7.fr-en.en.tok europarl-v7.en.tok > en.cls
word2class.sh 10g 1 -name fr_cls europarl-v7.fr-en.fr.tok > fr.cls
To use the word classes in any featurizer you have to add the mappings to your ini file:
[target-class-map]
en.cls

[source-class-map]
fr.cls
Feature Engineering API
Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. Like cdec, feature functions can be loaded dynamically without recompiling the whole system. In Phrasal, this is accomplished via reflection.
The Feature API Tutorial describes the interfaces in the API, walks through example features, and provides tips for writing feature templates. MT features are tricky: bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.
Baseline "dense" features are located in edu.stanford.nlp.mt.decoder.feat.base.
Examples of discriminative "sparse" features can be found in edu.stanford.nlp.mt.decoder.feat.sparse.
Feature templates are loaded via reflection; specify them in the .ini file as follows:
[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.sparse.MyFeature()
More details on Phrasal discriminative features
Feature Augmentation
Phrasal supports domain adaptation via feature augmentation à la Daumé (2007). To enable it, add this option to the *.ini file:
[feature-augmentation]
mode
where "mode" is one of all,dense,extended.
For domain specification, you need to generate an input properties file (see edu.stanford.nlp.mt.util.{InputProperty,InputProperties}). The input properties file contains a set of key/value pairs for each segment in the source input file. For domain splitting, you would have:
Domain=tech
Domain=legal
Domain=nw
Add the input properties file to the *.ini file as follows:
[input-properties]
filename
Statistical Significance Testing
Phrasal includes an implementation of the permutation test described by Riezler and Maxwell. To obtain p-values for a pair of system outputs, run:
java edu.stanford.nlp.mt.tools.SignificanceTest [bleu|ter] reference_prefix system1 system2
where reference_prefix is a common filename prefix for multiple references.
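For example — the system output filenames here are hypothetical — comparing a baseline against an experimental system on newstest2012 with BLEU might look like:

java edu.stanford.nlp.mt.tools.SignificanceTest bleu $HOME/refs/newstest2012/ref newstest2012.baseline.trans newstest2012.experiment.trans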
Advanced Parameters
Additional phrase extraction options are passed via the OTHER_EXTRACT_OPTS in the vars file to edu.stanford.nlp.mt.train.PhraseExtract. See the usage and javadocs in that package for a description of the options.
Learning options are passed via ONLINE_OPTS in the vars file to edu.stanford.nlp.mt.tune.OnlineTuner. See the usage and javadocs in that package for a description of the options.
Phrasal decoder options are specified in the .ini file. The decoder supports popular functions such as forced decoding, dropping unknown words, larger beam sizes, etc. For the full list of options, see the javadocs in edu.stanford.nlp.mt.Phrasal.
MERT (Batch) Training
Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:
SEED -- Random seed for MERT
N_STARTING_POINTS -- Number of random starting points.
TUNE_NBEST -- n-best list size
OBJECTIVE -- objective function (e.g., bleu)
THREADS -- Number of MERT threads
OPT_FLAGS -- MERT configuration parameters (see scripts/phrasal-mert.pl for a full list of options)
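As a concrete sketch — the values below are illustrative, not recommended settings from the Phrasal authors — the uncommented batch block in the vars file might read:

SEED=1234
N_STARTING_POINTS=20
TUNE_NBEST=200
OBJECTIVE=bleu
THREADS=8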
Arabic-English Translation
Arabic-English is a common research language pair in the United States. The steps in this tutorial can be used to build an Ar-En system with the exception of Arabic pre-processing and segmentation. The Stanford NLP group provides a free tokenizer/segmenter for Arabic. Run it on the Arabic side of the parallel data and you should be ready to go.
A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.
Chinese-English Translation
Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a free tokenizer/segmenter for Chinese. Run it on the Chinese side of the parallel data and you should be ready to go.
A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.