
Phrasal User Guide

This guide explains how to set up and train a phrase-based Statistical Machine Translation system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools.

Phrasal is designed for fast training of large-scale, discriminative translation models with:

  • An intuitive feature engineering API
  • Large-scale learning with AdaGrad+FOBOS
  • Fast search with cube pruning
  • Unpruned language modeling with KenLM

This guide assumes some familiarity with Unix-like command-line interpreters (or shells), such as bash for Linux and Mac OS X and Cygwin for Microsoft Windows. The commands in this tutorial are written for bash, but it is relatively easy to adapt them to other shells.

For background information on statistical machine translation, please see the StatMT site or Philipp Koehn's textbook.

Phrasal is written for Java 1.6+. If your JVM is older than 1.6, then you should upgrade. Many of the utility scripts included with Phrasal assume Python 2.7 (they are not compatible with Python 3).
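You can check both versions from the shell:

java -version
python -V

The first command should report version 1.6 or later; the second should report a 2.7.x release.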

Questions about Phrasal should be directed to the JavaNLP mailing list.

If you use Phrasal for research purposes, then we ask that you cite these two papers:

@inproceedings{Cer2010,
 author = {Cer, Daniel and Galley, Michel and Jurafsky, Daniel and Manning, Christopher D.},
 title = {Phrasal: a toolkit for statistical machine translation with facilities for extraction and incorporation of arbitrary model features},
 booktitle = {Proceedings of the NAACL HLT 2010 Demonstration Session},
 year = {2010}
}
@InProceedings{Green2013,
 author    = {Green, Spence  and  Wang, Sida  and  Cer, Daniel  and  Manning, Christopher D.},
 title     = {Fast and Adaptive Online Training of Feature-Rich Translation Models},
 booktitle = {Proceedings of ACL},
 year      = {2013}
}

The first paper describes the core decoder. The second covers the parameter learning algorithm.

If you enable the lexicalized re-ordering model (as in the system described in this tutorial), then please cite:

@inproceedings{Galley2008,
 author = {Michel Galley and Christopher D. Manning},
 title = {A Simple and Effective Hierarchical Phrase Reordering Model},
 booktitle = {Proceedings of EMNLP},
 year = {2008}
}

Installation

Download the latest version of Phrasal and the latest version of Stanford CoreNLP.

The Phrasal download file has the format phrasal.ver.tar.gz, where ver is the current version number. Assume that you download to your home directory $HOME, here /home/me.

The CoreNLP download file has the format stanford-corenlp-full-date.zip where date is the release date.

Change to your download directory and unpack the archives:

cd $HOME
tar -xzf phrasal.ver.tar.gz
unzip stanford-corenlp-full-date.zip
cd phrasal.ver

Add the Phrasal and CoreNLP jar files to your Java CLASSPATH in the following order:

export CLASSPATH=$CLASSPATH:$HOME/stanford-corenlp-full-date/stanford-corenlp-ver.jar:$HOME/phrasal.ver/phrasal.ver.jar

Add the scripts directory to your shell PATH:

export PATH="$PATH":$HOME/phrasal.ver/scripts
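To verify the setup, confirm that the scripts directory is on your PATH and print the classpath:

which phrasal.sh
echo $CLASSPATH

The first command should print a path under $HOME/phrasal.ver/scripts.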

Now build the C++ code for KenLM:

cd $HOME/phrasal.ver/src-cc
./compile_JNI.sh

KenLM

To build language models, you'll need to download and install KenLM. Both the KenLM site and download package contain detailed installation instructions.

We will use lmplz later on to build language models.

Berkeley aligner

For word alignment, we recommend Berkeley Aligner. The download contains installation instructions.

Data Download and Pre-processing

Download the training, development, and test data from the WMT 2013 shared task site.

Unpack the files in a convenient location. Assume that we will build a system for French (fr) to English (en) translation, so we'll need the following files:

Parallel data:

  • europarl-v7.fr-en.en
  • europarl-v7.fr-en.fr

Monolingual data:

  • europarl-v7.en

Development data:

  • newstest2011.fr
  • newstest2011.en

Test data:

  • newstest2012.fr
  • newstest2012.en

Scripts:

  • tokenizer.perl
  • lowercase.perl

To prepare the data, first tokenize and lowercase the parallel, monolingual, development, and test data using the scripts, e.g.:

cat europarl-v7.fr-en.en | tokenizer.perl -l en | lowercase.perl > europarl-v7.fr-en.en.tok

Each processed file should have the .tok extension.
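Rather than repeating the pipeline by hand for each file, you can process everything with a short bash loop such as this sketch (it assumes all of the downloaded files are in the current directory and the WMT scripts are on your PATH):

for f in europarl-v7.fr-en.fr newstest2011.fr newstest2012.fr; do
  cat $f | tokenizer.perl -l fr | lowercase.perl > $f.tok
done
for f in europarl-v7.fr-en.en europarl-v7.en newstest2011.en newstest2012.en; do
  cat $f | tokenizer.perl -l en | lowercase.perl > $f.tok
done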

For the bitext, run the corpus cleaner included with Phrasal:

clean-corpus.py europarl-v7.fr-en.en.tok europarl-v7.fr-en.fr.tok

This command will produce two files:

  • europarl-v7.fr-en.en.tok.filt.gz
  • europarl-v7.fr-en.fr.tok.filt.gz

Word Alignment

Align the bitext with the Berkeley aligner (or another aligner like GIZA++). The Berkeley aligner distribution contains an example configuration file.

In either case, you'll need the A3 files created by the aligner for phrase extraction later on. Assume that you have these files:

training.fr-en.A3
training.en-fr.A3
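For reference, the Berkeley aligner is typically invoked on a configuration file, roughly as follows (a sketch only; the jar name, memory limit, and configuration file are placeholders that depend on your download):

java -server -mx10g -jar berkeleyaligner.jar ++word-align.conf

The configuration file controls the input bitext, output directory, and output format; adapt the example file in the distribution.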

Language Model Estimation

Estimate a 4-gram language model from both the monolingual data and the target side of the parallel data:

cat europarl-v7.fr-en.en.tok europarl-v7.en.tok | lmplz -o 4 > 4gm.arpa
build_binary trie 4gm.arpa 4gm.bin
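As a quick sanity check, you can score a sentence with KenLM's query tool, which is built alongside lmplz and build_binary (the path to the binary depends on your KenLM build):

echo "this is a test ." | query 4gm.bin

The output lists a log probability for each word and for the full sentence.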

System Training

Create a new directory for the MT system and copy over the system tuning files:

mkdir $HOME/phrasal.fr-en
cd $HOME/phrasal.fr-en
cp $HOME/phrasal.ver/example/example.vars fr-en.vars
cp $HOME/phrasal.ver/example/example.ini fr-en.ini
cp $HOME/phrasal.ver/example/example.binwts fr-en.initial.binwts

You might also want to copy over the language model:

cp 4gm.bin $HOME/phrasal.fr-en

Open the vars file and read the comments. If you've followed the instructions carefully until now, then all of the filenames and paths should match. However, you should verify the files and paths before proceeding.
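Assuming the vars file follows the VAR=value shell convention of example.vars, one quick check is to source it and list a few of the files it points to (variable names here follow the parameter lists later in this guide):

source fr-en.vars
ls $TUNE_SET $TUNE_REF

If ls reports a missing file, fix the corresponding path before proceeding.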

Tuning and evaluation consist of six steps, which are configured for convenience in a single script. To see the steps, run:

phrasal.sh

If your PATH is configured correctly, you should see the following output:

Usage: phrasal.sh var_file steps ini_file sys_name
Use dashes and commas in the steps specification e.g. 1-3,6
Step definitions:
 1  Extract phrases from dev set
 2  Run tuning
 3  Extract phrases from test set
 4  Decode test set
 5  Output results file
 6  Generate a learning curve from an online run

To train and evaluate a system, which we will call "baseline," run this command:

phrasal.sh fr-en.vars 1-6 fr-en.ini baseline
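The steps specification accepts dashes and commas, so you can also run subsets. For example, after tuning once you could re-run only the decoding and evaluation steps:

phrasal.sh fr-en.vars 4-5 fr-en.ini baseline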

An explanation of each step along with associated parameters in the vars file follows.

Extract phrases from dev set

Purpose: Extract translation rules from the parallel data for the development set.

Phrase extraction parameters:

EXTRACT_SET -- Specifies the bitext files and symmetrization heuristic (the default heuristic is grow-diag).
THREADS_EXTRACT -- Number of threads to use for phrase extraction.
MAX_PHRASE_LEN -- Maximum source phrase length.
OTHER_EXTRACT_OPTS -- Other options, described in edu.stanford.nlp.mt.train.PhraseExtract.
LO_ARGS -- Parameters for the lexicalized re-ordering model.

Dev set parameters:

TUNE_SET_NAME -- The name of the tuning/dev set (e.g., newstest2011).
TUNE_SET -- The actual filename of the dev set (e.g., newstest2011.fr.tok).
TUNE_REF -- The reference file of the dev set (e.g., newstest2011.en.tok).

Run tuning

Purpose: Estimate model weights from the dev set.

Parameters:

TUNE_MODE -- The tuning algorithm. Here we'll use online. The Advanced features section describes how to enable batch tuning with MERT.
INITIAL_WTS -- The initial weights file.
TUNE_NBEST -- The n-best list size.
ONLINE_OPTS -- Algorithm-specific options for the online tuner.

Extract phrases from test set

Purpose: Extract translation rules from the parallel data for the test set.

Parameters:

DECODE_SET_NAME -- Name of the test set (e.g., newstest2012).
DECODE_SET -- The actual filename of the test set (e.g., newstest2012.fr.tok).
NBEST -- The n-best list size to generate (optional).

Decode test set

Purpose: Decode (translate) the test set.

Output results file

Purpose: Evaluate translation quality using BLEU. This step outputs a .bleu file for the test set.

Parameters:

REFDIR -- The base path of the reference directory.

The reference directory must have a particular format. For example, here is how to make a reference directory for newstest2012:

mkdir -p $HOME/refs/newstest2012
cp newstest2012.en.tok $HOME/refs/newstest2012/ref0

The reference filenames must have the prefix ref. Multiple references can be specified by naming the files ref0, ref1, ref2, etc.
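For example, to register a hypothetical second reference file newstest2012.en.alt.tok:

cp newstest2012.en.tok $HOME/refs/newstest2012/ref0
cp newstest2012.en.alt.tok $HOME/refs/newstest2012/ref1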

Generate a learning curve from an online run

Purpose: Evaluate weight vectors produced during each training iteration on the test set.

This step outputs a .learn-curve file for the test set. This file can be used to identify the weight vector that generalizes best from those generated by the tuning algorithm.

Troubleshooting

The phrasal.sh script generates three log files:

  • $HOME/phrasal.fr-en/newstest2011.baseline.online.log
  • $HOME/phrasal.fr-en/logs/newstest2011.baseline.stdout.log
  • $HOME/phrasal.fr-en/logs/newstest2012.newstest2011.baseline.stdout.log

The first log contains output from step 2 (tuning). You can see the tuning objective function score by searching for "BLEU", e.g.:

grep BLEU newstest2011.baseline.online.log

BLEU scores should increase from one epoch to the next.

The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.
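A quick way to surface these errors is to grep all of the logs at once:

grep -n Exception $HOME/phrasal.fr-en/logs/*.stdout.log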

Advanced/Additional Features

This section describes features for advanced users who wish to maximize translation quality.

Feature Engineering API

Phrasal contains a very intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. Unlike other MT systems like cdec and Moses, Phrasal loads new feature templates via reflection, so you don't need to recompile the whole system every time you change a feature.

The Feature API Tutorial (see the /doc directory in the download package) describes the interfaces in the API, walks through example features, and provides tips for writing feature templates. MT features are tricky: bad features can significantly reduce translation quality due to interaction with the approximate search algorithm.

Baseline "dense" features are located in edu.stanford.nlp.mt.decoder.feat.

Examples of discriminative "sparse" features can be found in edu.stanford.nlp.mt.decoder.feat.sparse.

Feature templates are loaded via reflection; declare them in the .ini file as follows:

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.sparse.MyFeature()
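Since featurizers are loaded by reflection, a new feature class only needs to be compiled and placed on the classpath; no rebuild of Phrasal is required. A minimal sketch, with hypothetical paths and the placeholder class above:

javac -cp $CLASSPATH -d $HOME/myfeatures MyFeature.java
export CLASSPATH=$CLASSPATH:$HOME/myfeatures

The -d flag writes the class file into the package directory structure that the reflective lookup expects.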

Decoder Parameters

Phrasal decoder options are specified in the .ini file. The decoder supports popular features such as force decoding, dropping unknown words, larger beam sizes, etc.

For the full list of options, see the javadocs in edu.stanford.nlp.mt.Phrasal.

Online Learning Parameters

See the usage and javadocs in edu.stanford.nlp.mt.tune.OnlineTuner for a description of the parameters supported by the online tuning algorithm.

MERT (Batch) Training

Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the following batch parameters:

SEED -- Random seed for MERT.
N_STARTING_POINTS -- Number of random starting points.
TUNE_NBEST -- n-best list size.
OBJECTIVE -- Objective function (e.g., bleu).
THREADS -- Number of MERT threads.
OPT_FLAGS -- MERT configuration parameters (see scripts/phrasal-mert.pl for a full list of options).
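A hypothetical excerpt of the batch section of a vars file might look like the following; the values are illustrative only, so consult example.vars and scripts/phrasal-mert.pl for the real defaults:

SEED=1234
N_STARTING_POINTS=20
TUNE_NBEST=100
OBJECTIVE=bleu
THREADS=4
# OPT_FLAGS: see scripts/phrasal-mert.pl for the available options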

Arabic-English Translation

Arabic-English is a common research language pair in the United States. The steps in this tutorial can be used to build an Ar-En system with the exception of Arabic pre-processing and segmentation. The Stanford NLP group provides a free tokenizer/segmenter for Arabic. Run it on the Arabic side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.

Chinese-English Translation

Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a free tokenizer/segmenter for Chinese. Run it on the Chinese side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.