Neural Machine Translation

This page contains information about the latest research on neural machine translation (NMT) at the Stanford NLP Group. We release our codebase, which produces state-of-the-art results on various translation tasks such as English-German and English-Czech. In addition, to encourage reproducibility and increase transparency, we release the preprocessed data used to train our models as well as pretrained models that are readily usable with our codebase.

People

Codebase

For hybrid NMT, please use this codebase and cite:

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Association for Computational Linguistics (ACL). [pdf] [bib]

For general attention-based NMT, please use this codebase and cite:

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP). [pdf] [bib]

For pruning NMT, please cite (and let us know if you are interested in our code!):

Abigail See*, Minh-Thang Luong*, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. In Computational Natural Language Learning (CoNLL). [pdf] [bib]

Preprocessed Data

Pretrained Models

We release pretrained models that are readily usable with our Matlab code.
Note: a GPU device is required to use these models. To convert them for use on a CPU, consider this script.
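The released conversion script is linked above; purely as an illustration, the following is a minimal Matlab sketch of what such a GPU-to-CPU conversion typically involves, assuming the model is stored as a .mat file whose parameter fields may be gpuArray objects. The function and file names here are hypothetical and are not part of our release.

% Hypothetical sketch (not the released script): make a GPU-trained model
% loadable on a CPU-only machine by copying every gpuArray field back to
% host memory with gather().
function convert_model_to_cpu(inFile, outFile)
  model = load(inFile);          % struct of model parameters
  model = gather_struct(model);  % move gpuArray fields to host memory
  save(outFile, '-struct', 'model', '-v7.3');
end

function s = gather_struct(s)
  % Recursively replace gpuArray values with ordinary CPU arrays.
  fields = fieldnames(s);
  for i = 1:numel(fields)
    v = s.(fields{i});
    if isa(v, 'gpuArray')
      s.(fields{i}) = gather(v);         % copy from GPU to CPU
    elseif isstruct(v)
      s.(fields{i}) = gather_struct(v);  % recurse into nested structs
    elseif iscell(v)
      for j = 1:numel(v)
        if isa(v{j}, 'gpuArray'), v{j} = gather(v{j}); end
      end
      s.(fields{i}) = v;
    end
  end
end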

Contact Information

For any comments or questions, please email the first author.