This page contains information about the latest research on neural machine translation (NMT) at the Stanford NLP Group. We release our codebase, which produces state-of-the-art results on various translation tasks such as English-German and English-Czech. In addition, to encourage reproducibility and increase transparency, we release the preprocessed data used to train our models as well as pretrained models that are readily usable with our codebase.
For hybrid NMT, please use this codebase and cite:
Minh-Thang Luong and Christopher D. Manning. 2016. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Association for Computational Linguistics (ACL). [pdf] [bib]
For general attention-based NMT, please use this codebase and cite:
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP). [pdf] [bib]
For pruning NMT, please cite (and let us know if you are interested in our code!):
Abigail See*, Minh-Thang Luong*, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. In Computational Natural Language Learning (CoNLL). [pdf] [bib]
WMT'15 English-Czech data:
Train (15.8M sentence pairs): [train.en] [train.cs]
Test: [newstest2013.en] [newstest2013.cs] [newstest2014.en] [newstest2014.cs] [newstest2015.en] [newstest2015.cs]
Word vocabularies (top frequent words; see the rebuilding sketch below): [vocab.1K.en] [vocab.1K.cs] [vocab.10K.en] [vocab.10K.cs] [vocab.20K.en] [vocab.20K.cs] [vocab.50K.en] [vocab.50K.cs]
Dictionary (extracted from alignment data): [dict.en-cs]
Character vocabularies: [vocab.char.200.en] [vocab.char.200.cs]
Note: we used this dataset in our ACL'16 paper [bib].
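The vocabulary files are plain frequency-ranked token lists, one token per line. The following is a minimal sketch of how such a top-K vocabulary can be rebuilt from the training data, not the original preprocessing script; writing the special tokens <unk>, <s>, </s> first is an assumption that should be checked against the released files.

from collections import Counter

def build_vocab(corpus_path, vocab_path, top_k):
    """Write the top_k most frequent whitespace-separated tokens, one per line."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        for tok in ["<unk>", "<s>", "</s>"]:  # assumed header tokens
            out.write(tok + "\n")
        for tok, _ in counts.most_common(top_k):
            out.write(tok + "\n")

# For example, to recreate a 50K English vocabulary:
# build_vocab("train.en", "vocab.50K.en", 50000)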
WMT'14 English-German data:
Train (4.5M sentence pairs): [train.en] [train.de]
Test: [newstest2012.en] [newstest2012.de] [newstest2013.en] [newstest2013.de] [newstest2014.en] [newstest2014.de] [newstest2015.en] [newstest2015.de]
Vocabularies (top 50K frequent words): [vocab.50K.en] [vocab.50K.de]
Dictionary (extracted from alignment data): [dict.en-de]
Note: we used this dataset in our EMNLP'15 paper [bib]. Also, for historical reasons, we split compound words, e.g., "rich-text format" --> "rich ##AT##-##AT## text format"; a sketch of this transformation follows below.
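The ##AT## convention keeps hyphenated compounds reversible: each intra-word hyphen becomes a standalone ##AT##-##AT## token, so the original text can be recovered after translation. Here is a small sketch reconstructed from the example above (a regex-based approximation, not the original preprocessing script):

import re

def split_compounds(line):
    # Turn every hyphen flanked by word characters into a separate token.
    return re.sub(r"(?<=\w)-(?=\w)", " ##AT##-##AT## ", line)

def join_compounds(line):
    # Invert the transformation.
    return line.replace(" ##AT##-##AT## ", "-")

assert split_compounds("rich-text format") == "rich ##AT##-##AT## text format"
assert join_compounds("rich ##AT##-##AT## text format") == "rich-text format"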
IWSLT'15 English-Vietnamese data:
Train (133K sentence pairs): [train.en] [train.vi]
Test: [tst2012.en] [tst2012.vi] [tst2013.en] [tst2013.vi]
Vocabularies (top 50K frequent words): [vocab.en] [vocab.vi]
Dictionary (extracted from alignment data; see the extraction sketch below): [dict.en-vi]
Note: we used this dataset in our IWSLT'15 paper [bib].
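Each dictionary above was extracted from word-aligned training data. As a hedged sketch of one way to do this, assuming Pharaoh-style "i-j" alignment links in a hypothetical train.align file and keeping, for each source word, the target word it aligns to most often (the released dict files may use a different format or scoring):

from collections import Counter, defaultdict

def extract_dictionary(src_path, tgt_path, align_path, dict_path):
    cooc = defaultdict(Counter)  # source word -> counts of aligned target words
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(align_path, encoding="utf-8") as fa:
        for src, tgt, align in zip(fs, ft, fa):
            s, t = src.split(), tgt.split()
            for pair in align.split():  # Pharaoh format: "i-j" (assumed)
                i, j = map(int, pair.split("-"))
                if i < len(s) and j < len(t):
                    cooc[s[i]][t[j]] += 1
    with open(dict_path, "w", encoding="utf-8") as out:
        for src_word, counts in sorted(cooc.items()):
            tgt_word, _ = counts.most_common(1)[0]
            out.write(src_word + " " + tgt_word + "\n")

# Hypothetical usage; "train.align" is not part of this release:
# extract_dictionary("train.en", "train.vi", "train.align", "dict.en-vi")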
For any comments or questions, please email the first author.