Reid Pryzant

JESC

Japanese-English Subtitle Corpus

About
5/12/2019: new version, de-duplicated and slightly cleaner.


JESC aims to support the research and development of machine translation systems, information extraction, and other language processing techniques.

JESC is the product of a collaboration between Stanford University, Google Brain, and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available EN-JA corpora, and covers the poorly represented domain of colloquial language.

You can download the scripts, tools, and crawlers used to create this dataset on GitHub.

You can read the paper here.

These data are released under a Creative Commons (CC) license.

Contents
  • A large corpus consisting of 2.8 million sentences.
  • Translations of casual language, colloquialisms, expository writing, and narrative discourse. These domains are hard to find in existing JA-EN MT data.
  • Pre-processed data, including tokenized train/dev/test splits.
  • Code for making your own crawled datasets and tools for manipulating MT data.


Split   Phrase Pairs
Raw     2,801,388
Train   2,797,388
Dev     2,000
Test    2,000
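
As a rough sketch of how the splits might be consumed, the snippet below parses a few phrase pairs and checks that the split sizes in the table add up to the raw total. It assumes each line holds one English phrase and one Japanese phrase separated by a tab, which this page does not state. The helper `read_pairs` and the sample sentences are hypothetical and only for illustration.

```python
import io
from typing import Iterable, List, Tuple

# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 2_797_388, "dev": 2_000, "test": 2_000}
RAW_SIZE = 2_801_388


def read_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse phrase pairs.

    ASSUMPTION: one pair per line, formatted as 'english<TAB>japanese'.
    Lines that do not match this layout are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed format
        pairs.append((parts[0], parts[1]))
    return pairs


# Train, dev, and test together account for every raw phrase pair.
assert sum(SPLIT_SIZES.values()) == RAW_SIZE

# Tiny in-memory sample standing in for a real split file.
sample = io.StringIO("thank you\tありがとう\n\nsee you tomorrow\tまた明日\n")
pairs = read_pairs(sample)
print(len(pairs))
```

With a downloaded split, the same function could be applied to an open file handle, for example `read_pairs(open("train", encoding="utf-8"))`, where `train` is a placeholder file name rather than a documented path.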

Download

Cite
@ARTICLE{pryzant_jesc_2018,
   author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
    title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {Language Resources and Evaluation Conference (LREC)},
 keywords = {Computer Science - Computation and Language},
     year = 2018
}