The scripts here help organize and clean up the corpus used in the paper
"Tradition and Modernity in 20th Century Chinese Poetry"
by Rob Voigt and Dan Jurafsky (robvoigt, jurafsky @stanford.edu),
presented at the 2013 NAACL Workshop on Computational Linguistics for Literature.
First, download and unzip this file:
To build the corpus from scratch, run the following in the directory where you unzipped the file:
This creates a 'poems' directory, containing all the poems in the corpus
in UTF-8 encoding, numbered and named by author, and a 'meta' directory,
containing a small amount of metadata for each poet (active decade,
poetic 'school' affiliation, name in Chinese characters).
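Once the 'poems' directory exists, the files can be read back as UTF-8 text.
The following is a minimal sketch, not part of the scripts provided here: the
function name load_poems is hypothetical, and it assumes only that 'poems'
holds one plain file per poem. It does not rely on the file naming convention.

```python
from pathlib import Path


def load_poems(poems_dir="poems"):
    """Return a dict mapping each file name in poems_dir to its text.

    Hypothetical helper: assumes one UTF-8 encoded file per poem and
    skips anything that is not a regular file.
    """
    poems = {}
    for path in sorted(Path(poems_dir).iterdir()):
        if path.is_file():
            # The corpus is described as UTF-8, so decode explicitly
            # rather than relying on the platform default encoding.
            poems[path.name] = path.read_text(encoding="utf-8")
    return poems
```

Calling load_poems("poems") from the directory where the corpus was built
should return one entry per poem file, keyed by file name.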
The poem files are named according to the following convention:
and the metadata files are named according to the following convention:
Please let us know if you have any questions!