Hi there!

The scripts here help organize and clean up the corpus used in the paper:

   "Tradition and Modernity in 20th Century Chinese Poetry"
   presented at the 2013 NAACL Workshop on Computational Linguistics for Literature
   by Rob Voigt and Dan Jurafsky (robvoigt, jurafsky @stanford.edu)

First, download and unzip this file:


To build the corpus from scratch, run the following in the directory where you unzipped that file:

   wget http://www.shigeku.com/shiku/xs/xz/zhongguoshige.zip

This creates a 'poems' directory containing every poem in the corpus, UTF-8
encoded, numbered and named by author, and a 'meta' directory containing a
small amount of metadata for each poet (active decade, poetic 'school'
affiliation, and name in Chinese characters).
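
As a quick sanity check on the result, the poems can be read back into memory
with a few lines of Python. This is only a sketch and is not one of the scripts
in this package: the function name load_poems is hypothetical, and it assumes
nothing about the file names beyond their living directly inside 'poems'.

```python
from pathlib import Path


def load_poems(poems_dir="poems"):
    """Return a dict mapping each file name in poems_dir to its text.

    Hypothetical helper for illustration; not part of the original scripts.
    """
    poems = {}
    for path in sorted(Path(poems_dir).iterdir()):
        if path.is_file():
            # The corpus is described as UTF-8, so a UnicodeDecodeError here
            # would point to a file that was not converted.
            poems[path.name] = path.read_text(encoding="utf-8")
    return poems
```

Calling load_poems("poems") from the directory that holds the corpus should
return one entry per poem file; the same call with "meta" reads the metadata
files as raw text.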

The poem files are named according to the following convention:


and the metadata files are named according to the following convention:


Please let us know if you have any questions!