Hi there!

These scripts build and clean up the corpus used in the paper:

   "Tradition and Modernity in 20th Century Chinese Poetry"
   presented at the 2013 NAACL Workshop on Computational Linguistics for Literature
   by Rob Voigt and Dan Jurafsky (robvoigt, jurafsky @stanford.edu)
   http://stanford.edu/~jurafsky/traditionandmodernity.pdf

First, download and unzip this file:

   http://nlp.stanford.edu/robvoigt/chpoetry/shigeku_corpus_builder.zip

To build the corpus from scratch, run the following in the directory where you unzipped that file:

   wget http://www.shigeku.com/shiku/xs/xz/zhongguoshige.zip
   ./build_corpus.sh

This creates a 'poems' directory containing every poem in the corpus,
encoded in UTF-8 and numbered and named by author, and a 'meta' directory
containing a small amount of metadata for each poet (active decade, poetic
'school' affiliation, name in Chinese characters).
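
If you want to confirm that the build produced both directories, something
like the sketch below may help. It is not part of the distributed scripts:
the function name check_corpus is made up for illustration, and it assumes
that 'poems' and 'meta' sit directly under the directory you pass in
(defaulting to the current one).

```shell
# Sketch only, not part of the corpus builder. check_corpus is a
# hypothetical helper name; it assumes poems/ and meta/ live directly
# under the directory given as the first argument (default: ".").
check_corpus() {
    root=${1:-.}
    # Fail early if either expected directory is absent.
    for d in poems meta; do
        if [ ! -d "$root/$d" ]; then
            echo "missing directory: $root/$d" >&2
            return 1
        fi
    done
    # Count regular files; tr strips the padding some wc versions add.
    n_poems=$(find "$root/poems" -type f | wc -l | tr -d ' ')
    n_meta=$(find "$root/meta" -type f | wc -l | tr -d ' ')
    echo "poems: $n_poems"
    echo "meta: $n_meta"
}
```

Running check_corpus with no argument in the directory where you ran
build_corpus.sh prints the two file counts, or returns a non-zero status
if either directory is missing.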

The poem files are named according to the following convention:

   poems/[author]_[poem#]

and the metadata files are named according to the following convention:

   meta/[author].meta
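
The two conventions above can be tied together with a few lines of shell.
This is only a sketch: the function names and the author name 'someauthor'
are hypothetical, and it assumes the poem number is whatever follows the
last underscore in the file name.

```shell
# Sketch only: assumes the poem number follows the LAST underscore in the
# file name, so an author name may itself contain underscores.
# All function names and 'someauthor' are hypothetical.

# Print the author part of a poem path: poems/someauthor_12 -> someauthor
poem_author() {
    base=$(basename "$1")
    printf '%s\n' "${base%_*}"
}

# Print the poem number part of a poem path: poems/someauthor_12 -> 12
poem_number() {
    base=$(basename "$1")
    printf '%s\n' "${base##*_}"
}

# Print the metadata path that corresponds to a poem path.
poem_meta() {
    printf 'meta/%s.meta\n' "$(poem_author "$1")"
}
```

For example, poem_meta poems/someauthor_12 prints meta/someauthor.meta,
which is the file holding that poet's metadata.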

Please let us know if you have any questions!