Hi there!
The scripts here help organize and clean up the corpus used in the paper:
"Tradition and Modernity in 20th Century Chinese Poetry"
presented at the 2013 NAACL Workshop on Computational Linguistics for Literature
by Rob Voigt and Dan Jurafsky (robvoigt, jurafsky @stanford.edu)
http://stanford.edu/~jurafsky/traditionandmodernity.pdf
First, download and unzip this file:
http://nlp.stanford.edu/robvoigt/chpoetry/shigeku_corpus_builder.zip
To build the corpus from scratch, run the following in the directory where you unzipped that file:
wget http://www.shigeku.com/shiku/xs/xz/zhongguoshige.zip
./build_corpus.sh
This creates a 'poems' directory, containing all the poems in the corpus
in UTF-8 encoding, numbered and named by author, and a 'meta' directory,
containing a small amount of metadata for each poet (active decade,
poetic 'school' affiliation, name in Chinese characters).
The poem files are named according to the following convention:
poems/[author]_[poem#]
and the metadata files are named according to the following convention:
meta/[author].meta
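As an illustration of the naming conventions above, here is a minimal Python
sketch for walking the built corpus. It is not part of the distributed
scripts. The function names (split_poem_name, index_corpus, meta_path,
read_poem) are hypothetical, and the sketch assumes that the poem number is
whatever follows the last underscore in a file name and that poem files carry
no extension. It does not parse the .meta files, since their internal layout
is not described here.

```python
from collections import defaultdict
from pathlib import Path


def split_poem_name(filename):
    """Split '[author]_[poem#]' into (author, poem_number_string)."""
    # Assumption: the poem number is everything after the LAST underscore,
    # so an author name that itself contains underscores is kept whole.
    author, sep, number = filename.rpartition("_")
    if not sep or not author or not number:
        raise ValueError(f"unexpected poem file name: {filename!r}")
    return author, number


def index_corpus(root="."):
    """Map each author to a list of poem paths found under root/poems."""
    poems_by_author = defaultdict(list)
    # Paths are visited in lexicographic order, so poem 10 sorts before poem 2.
    for path in sorted(Path(root, "poems").iterdir()):
        if not path.is_file():
            continue
        try:
            author, _ = split_poem_name(path.name)
        except ValueError:
            continue  # skip stray files that do not follow the convention
        poems_by_author[author].append(path)
    return dict(poems_by_author)


def meta_path(author, root="."):
    """Return the expected metadata path, meta/[author].meta."""
    return Path(root, "meta", f"{author}.meta")


def read_poem(path):
    """Read one poem as text; the poem files are UTF-8 encoded."""
    return Path(path).read_text(encoding="utf-8")
```

Calling index_corpus() from the directory that contains 'poems' and 'meta'
returns a dictionary keyed by author; passing one of those keys to meta_path()
gives the location where that poet's metadata file is expected.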
Please let us know if you have any questions!