Hi there!

This page has the data and annotations from the paper:

   Rob Voigt and Dan Jurafsky, 2015.
   The Users Who Say "Ni": Audience Identification in Chinese-language Restaurant Reviews.
   Proceedings of ACL - Short Papers.

The data is available here.

It includes more than 300k reviews from 5,986 restaurants on the Chinese-language review site Dianping. The data is formatted as a UTF-8 JSON dictionary whose top-level keys are shop IDs. For example, the shop-level subdictionary with the key "585181" contains data from the page http://www.dianping.com/shop/585181 and the associated reviews.
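
As a rough illustration of the layout described above, here is a minimal Python sketch for loading the file and mapping a shop ID back to its page. The function names load_shops and shop_url are made up for this example, and the path you pass in is whatever you saved the download as; neither is part of the data release.

```python
import json


def load_shops(path):
    """Load the top-level dictionary that maps shop IDs to shop data.

    `path` is wherever you saved the download (the file name is an
    assumption of this sketch, not something fixed by the dataset).
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def shop_url(shop_id):
    """Build the Dianping page URL that a shop ID corresponds to."""
    return "http://www.dianping.com/shop/" + str(shop_id)
```

For instance, shop_url("585181") gives back the page address used in the example above, and the keys of the dictionary returned by load_shops are the shop IDs.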

Each restaurant has metadata regarding its average cost, location (city, district, address), overall ratings, and category. Each review has all available metadata: the user's name, ratings, raw text and word-segmented text of the review, and so on. When a piece of data was not available, it is left null. Ratings include overall ("star_rank"), service ("fuwu"), environment ("huanjing"), and flavor ("kouwei").
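
Since missing values are null, it can help to filter them out before doing anything numeric. The sketch below pulls the four ratings out of a single review dictionary. It assumes the rating keys sit directly on the review dictionary, which you should check against the file; the helper name available_ratings is hypothetical.

```python
# The four rating keys named in the description above.
RATING_KEYS = ("star_rank", "fuwu", "huanjing", "kouwei")


def available_ratings(review):
    """Return the ratings present in one review, dropping null values.

    Assumption: the rating keys are stored directly on the review
    dictionary. JSON null is read by Python as None.
    """
    return {k: review[k] for k in RATING_KEYS if review.get(k) is not None}
```

A review with a null or missing service rating simply comes back without the "fuwu" entry.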

All reviews with "ni" annotations as described in the paper have them as a list attached to the review with the key "annotations", where the list items are annotations for each "ni" in the text, in the order they appear. The annotations used in the paper are g, s, r, and w for generic, shop, reader, and writer, respectively. Additional annotations include i, n, and o for idiomatic, non-"you", and other.
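
To give a sense of how the annotations might be used, here is a small sketch that tallies annotation labels over a collection of reviews. It assumes each item in the "annotations" list is one of the single-letter codes listed above, and it takes the reviews as a plain iterable of dictionaries, since how you gather reviews from each shop depends on the file's layout. The names ANNOTATION_LABELS and count_annotations are made up for this example.

```python
from collections import Counter

# Letter codes and their meanings, as listed in the description above.
ANNOTATION_LABELS = {
    "g": "generic",
    "s": "shop",
    "r": "reader",
    "w": "writer",
    "i": "idiomatic",
    "n": "non-you",
    "o": "other",
}


def count_annotations(reviews):
    """Count annotation labels across an iterable of review dictionaries.

    Reviews without an "annotations" key (or with a null value) are skipped.
    Assumption: each list item is a single-letter code; anything else is
    counted under its raw value.
    """
    counts = Counter()
    for review in reviews:
        for code in review.get("annotations") or []:
            counts[ANNOTATION_LABELS.get(code, code)] += 1
    return counts
```

Because the list items follow the order of each "ni" in the text, the position of a code in the list tells you which occurrence it describes; the tally above ignores that ordering on purpose.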

If you have any questions, feel free to e-mail me!