Chinese Natural Language Processing and Speech Processing


We work on a wide variety of research in Chinese natural language processing and speech processing, including word segmentation, part-of-speech tagging, syntactic and semantic parsing, machine translation, disfluency detection, prosody, and other areas. We provide software for Chinese word segmentation, Chinese parsing, and Chinese part-of-speech tagging.

More details on each topic:

Chinese Word Segmentation

Our Chinese word segmenter relies on a linear-chain conditional random field (CRF) model, which treats word segmentation as a binary decision task: for each character, the model decides whether or not it starts a new word. It uses three categories of features: character identity n-grams, morphological features, and character reduplication features. As shown in the figure on the right, it also exploits lexicons and proper noun features to improve segmentation consistency, which is beneficial in tasks such as machine translation (MT) and information retrieval. In (Chang et al., 2008), we show that this increase in consistency yields an improvement of 0.32 BLEU points on a standard test set, and an additional improvement of 0.73 BLEU points when segmentation granularity is optimized for the MT task.
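
To make the binary-decision framing concrete, here is a minimal sketch in Python. It is not the segmenter's actual code: the function names, the feature names, and the hand-written labels are all assumptions for illustration, and a real CRF would learn weights over such features and predict the labels jointly rather than take them as input.

```python
from typing import Dict, List


def char_features(chars: str, i: int) -> Dict[str, object]:
    """Hypothetical features for the decision at chars[i] (names are illustrative)."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {
        "unigram": chars[i],               # character identity
        "bigram_prev": prev_c + chars[i],  # character identity bigram (left)
        "bigram_next": chars[i] + next_c,  # character identity bigram (right)
        "redup_prev": prev_c == chars[i],  # reduplication, as in 看看
    }


def decode(chars: str, starts_word: List[int]) -> List[str]:
    """Turn one binary label per character (1 = starts a new word) into words."""
    words: List[str] = []
    for ch, label in zip(chars, starts_word):
        if label == 1 or not words:
            words.append(ch)   # open a new word
        else:
            words[-1] += ch    # extend the current word
    return words


sentence = "我看看书"
labels = [1, 1, 0, 1]  # hand-written here; a trained model would predict these
print(decode(sentence, labels))                  # ['我', '看看', '书']
print(char_features(sentence, 2)["redup_prev"])  # True
```

Under this encoding, segmentation granularity is simply a matter of how many characters receive the label 1.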

Parsing and Grammatical Relations

The Chinese parser is based on the ACL 2003 paper:

Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003.

In addition to PCFG parsing, the Stanford Chinese parser can also output a set of Chinese grammatical relations that describes more semantically abstract relations between words. An example Chinese sentence looks like:

Details of the Chinese grammatical relations are in the 2009 SSST paper:
Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features.

Part-of-Speech Tagging

The Stanford part-of-speech tagger takes word-segmented Chinese text as input and assigns a part of speech, such as noun or verb, to each word (and other token). This Chinese POS tagger is designed for LDC-style word-segmented text, and adopts a subset of the features from:
Huihsin Tseng, Daniel Jurafsky, and Christopher Manning. 2005. Morphological features help POS tagging of unknown words across language varieties.

Its overall accuracy is 93.65% and the unknown word accuracy is 84.84%.
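
The two figures measure different things: overall accuracy is computed over every token, while unknown-word accuracy is computed only over tokens whose word form does not occur in the training data. The sketch below shows one way to compute both. It assumes that definition of "unknown", and the function name, the toy tokens, and the tags are made up for illustration; they are not the evaluation data behind the numbers above.

```python
from typing import Iterable, Set, Tuple

Token = Tuple[str, str, str]  # (word, gold_tag, predicted_tag)


def tagging_accuracy(tokens: Iterable[Token], train_vocab: Set[str]) -> Tuple[float, float]:
    """Return (overall accuracy, accuracy on words absent from train_vocab)."""
    total = correct = unk_total = unk_correct = 0
    for word, gold, pred in tokens:
        total += 1
        correct += gold == pred
        if word not in train_vocab:  # assumed definition of an unknown word
            unk_total += 1
            unk_correct += gold == pred
    overall = correct / total if total else 0.0
    unknown = unk_correct / unk_total if unk_total else 0.0
    return overall, unknown


# Toy example: two of the five words were never seen in training.
train_vocab = {"我", "喜欢", "课程"}
tokens = [
    ("我", "PN", "PN"),
    ("喜欢", "VV", "VV"),
    ("语言学", "NN", "NR"),  # unknown word, tagged incorrectly
    ("课程", "NN", "NN"),
    ("斯坦福", "NR", "NR"),  # unknown word, tagged correctly
]
print(tagging_accuracy(tokens, train_vocab))  # (0.8, 0.5)
```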

Named Entity Recognition

We have done extensive research on improving Chinese NER performance using semi-supervised learning methods with bilingual parallel text. Our results yield significant (~3% F1) improvements over strong CRF baselines that are enhanced with distributional similarity features.
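
For readers unfamiliar with the metric, the sketch below computes F1 in the conventional exact-match, entity-level way: a predicted entity counts as correct only if its span and its type both match a gold entity. This is an assumption about the scoring convention, not a description of the scripts used in the papers; the function name and the toy spans are illustrative.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match entity-level F1 between gold and predicted entity sets."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")}
pred = {(0, 2, "PER"), (5, 7, "ORG")}  # second span has the wrong type
print(round(entity_f1(gold, pred), 3))  # 0.4
```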

Details of the semi-supervised learning with bitext work can be found in the following papers:
Mengqiu Wang and Christopher D. Manning. 2013. Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning. Transactions of ACL 2013.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013. Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition. ACL 2013.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013. Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers. AAAI 2013 (Outstanding Paper Award Honorable Mention).
Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. 2013. Named Entity Recognition with Bilingual Constraints. NAACL 2013.
Software Instructions:
Follow these instructions to reproduce experiments reported in these papers.

Speech Processing

Our Chinese speech research has focused on areas such as the study and detection of disfluencies (filled pauses such as "uh", and word fragments), prosody, and the detection of speech acts.




