Chinese Natural Language Processing and Speech Processing
We work on a wide range of research problems in Chinese natural language processing and speech processing, including word segmentation, part-of-speech tagging, syntactic and semantic parsing, machine translation, disfluency detection, prosody, and other areas. We provide software for Chinese word segmentation, Chinese parsing, and Chinese part-of-speech tagging.
More details on each topic:
Chinese Word Segmentation
Our Chinese word segmenter relies on a linear-chain conditional random field (CRF) model, which treats word segmentation as a binary decision task. It uses three categories of features: character identity n-grams, morphological features, and character reduplication features. As shown in the figure on the right, it also exploits lexicons and proper noun features to improve segmentation consistency, which is beneficial in tasks such as machine translation (MT) and information retrieval. In (Chang et al., 2008), we show that this increase in consistency yields an improvement of 0.32 BLEU points on a standard test set, and an additional improvement of 0.73 BLEU points when segmentation granularity is optimized for the MT task.
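To make the "binary decision" framing concrete, here is a minimal sketch of how per-character labels map to words and what character-level features might look like. It is not the segmenter's actual code: the label convention (1 means the character starts a new word), the feature names, and the helper functions are assumptions chosen for illustration, and no CRF training or decoding is shown.

```python
from typing import Dict, List


def labels_to_words(chars: str, starts: List[int]) -> List[str]:
    """Rebuild words from per-character binary labels.

    Assumed convention: 1 = this character starts a new word,
    0 = this character continues the current word.
    """
    words: List[str] = []
    for ch, is_start in zip(chars, starts):
        if is_start or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def char_features(chars: str, i: int) -> Dict[str, object]:
    """Illustrative features for position i (hypothetical feature names):
    character identity unigrams and bigrams, plus a reduplication flag."""
    prev_ch = chars[i - 1] if i > 0 else "<s>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {
        "c0": chars[i],
        "c-1": prev_ch,
        "c+1": next_ch,
        "c-1c0": prev_ch + chars[i],
        "c0c+1": chars[i] + next_ch,
        "redup": prev_ch == chars[i],  # e.g. the second character of 看看
    }


# Toy sentence with hand-assigned labels, not model output.
sentence = "我们喜欢看看书"
labels = [1, 0, 1, 0, 1, 0, 1]
print(labels_to_words(sentence, labels))
print(char_features(sentence, 5))
```

In a real linear-chain CRF, feature dictionaries like these would be scored jointly over the whole character sequence rather than one position at a time.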
Parsing and Grammatical Relations
The Chinese parser is based on the ACL 2003 paper:
Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003.
In addition to PCFG parsing, the Stanford Chinese parser can also output
a set of Chinese grammatical relations that describes more
semantically abstract relations between words. An example Chinese sentence looks like:
Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features.
Part-of-Speech Tagging
The Stanford part-of-speech tagger takes word-segmented Chinese text as input and assigns a part of speech, such as noun or verb, to each word (and other tokens). This Chinese POS tagger is designed for LDC-style word-segmented text, and adopts a subset of the features from:
Huihsin Tseng, Daniel Jurafsky, Christopher Manning. 2005. Morphological features help POS tagging of unknown words across language varieties.
Its overall accuracy is 93.65%, and its accuracy on unknown words is 84.84%.
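The two figures above are conventionally computed as token-level accuracy over all words and over the subset of words not seen in training. The following sketch shows that calculation under stated assumptions: the function name, the toy data, and the definition of "unknown" as "absent from a given training vocabulary" are illustrative and are not taken from the tagger's evaluation code.

```python
from typing import List, Set, Tuple

TaggedWord = Tuple[str, str]  # (word, tag)


def tagging_accuracy(
    gold: List[TaggedWord],
    pred: List[TaggedWord],
    known_words: Set[str],
) -> Tuple[float, float]:
    """Return (overall accuracy, unknown-word accuracy).

    A word counts as unknown when it is not in known_words
    (assumed here to be the training vocabulary).
    """
    total = correct = unk_total = unk_correct = 0
    for (word, gold_tag), (_, pred_tag) in zip(gold, pred):
        hit = gold_tag == pred_tag
        total += 1
        correct += hit
        if word not in known_words:
            unk_total += 1
            unk_correct += hit
    overall = correct / total if total else 0.0
    unknown = unk_correct / unk_total if unk_total else 0.0
    return overall, unknown


# Toy example with hand-written tags, not real tagger output.
gold = [("我", "PN"), ("喜欢", "VV"), ("读", "VV"), ("书", "NN")]
pred = [("我", "PN"), ("喜欢", "VV"), ("读", "NN"), ("书", "NN")]
vocab = {"我", "喜欢"}
overall, unknown = tagging_accuracy(gold, pred, vocab)
print(overall, unknown)
```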
Named Entity Recognition
We have done extensive research on improving Chinese NER performance
using semi-supervised learning methods with bilingual parallel text.
Our results yield significant (~3% F1) improvements over strong CRF baselines
that are enhanced with distributional similarity features.
Mengqiu Wang and Christopher D. Manning. 2013. Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning. Transactions of ACL 2013.
Software Instructions:
Follow these instructions to reproduce experiments reported in these papers.
Our Chinese speech research has focused on areas like the study and detection of disfluencies (filled pauses like uh and word fragments), prosody, and the detection of speech acts.
Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning
Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers
Named Entity Recognition with Bilingual Constraints
Discriminative Reordering with Chinese Grammatical Relations Features
Disambiguating "DE" for Chinese-English Machine Translation
Optimizing Chinese Word Segmentation for Machine Translation Performance
Stanford University's Chinese-to-English Statistical Machine Translation System for the 2008 NIST Evaluation
Detection of Word Fragments in Mandarin Telephone Conversation
A Conditional Random Field Word Segmenter for SIGHAN Bakeoff 2005
Morphological features help POS tagging of unknown words across language varieties
Accent Detection and Speech Recognition for Shanghai-Accented Mandarin
A preliminary study of Mandarin filled pauses
Detection of Questions in Chinese Conversation
Is it harder to parse Chinese, or the Chinese Treebank?