Tom Emerson on Word Segmentation in Korean






From: Tom Emerson <Tree@basistech.com>
To: "'cmanning@acm.org'" <cmanning@acm.org>, "'hinrich@hotmail.com'" <hinrich@hotmail.com>
Subject: Errata to the errata for FSNLP
Date: Thu, 9 Dec 1999 22:45:44 -0500

While reading over the errata on the book's website, I noticed that the correction to page 129, line 12, retracts the statement that Korean is a hard language to segment because it lacks interword spaces. In my humble opinion this retraction isn't called for. While Korean is indeed written with spaces, they are not necessarily placed at word boundaries.

There are two main components in Korean orthography: the eumjeol and the eojeol. An eumjeol can be thought of as a syllable, consisting of either a hanja (a Chinese character, now used only in South Korea: North Korean orthography does not use them) or a hangul syllable (hangul being the name of the Korean alphabet: a hangul syllable is composed of one to three jamo).
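The relationship between a hangul eumjeol and its jamo can be sketched with the standard Unicode arithmetic for precomposed hangul syllables. This is only an illustration, assuming text encoded in the Unicode range U+AC00 to U+D7A3; the function name `decompose` is hypothetical. Note that in this encoding every precomposed syllable yields two or three conjoining jamo, because a syllable-initial silent consonant is counted as a jamo.

```python
# Constants from the Unicode hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose(syllable: str) -> list[str]:
    """Split one precomposed hangul syllable into its conjoining jamo.

    Hypothetical helper for illustration; any other character
    (hanja, Latin letters, punctuation) is returned unchanged.
    """
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [syllable]
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    jamo = [lead, vowel]
    t_index = s_index % T_COUNT
    if t_index:  # index 0 means the syllable has no final consonant
        jamo.append(chr(T_BASE + t_index))
    return jamo


# Code points of the jamo in the first syllable of "hangul".
print([hex(ord(j)) for j in decompose("한")])
```

The result should agree with `unicodedata.normalize("NFD", ...)` from the Python standard library, which performs the same decomposition.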

An eojeol is a sequence of one or more eumjeol, and eojeol are separated from one another by spaces. An eojeol can represent a single inflected lexeme (Korean is quite agglutinative) or several lexemes. The placement of spaces in Korean is often more a matter of style than of morphology, and spaces often appear where a pause in speech would be heard. The components of an eojeol, the Korean analog of "morphemes", are called hyung-tae-so.
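A small sketch of why splitting on spaces is not the same as segmenting into morphemes. The example sentence and the hand-written analyses below are illustrative assumptions, not data from any real analyzer; the names `ANALYSES` and `morphemes` are hypothetical.

```python
# Illustrative sentence: "I went to school."
sentence = "나는 학교에 갔다"

# Splitting on whitespace yields eojeol, not morphemes.
eojeol = sentence.split()

# Hand-written morpheme analyses, for illustration only.
# In the last entry the stem and the past-tense suffix are fused
# in writing, so the boundary is not visible in the surface form.
ANALYSES = {
    "나는": ["나", "는"],        # pronoun + topic particle
    "학교에": ["학교", "에"],    # noun + locative particle
    "갔다": ["가", "았", "다"],  # verb stem + past tense + declarative ending
}


def morphemes(text: str) -> list[str]:
    """Look up each eojeol in the toy table; unknown eojeol pass through."""
    return [m for e in text.split() for m in ANALYSES.get(e, [e])]


print(len(eojeol), eojeol)
print(len(morphemes(sentence)), morphemes(sentence))
```

Three space-delimited eojeol correspond to seven morphemes here, and the last eojeol cannot be split by cutting the string at any character position.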

Writing a Korean morphological analyzer is a very difficult task, and most successful implementations are purely statistical in nature. Affixation can cause internal sound changes in the stem, which in turn lead to lexical ambiguity. Developing algorithms for Korean morphological analysis is an active area of research in both the private sector and academia.


Christopher Manning and Hinrich Schütze -- last modified 12 December 1999.