edu.stanford.nlp.trees.international.pennchinese
Class ChineseUtils
java.lang.Object
edu.stanford.nlp.trees.international.pennchinese.ChineseUtils
public class ChineseUtils
- extends Object
This class contains a few String constants and
static methods for dealing with Chinese text.
Warning: The code contains a version that uses the codePoint methods
to handle full Unicode, but it seems to trigger some bugs in
Sun's JDK 1.5; it works correctly with JDK 1.6. That version is
disabled by default, and a version that handles only BMP characters is used instead. The
latter prints a warning message if it sees a high-surrogate character.
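The difference between the two versions mentioned in the warning can be illustrated with a minimal sketch. It is not the code of ChineseUtils: the class name CodePointSketch and its methods are hypothetical, and only standard java.lang.Character and String calls are used. One method walks the string by UTF-16 code unit and spots high surrogates (the situation in which the BMP-only version would warn); the other walks it by full code point.

```java
// Hypothetical illustration only; not part of ChineseUtils.
final class CodePointSketch {

    /** Counts UTF-16 code units that are high surrogates (the first half of a non-BMP character). */
    static int countHighSurrogates(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isHighSurrogate(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /** Counts full Unicode code points, treating each surrogate pair as one character. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two BMP characters around U+20000, which needs a surrogate pair in UTF-16.
        String s = "\u4E2D\uD840\uDC00\u6587";
        System.out.println(s.length());             // 4 code units
        System.out.println(countCodePoints(s));     // 3 code points
        System.out.println(countHighSurrogates(s)); // 1 high surrogate
    }
}
```

A char-based loop sees the non-BMP character as two separate units, which is why a BMP-only routine can at best detect it and warn rather than transform it.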
- Author:
- Christopher Manning
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ONEWHITE
public static final String ONEWHITE
- See Also:
- Constant Field Values
WHITE
public static final String WHITE
- See Also:
- Constant Field Values
WHITEPLUS
public static final String WHITEPLUS
- See Also:
- Constant Field Values
NUMBERS
public static final String NUMBERS
- See Also:
- Constant Field Values
MID_DOT_REGEX_STR
public static final String MID_DOT_REGEX_STR
- See Also:
- Constant Field Values
LEAVE
public static final int LEAVE
- See Also:
- Constant Field Values
ASCII
public static final int ASCII
- See Also:
- Constant Field Values
NORMALIZE
public static final int NORMALIZE
- See Also:
- Constant Field Values
FULLWIDTH
public static final int FULLWIDTH
- See Also:
- Constant Field Values
DELETE
public static final int DELETE
- See Also:
- Constant Field Values
DELETE_EXCEPT_BETWEEN_ASCII
public static final int DELETE_EXCEPT_BETWEEN_ASCII
- See Also:
- Constant Field Values
MAX_LEGAL
public static final int MAX_LEGAL
- See Also:
- Constant Field Values
isNumber
public static boolean isNumber(char c)
normalize
public static String normalize(String in)
normalize
public static String normalize(String in,
int ascii,
int spaceChar)
normalize
public static String normalize(String in,
int ascii,
int spaceChar,
int midDot)
- Normalizes a Unicode String in various ways. This routine
correctly handles characters outside the Basic Multilingual Plane.
- Parameters:
in
- The String to be normalized.
ascii
- For characters conceptually in the ASCII range of
! through ~ (U+0021 through U+007E or U+FF01 through U+FF5E):
if this is ChineseUtils.LEAVE, do nothing;
if it is ASCII, map them from the Chinese full-width range
to ASCII values; and if it is FULLWIDTH, do the reverse.
spaceChar
- For characters that satisfy Character.isSpaceChar():
if this is ChineseUtils.LEAVE, do nothing;
if it is ASCII, map them to the space character U+0020; and
if it is FULLWIDTH, map them to U+3000.
midDot
- For middle dot characters:
if this is ChineseUtils.LEAVE, do nothing;
if it is NORMALIZE, map them to the extended Latin character U+00B7; and
if it is FULLWIDTH, map them to U+30FB.
- Returns:
- The in String normalized according to the other arguments.
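The ascii and spaceChar mappings described above can be sketched in a few lines. This is an independent illustration, not the implementation in ChineseUtils: the class WidthSketch and the enum Mode are hypothetical stand-ins for the int constants LEAVE, ASCII, and FULLWIDTH, and middle dot handling is omitted. The sketch relies only on the fact that U+FF01 through U+FF5E sit at a fixed offset of 0xFEE0 from U+0021 through U+007E.

```java
// Hypothetical illustration only; not the ChineseUtils implementation.
final class WidthSketch {

    /** Stand-in for the int mode constants of the documented API. */
    enum Mode { LEAVE, ASCII, FULLWIDTH }

    /** Distance between the full-width forms block and printable ASCII. */
    private static final int OFFSET = 0xFF01 - 0x0021; // 0xFEE0

    static String normalize(String in, Mode ascii, Mode spaceChar) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);      // read a whole code point
            i += Character.charCount(cp);    // advance by 1 or 2 code units
            if (Character.isSpaceChar(cp)) {
                if (spaceChar == Mode.ASCII) {
                    cp = 0x0020;             // plain space
                } else if (spaceChar == Mode.FULLWIDTH) {
                    cp = 0x3000;             // ideographic space
                }
            } else if (cp >= 0xFF01 && cp <= 0xFF5E) {
                if (ascii == Mode.ASCII) {
                    cp -= OFFSET;            // full-width to ASCII
                }
            } else if (cp >= 0x0021 && cp <= 0x007E) {
                if (ascii == Mode.FULLWIDTH) {
                    cp += OFFSET;            // ASCII to full-width
                }
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "AB", an ideographic space, and a full-width "1".
        String fullWidth = "\uFF21\uFF22\u3000\uFF11";
        System.out.println(normalize(fullWidth, Mode.ASCII, Mode.ASCII)); // AB 1
    }
}
```

Iterating by code point rather than by char keeps characters outside the Basic Multilingual Plane intact, since a surrogate pair is copied through as a single unit.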
main
public static void main(String[] args)
throws IOException
- Mainly for testing. Usage:
ChineseUtils ascii spaceChar word*
ascii and spaceChar are integers: 0 = leave, 1 = ascii, 2 = fullwidth.
The words listed are then normalized and sent to stdout.
If no words are given, the program reads from and normalizes stdin.
Input is assumed to be in UTF-8.
- Parameters:
args
- Command line arguments as above
- Throws:
IOException
- If there are any problems accessing command-line files
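The input handling described for main (words after the two mode arguments, otherwise UTF-8 text from stdin) might look roughly like the sketch below. It is an assumption-laden illustration, not the actual main method: the class MainSketch and the helper collectInput are hypothetical, and the normalization call itself is left as a comment so the example runs without the library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the ChineseUtils main method.
final class MainSketch {

    /**
     * Returns the words that follow the two mode arguments, or, if there
     * are none, the lines of the given stream decoded as UTF-8.
     */
    static List<String> collectInput(String[] args, InputStream stdin) {
        if (args.length > 2) {
            return new ArrayList<>(Arrays.asList(args).subList(2, args.length));
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stdin, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signatures simple.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: MainSketch ascii spaceChar word*");
            return;
        }
        int ascii = Integer.parseInt(args[0]);     // 0 = leave, 1 = ascii, 2 = fullwidth
        int spaceChar = Integer.parseInt(args[1]); // same encoding as ascii
        for (String item : collectInput(args, System.in)) {
            // With the library on the classpath, this is where
            // ChineseUtils.normalize(item, ascii, spaceChar) would be applied.
            System.out.println(item);
        }
    }
}
```

Decoding stdin with an explicit UTF-8 charset, rather than the platform default, matches the stated assumption about the input encoding.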
shapeOf
public static String shapeOf(String input,
boolean augmentedDateChars,
boolean useMidDotShape)
Stanford NLP Group