edu.stanford.nlp.trees.international.pennchinese
Class ChineseUtils

java.lang.Object
  extended by edu.stanford.nlp.trees.international.pennchinese.ChineseUtils

public class ChineseUtils
extends Object

This class contains a few String constants and static methods for dealing with Chinese text.

Warning: The code contains a variant that uses the codePoint methods to handle full Unicode, but it appears to trigger bugs in Sun's JDK 1.5; it works correctly with JDK 1.6. That variant is disabled by default, and a version that handles only BMP (Basic Multilingual Plane) characters is used instead. The BMP-only version prints a warning message if it encounters a high-surrogate character.
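
To illustrate the distinction the warning draws, the following is a minimal sketch, not taken from ChineseUtils itself, of how char-based and code-point-based traversal differ for a character outside the BMP. The class name SurrogateSketch and its two helper methods are hypothetical; only the java.lang.Character and String calls are standard JDK API.

```java
// Hypothetical illustration class; it is not part of the Stanford NLP code base.
final class SurrogateSketch {

    /** Returns true if the string contains a high-surrogate char, i.e. the start of a non-BMP character. */
    static boolean hasHighSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isHighSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Counts Unicode code points by stepping with codePointAt and charCount. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String bmpOnly = "\u4E2D\u6587";            // two BMP characters
        String supplementary = "\uD840\uDC0B";      // U+2000B: one character stored as two chars
        System.out.println(bmpOnly.length() + " chars, " + countCodePoints(bmpOnly) + " code points");
        System.out.println(supplementary.length() + " chars, " + countCodePoints(supplementary) + " code points");
        System.out.println("high surrogate seen: " + hasHighSurrogate(supplementary));
    }
}
```

A BMP-only routine that walks the string one char at a time sees the two halves of U+2000B separately, which is the situation in which a warning about a high-surrogate character would be appropriate.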

Author:
Christopher Manning

Field Summary
static int ASCII
static int DELETE
static int DELETE_EXCEPT_BETWEEN_ASCII
static int FULLWIDTH
static int LEAVE
static int MAX_LEGAL
static String MID_DOT_REGEX_STR
static int NORMALIZE
static String NUMBERS
static String ONEWHITE
static String WHITE
static String WHITEPLUS

Method Summary
static boolean isNumber(char c)
static void main(String[] args)
          Mainly for testing.
static String normalize(String in)
static String normalize(String in, int ascii, int spaceChar)
static String normalize(String in, int ascii, int spaceChar, int midDot)
          This will normalize a Unicode String in various ways.
static String shapeOf(String input, boolean augmentedDateChars, boolean useMidDotShape)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ONEWHITE

public static final String ONEWHITE
See Also:
Constant Field Values

WHITE

public static final String WHITE
See Also:
Constant Field Values

WHITEPLUS

public static final String WHITEPLUS
See Also:
Constant Field Values

NUMBERS

public static final String NUMBERS
See Also:
Constant Field Values

MID_DOT_REGEX_STR

public static final String MID_DOT_REGEX_STR
See Also:
Constant Field Values

LEAVE

public static final int LEAVE
See Also:
Constant Field Values

ASCII

public static final int ASCII
See Also:
Constant Field Values

NORMALIZE

public static final int NORMALIZE
See Also:
Constant Field Values

FULLWIDTH

public static final int FULLWIDTH
See Also:
Constant Field Values

DELETE

public static final int DELETE
See Also:
Constant Field Values

DELETE_EXCEPT_BETWEEN_ASCII

public static final int DELETE_EXCEPT_BETWEEN_ASCII
See Also:
Constant Field Values

MAX_LEGAL

public static final int MAX_LEGAL
See Also:
Constant Field Values

Method Detail

isNumber

public static boolean isNumber(char c)

normalize

public static String normalize(String in)

normalize

public static String normalize(String in,
                               int ascii,
                               int spaceChar)

normalize

public static String normalize(String in,
                               int ascii,
                               int spaceChar,
                               int midDot)
This will normalize a Unicode String in various ways. This routine correctly handles characters outside the basic multilingual plane.

Parameters:
in - The String to be normalized
ascii - For characters conceptually in the ASCII range of ! through ~ (U+0021 through U+007E or U+FF01 through U+FF5E): if this is ChineseUtils.LEAVE, do nothing; if it is ASCII, map them from the Chinese full-width range to their ASCII values; if it is FULLWIDTH, do the reverse.
spaceChar - For characters that satisfy Character.isSpaceChar(): if this is ChineseUtils.LEAVE, do nothing; if it is ASCII, map them to the space character U+0020; if it is FULLWIDTH, map them to U+3000.
midDot - For middle dot characters: if this is ChineseUtils.LEAVE, do nothing; if it is NORMALIZE, map them to the extended Latin character U+00B7 (middle dot); if it is FULLWIDTH, map them to U+30FB (katakana middle dot).
Returns:
The in String normalized according to the other arguments.
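
The ascii and spaceChar behavior described above can be sketched as a standalone routine. This is an illustrative reimplementation under stated assumptions, not the ChineseUtils source: the class WidthNormalizeSketch and its constants are hypothetical stand-ins, the constant values follow the 0 = leave, 1 = ascii, 2 = fullwidth convention documented for main, and midDot handling is left out because the set of middle dot characters is not listed here.

```java
// Hypothetical illustration class; it only mirrors the documented parameter semantics.
final class WidthNormalizeSketch {

    // Local stand-ins for the mode flags (assumed values: 0 = leave, 1 = ascii, 2 = fullwidth).
    static final int LEAVE = 0;
    static final int ASCII = 1;
    static final int FULLWIDTH = 2;

    // Distance between U+0021 (!) and its full-width counterpart U+FF01.
    private static final int WIDTH_OFFSET = 0xFF01 - 0x0021;

    static String normalize(String in, int ascii, int spaceChar) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);      // read a whole code point, not a single char
            i += Character.charCount(cp);    // advance by 1 or 2 chars accordingly
            if (Character.isSpaceChar(cp)) {
                if (spaceChar == ASCII) {
                    cp = 0x0020;             // plain space
                } else if (spaceChar == FULLWIDTH) {
                    cp = 0x3000;             // ideographic space
                }
            } else if (ascii == ASCII && cp >= 0xFF01 && cp <= 0xFF5E) {
                cp -= WIDTH_OFFSET;          // full-width form to ASCII
            } else if (ascii == FULLWIDTH && cp >= 0x0021 && cp <= 0x007E) {
                cp += WIDTH_OFFSET;          // ASCII to full-width form
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC", an ideographic space, and a full-width digit one.
        String fullWidth = "\uFF21\uFF22\uFF23\u3000\uFF11";
        System.out.println(normalize(fullWidth, ASCII, ASCII)); // prints "ABC 1"
    }
}
```

Passing LEAVE for both arguments returns the input unchanged, and any character outside the two listed ranges, including characters outside the BMP, is copied through as a whole code point.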

main

public static void main(String[] args)
                 throws IOException
Mainly for testing. Usage: ChineseUtils ascii spaceChar word*

ascii and spaceChar are integers: 0 = leave, 1 = ascii, 2 = fullwidth. The words listed are then normalized and sent to stdout. If no words are given, the program reads from and normalizes stdin. Input is assumed to be in UTF-8.

Parameters:
args - Command line arguments as above
Throws:
IOException - If there are any problems accessing command-line files

shapeOf

public static String shapeOf(String input,
                             boolean augmentedDateChars,
                             boolean useMidDotShape)


Stanford NLP Group