This is a short document
about how to use non-English character-set encodings, and Unicode, in
Java. It’s biased toward
people working on Chinese, but applies to everyone.
Internally, Java is ALWAYS Unicode. That
means that the char primitive always holds a Unicode (UTF-16) code unit,
and the same is true of the text content of the following classes:
String
java.util.regex.Pattern
java.util.regex.Matcher
all Readers (StringReader, BufferedReader, ...)
all Writers (StringWriter, PrintWriter, ...)
So once you've got any text
in these classes, you should always treat it as Unicode.
For Chinese, this means that each common Chinese character is always
exactly one char. (Rare characters outside the Basic Multilingual Plane
take two chars, a surrogate pair.) So internal to Java there is no
distinction between, say, Big5, GB, and Unicode representations of Chinese
characters.
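Here is a minimal sketch for checking how many chars a piece of Chinese text occupies inside Java. The class name CharDemo and its two helper methods are made up for illustration; String.length(), String.codePointCount() and Character.toChars() are standard library calls. The sample strings are written with Unicode escapes, so the source file's own encoding does not matter.

```java
class CharDemo {
    // Number of Java chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (abstract characters) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Two common Chinese characters, U+4E2D and U+6587.
        String common = "\u4e2d\u6587";
        // One rare character, U+20BB7, which lies outside the Basic Multilingual Plane.
        String rare = new String(Character.toChars(0x20BB7));

        System.out.println("common: chars=" + charCount(common)
                + " codePoints=" + codePointCount(common));
        System.out.println("rare:   chars=" + charCount(rare)
                + " codePoints=" + codePointCount(rare));
    }
}
```

For everyday Chinese text the two counts agree. They differ only for the rare characters that Java stores as a surrogate pair.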
Internal to its own world,
Java handles input and output using the Reader and Writer classes (plus their relevant subclasses), which read
from and write to character streams.
As an NLP person, this is generally the world you want to be thinking
in.
In the world external to
Java, not all characters are created equal. The mapping from an abstract character
to a readable or writable bit-sequence is specified by a character set
encoding. Except for certain
minimal extensions of the ASCII character set encoding, most character set
encodings do not map each character to an identical-length bit sequence. For example, in Big5
each Chinese character takes two bytes, while each ASCII character takes one.
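To see the variable-length point concretely, the sketch below encodes the same two-character string (the letter A followed by one Chinese character) in several encodings and reports the byte counts. The class name EncodedLengthDemo and the method encodedLength are hypothetical. I am also assuming that your Java runtime ships the Big5 and GB18030 charsets, which a full JDK normally does.

```java
import java.nio.charset.Charset;

class EncodedLengthDemo {
    // Number of bytes needed to write the text in the named encoding.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // 'A' followed by the Chinese character U+4E2D.
        String text = "A\u4e2d";
        String[] names = {"Big5", "GB18030", "UTF-8", "UTF-16BE"};
        for (String name : names) {
            System.out.println(name + ": " + encodedLength(text, name) + " bytes");
        }
    }
}
```

Inside Java the string is two chars whichever encoding you pick. Only the external byte count changes.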
But as someone working on
natural language, you normally want to avoid thinking of the outside world in
terms of byte streams. You want to
think of it in terms of character streams.
You can do this worry-free by using two crucial classes that connect the
two worlds:
InputStreamReader, which connects an InputStream to a Reader
OutputStreamWriter, which connects a Writer to an OutputStream
There are constructors for
these two classes that allow you to specify a character encoding:
InputStreamReader(InputStream in, String charsetName)
OutputStreamWriter(OutputStream out, String charsetName)
So if I wanted to read GB
(simplified Chinese) text from a file <filename>, I would use
(1) a FileInputStream, together with
(2) an InputStreamReader
and I would do
InputStream inputStream = new FileInputStream("<filename>");
Reader r = new InputStreamReader(inputStream, "GB18030");
and then I would just read
characters from the Reader r
however I wanted. (I might also want to wrap r in a BufferedReader to gain access to some convenience methods.)
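Putting the two bridging classes together, here is a sketch of a small converter that reads text in one encoding and writes it out in another. The class name Transcode, the method transcode, and the use of command-line arguments for the file names are my own choices for illustration. The choice of GB18030 as input and UTF-8 as output is likewise just an example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;

class Transcode {
    // Decodes bytes from 'in' using inCharset and re-encodes them to 'out'
    // using outCharset. In between, everything is plain Java chars.
    static void transcode(InputStream in, String inCharset,
                          OutputStream out, String outCharset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, inCharset));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, outCharset));
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        // Push any buffered characters through to the underlying byte stream.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: input file assumed to be GB18030; args[1]: output file, written as UTF-8.
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            transcode(in, "GB18030", out, "UTF-8");
        }
    }
}
```

Assuming the class is compiled, it could be run as: java Transcode input.txt output.txt (both file names are placeholders). The code between the Reader and the Writer never touches bytes, which is the whole point of the two bridging classes.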
This
webpage lists the character-set encodings supported by Java 1.5.
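You can also ask the Java runtime itself which encodings it supports, which is handy because the list varies between versions and installations. The sketch below uses the standard java.nio.charset.Charset class; the class name ListCharsets and the helper isSupported are made up for illustration.

```java
import java.nio.charset.Charset;

class ListCharsets {
    // True if this Java runtime can encode and decode the named charset.
    static boolean isSupported(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        // Print the canonical name of every charset this runtime knows about.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        System.out.println("GB18030 supported: " + isSupported("GB18030"));
        System.out.println("Big5 supported: " + isSupported("Big5"));
    }
}
```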