edu.stanford.nlp.trees.international.pennchinese
Class CHTBTokenizer
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<java.lang.String>
edu.stanford.nlp.trees.international.pennchinese.CHTBTokenizer
- All Implemented Interfaces:
- Tokenizer<java.lang.String>, java.util.Iterator<java.lang.String>
public class CHTBTokenizer
- extends AbstractTokenizer<java.lang.String>
A simple tokenizer for Penn Chinese Treebank files. A
token is any parenthesis, node label, or terminal. All SGML
content of the files is ignored.
- Author:
- Roger Levy
Constructor Summary
CHTBTokenizer(java.io.Reader r)
Constructs a new tokenizer from a Reader.
Method Summary
java.lang.String getNext()
Internally fetches the next token.
static void main(java.lang.String[] args)
The main() method tokenizes a file in the specified Encoding
and prints it to standard output in the specified Encoding.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CHTBTokenizer
public CHTBTokenizer(java.io.Reader r)
- Constructs a new tokenizer from a Reader. Note that converting
the bytes going into the Reader to Java-internal Unicode is
not the tokenizer's job. This can be done by converting the
file with ConvertEncodingThread, or by specifying
the file's encoding explicitly in the Reader with
java.io.InputStreamReader.
- Parameters:
r - Reader
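The second option above, naming the encoding when the Reader is built, might look like the following minimal sketch. It uses only standard java.io and java.nio classes. The class name ExplicitEncodingReader and its helper methods are hypothetical, an in-memory byte array stands in for a treebank file, and the CHTBTokenizer call appears only as a comment because it assumes the Stanford NLP classes are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the Stanford NLP library.
final class ExplicitEncodingReader {

    /** Wraps a byte stream in a Reader that decodes it with the named charset. */
    public static Reader open(InputStream in, String encoding) {
        return new InputStreamReader(in, Charset.forName(encoding));
    }

    /** Drains a Reader into a String; used here only to show the decoded text. */
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 bytes standing in for the contents of a treebank file.
        byte[] bytes = "( (NP (NN \u4e2d\u56fd)) )".getBytes("UTF-8");
        Reader r = open(new ByteArrayInputStream(bytes), "UTF-8");
        // With Stanford NLP on the classpath, the Reader would be passed on:
        //   CHTBTokenizer tok = new CHTBTokenizer(r);
        System.out.println(readAll(r));
    }
}
```

For a real file, a java.io.FileInputStream would take the place of the byte array, and the encoding name would be whatever the file was actually saved in.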
getNext
public java.lang.String getNext()
- Internally fetches the next token.
- Specified by:
getNext in class AbstractTokenizer<java.lang.String>
- Returns:
- the next token in the token stream, or null if none exists.
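To illustrate the contract described here (one token per call, null once the stream is exhausted), the following self-contained sketch implements a simplified stand-in tokenizer. It is not the CHTBTokenizer implementation: the class BracketTokenizerSketch is hypothetical, it treats each parenthesis as a token and any other run of non-whitespace characters as a label or terminal, and it approximates "ignoring SGML content" by skipping anything between '<' and '>' at a token boundary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; the real CHTBTokenizer may behave differently.
final class BracketTokenizerSketch {
    private final Reader in;
    private int pushedBack = -1; // one character of lookahead; -1 means empty

    public BracketTokenizerSketch(Reader in) {
        this.in = in;
    }

    private int read() throws IOException {
        if (pushedBack != -1) {
            int c = pushedBack;
            pushedBack = -1;
            return c;
        }
        return in.read();
    }

    /** Returns the next token, or null if none exists. */
    public String getNext() throws IOException {
        int c = read();
        // Skip whitespace and SGML-style tags that precede a token.
        while (c != -1) {
            if (Character.isWhitespace(c)) {
                c = read();
            } else if (c == '<') {
                while (c != -1 && c != '>') {
                    c = read();
                }
                if (c != -1) {
                    c = read();
                }
            } else {
                break;
            }
        }
        if (c == -1) {
            return null;
        }
        if (c == '(' || c == ')') {
            return String.valueOf((char) c);
        }
        // Node label or terminal: read up to whitespace or a parenthesis.
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c) && c != '(' && c != ')') {
            sb.append((char) c);
            c = read();
        }
        pushedBack = c; // keep the delimiter for the next call
        return sb.toString();
    }

    /** Collects every token by calling getNext() until it returns null. */
    public static List<String> tokenize(String text) throws IOException {
        BracketTokenizerSketch tok = new BracketTokenizerSketch(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        for (String t = tok.getNext(); t != null; t = tok.getNext()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("<S ID=1>\n( (IP (NP (NN \u4e2d\u56fd))) )\n</S>"));
    }
}
```

Since CHTBTokenizer also implements java.util.Iterator<java.lang.String>, client code would normally consume tokens through hasNext() and next() rather than calling getNext() directly.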
main
public static void main(java.lang.String[] args)
throws java.io.IOException
- The main() method tokenizes a file in the specified Encoding
and prints it to standard output in the specified Encoding.
Its arguments are (Infile, Encoding).
- Throws:
java.io.IOException
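The I/O plumbing that a main() with (Infile, Encoding) arguments implies can be sketched as follows: decode the input with the named encoding and write to the output stream in that same encoding. This is a sketch under stated assumptions, not the actual body of CHTBTokenizer.main(). The class EncodingRoundTripSketch is hypothetical, whitespace-separated chunks stand in for real tokens, and the one-chunk-per-line layout is an assumption, since this page does not specify the output format.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Hypothetical sketch; whitespace splitting stands in for real tokenization.
final class EncodingRoundTripSketch {

    /**
     * Decodes the input with the given encoding and writes each
     * whitespace-separated chunk on its own line, encoded the same way.
     */
    public static void echoChunks(InputStream in, OutputStream out, String encoding)
            throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, encoding));
        String line;
        while ((line = reader.readLine()) != null) {
            for (String chunk : line.trim().split("\\s+")) {
                if (!chunk.isEmpty()) {
                    writer.print(chunk + "\n");
                }
            }
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Mirrors the documented argument order: (Infile, Encoding).
        if (args.length != 2) {
            System.err.println("usage: EncodingRoundTripSketch infile encoding");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            echoChunks(in, System.out, args[1]);
        }
    }
}
```

Wrapping System.out in an OutputStreamWriter with an explicit encoding avoids depending on the platform default charset, which may differ from the encoding of the input file.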
Stanford NLP Group