edu.stanford.nlp.trees.international.pennchinese
Class CHTBTokenizer
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<String>
edu.stanford.nlp.trees.international.pennchinese.CHTBTokenizer
- All Implemented Interfaces:
- Tokenizer<String>, Iterator<String>
public class CHTBTokenizer
- extends AbstractTokenizer<String>
A simple tokenizer for Penn Chinese Treebank files. A token is
any parenthesis, node label, or terminal. All SGML content in
the files is ignored.
- Author:
- Roger Levy
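To make the token definition above concrete, here is a minimal, self-contained sketch of that style of tokenization. It is not the CHTBTokenizer implementation: the class BracketTokenSketch and its tokenize method are hypothetical names introduced for illustration. The sketch also assumes that "SGML content" means only the markup between < and >, so it drops the tags themselves and nothing else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Stanford CHTBTokenizer source.
public class BracketTokenSketch {

    /** Splits bracketed-tree text into parentheses, labels, and terminals, skipping <...> tags. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inTag) {
                // Assumption: everything up to the closing '>' is SGML markup.
                if (c == '>') inTag = false;
                continue;
            }
            if (c == '<') {
                flush(current, tokens);
                inTag = true;
            } else if (c == '(' || c == ')') {
                // Each parenthesis is a token of its own.
                flush(current, tokens);
                tokens.add(String.valueOf(c));
            } else if (Character.isWhitespace(c)) {
                flush(current, tokens);
            } else {
                // Part of a node label or a terminal.
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Emits the pending label or terminal, if any. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(tokenize("<S ID=1>\n( (NP (NN \u4e2d\u56fd)) )\n</S>"));
    }
}
```

Under these assumptions, the sentence tags contribute no tokens, while each parenthesis, each label (NP, NN), and the terminal come out as separate strings.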
Method Summary
String getNext()
Internally fetches the next token.
static void main(String[] args)
Tokenizes a file in the specified encoding and prints the
tokens to standard output in the same encoding.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CHTBTokenizer
public CHTBTokenizer(Reader r)
- Constructs a new tokenizer from a Reader. Note that decoding
the bytes behind the Reader into Java-internal Unicode is
not the tokenizer's job. This can be done by converting the
file with ConvertEncodingThread, or by specifying the file's
encoding explicitly in the Reader with java.io.InputStreamReader.
- Parameters:
r - Reader
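The encoding note above can be illustrated with plain java.io. The sketch below shows one way to build a Reader that decodes bytes with an explicitly named charset. The class EncodedReaderSketch and its helper methods are hypothetical, and UTF-8 is only an example value; the actual encoding of a given treebank file has to be known by the caller. With the Stanford library on the classpath, the resulting Reader is the kind of object that could be passed to new CHTBTokenizer(r).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; only java.io and java.nio classes are used.
public class EncodedReaderSketch {

    /** Wraps a byte stream in a Reader that decodes it with the named charset. */
    static Reader openDecoded(InputStream in, String encoding) {
        return new InputStreamReader(in, Charset.forName(encoding));
    }

    /** Drains a Reader into a String, to show the characters a consumer would see. */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for file contents; a real caller would open a FileInputStream.
        String tree = "( (NP (NN \u4e2d\u56fd)) )";
        byte[] bytes = tree.getBytes(StandardCharsets.UTF_8);
        try (Reader r = openDecoded(new ByteArrayInputStream(bytes), "UTF-8")) {
            // The decoded characters match the original text.
            System.out.println(readAll(r).equals(tree));
        }
    }
}
```

Naming the charset explicitly avoids depending on the platform default, which may differ from the encoding the file was written in.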
getNext
public String getNext()
- Internally fetches the next token.
- Specified by:
getNext in class AbstractTokenizer<String>
- Returns:
- the next token in the token stream, or null if none exists.
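Because getNext() signals the end of the stream with null, callers normally reach it through the Iterator methods rather than directly. The following self-contained sketch shows one common way such a contract can back hasNext() and next(). It is an assumption about the general pattern, not the AbstractTokenizer source; LookaheadIterator and over are hypothetical names.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of a null-terminated getNext() contract.
public class NullTerminatedIteratorSketch {

    /** Subclasses supply getNext(), which returns null once the stream is exhausted. */
    abstract static class LookaheadIterator<T> implements Iterator<T> {
        private T buffered; // one-token lookahead; null means nothing is buffered

        protected abstract T getNext();

        @Override
        public boolean hasNext() {
            if (buffered == null) {
                buffered = getNext();
            }
            return buffered != null;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T result = buffered;
            buffered = null;
            return result;
        }
    }

    /** Builds an iterator over fixed tokens, standing in for a real token source. */
    static Iterator<String> over(String... tokens) {
        final Iterator<String> source = Arrays.asList(tokens).iterator();
        return new LookaheadIterator<String>() {
            @Override
            protected String getNext() {
                return source.hasNext() ? source.next() : null;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> it = over("(", "NP", ")");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

One consequence of this design is that null can never be a legitimate token, since it is reserved as the end-of-stream marker.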
main
public static void main(String[] args)
throws IOException
- The main() method tokenizes a file in the specified encoding
and prints the tokens to standard output in the same encoding.
Its arguments are (Infile, Encoding).
- Throws:
IOException
Stanford NLP Group