java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer<T>
    edu.stanford.nlp.process.PTBTokenizer<T>
public class PTBTokenizer<T extends HasWord> extends AbstractTokenizer<T>
A Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Word objects in Penn Treebank format. It can optionally return carriage returns as tokens.
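To make the Penn Treebank conventions mentioned above more concrete, here is a toy sketch of a few of them: bracket escaping, punctuation splitting, and contraction splitting. The class name `PtbConventionSketch` and its regular expressions are illustrative assumptions, not part of this library; the real tokenizer is a generated lexer that covers far more cases.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of a few Penn Treebank conventions; not the PTBTokenizer lexer.
class PtbConventionSketch {

    /** Splits text into tokens using a handful of simplified PTB-style rules. */
    static List<String> tokenize(String text) {
        String s = text
            .replace("(", " -LRB- ")                      // brackets become escape tokens
            .replace(")", " -RRB- ")
            .replaceAll("([,;:!?])", " $1 ")              // split off common punctuation
            .replaceAll("\\.\\s*$", " .")                 // split only a text-final period
            .replaceAll("n't\\b", " n't")                 // can't -> ca n't
            .replaceAll("'(s|re|ve|ll|d|m)\\b", " '$1");  // it's -> it 's
        List<String> tokens = new ArrayList<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I can't go (yet)."));
    }
}
```

The sketch deliberately splits a period only at the end of the input, so abbreviations and decimals inside the text are left alone; a real PTB tokenizer handles those with dedicated rules.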
Nested Class Summary | |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T extends HasWord> |
Field Summary |
---|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r, LexedTokenFactory<T> tokenFactory, String options) | Constructs a new PTBTokenizer with a custom LexedTokenFactory. |
Method Summary | ||
---|---|---|
static TokenizerFactory<Word> | factory() | |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeNLs, boolean invertible) | |
static <T extends HasWord> TokenizerFactory<T> | factory(boolean tokenizeNLs, LexedTokenFactory<T> factory) | |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, String options) | |
protected T | getNext() | Internally fetches the next token. |
static String | labelList2Text(List<? extends HasWord> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static void | main(String[] args) | Reads files named as arguments and prints their tokens, by default one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r) | Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r, boolean tokenizeNLs) | Constructs a new PTBTokenizer that optionally returns newlines as their own token. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(Reader r, boolean tokenizeNLs, boolean invertible) | Constructs a new PTBTokenizer that makes CoreLabel tokens. |
static String | ptb2Text(List<String> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static int | ptb2Text(Reader ptbText, Writer w) | Writes a presentable version of the given PTB-tokenized text. |
static String | ptb2Text(String ptbText) | Returns a presentable version of the given PTB-tokenized text. |
static String | ptbToken2Text(String ptbText) | Returns a presentable version of a given PTB token. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public PTBTokenizer(Reader r, LexedTokenFactory<T> tokenFactory, String options)
Constructs a new PTBTokenizer with a custom LexedTokenFactory.
Parameters:
r - The Reader to read tokens from
tokenFactory - The LexedTokenFactory to use to create tokens from the text
options - Options to the lexer. See the extensive documentation in PTBLexer. The String may be null or empty, which means that all traditional PTB normalizations are done. Passing "ptb3Escaping=false" turns off all normalizations (that is, it gives the behavior of the old suppressEscaping=true option).
Method Detail |
---|
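public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace.
Parameters:
r - The Reader whose contents will be tokenized
Returns:
A PTBTokenizer that tokenizes a stream to objects of type Word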
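public static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeNLs)
Constructs a new PTBTokenizer that optionally returns newlines as their own token. Newlines come back as tokens whose text is the value of PTBLexer.NEWLINE_TOKEN.
Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)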
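The effect of the tokenizeNLs flag can be pictured with a small whitespace tokenizer. This is a sketch under stated assumptions: `NewlineTokenSketch` and the placeholder marker `<NL>` are hypothetical, and the real tokenizer emits whatever PTBLexer.NEWLINE_TOKEN holds rather than this marker.

```java
import java.util.ArrayList;
import java.util.List;

// Toy whitespace tokenizer showing newlines kept as tokens versus dropped.
class NewlineTokenSketch {

    // Placeholder marker (assumption); the library uses PTBLexer.NEWLINE_TOKEN instead.
    static final String NEWLINE_MARKER = "<NL>";

    /** Splits on whitespace; when tokenizeNLs is true, each line break yields a marker token. */
    static List<String> tokenize(String text, boolean tokenizeNLs) {
        List<String> tokens = new ArrayList<>();
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            for (String word : lines[i].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
            }
            // A line break exists after every line except the last one.
            if (tokenizeNLs && i < lines.length - 1) {
                tokens.add(NEWLINE_MARKER);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a b\nc", true));
        System.out.println(tokenize("a b\nc", false));
    }
}
```

With the flag off, the line break is treated like any other whitespace, so downstream code cannot tell where lines ended.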
public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeNLs, boolean invertible)
Constructs a new PTBTokenizer that makes CoreLabel tokens. It optionally returns newlines as their own token, whose text is the value of PTBLexer.NEWLINE_TOKEN.
Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If true, produces CoreLabels that carry fields for the strings before and after each token and for its character offsets
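The idea behind invertible tokens is that the text before and after each token, together with its character offsets, is enough to rebuild the input exactly. The following is a minimal sketch of that idea, assuming a plain whitespace tokenizer; the `InvertibleTokenSketch` and `Tok` classes are hypothetical stand-ins and do not reproduce CoreLabel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of invertible tokens: each token remembers its surrounding text and offsets.
class InvertibleTokenSketch {

    static final class Tok {
        final String word;
        final String before;  // text between the previous token and this one
        final String after;   // text between this token and the next one
        final int begin;      // offset of the first character, inclusive
        final int end;        // offset past the last character, exclusive

        Tok(String word, String before, String after, int begin, int end) {
            this.word = word;
            this.before = before;
            this.after = after;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Whitespace tokenization that records before/after strings and offsets. */
    static List<Tok> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        List<Tok> toks = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            int begin = spans.get(i)[0];
            int end = spans.get(i)[1];
            int prevEnd = (i == 0) ? 0 : spans.get(i - 1)[1];
            int nextBegin = (i == spans.size() - 1) ? text.length() : spans.get(i + 1)[0];
            toks.add(new Tok(text.substring(begin, end), text.substring(prevEnd, begin),
                    text.substring(end, nextBegin), begin, end));
        }
        return toks;
    }

    /** Rebuilds the original text from the first token's before plus every word and after. */
    static String restore(List<Tok> toks) {
        if (toks.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(toks.get(0).before);
        for (Tok t : toks) {
            sb.append(t.word).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "  Hello,  world \n";
        System.out.println(restore(tokenize(text)).equals(text));
    }
}
```

One limitation of this sketch is that input consisting only of whitespace produces no tokens and so cannot be restored.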
protected T getNext()
Internally fetches the next token.
Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
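public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text.
Parameters:
ptbText - A String in PTB3-escaped form
public static String ptbToken2Text(String ptbText)
Returns a presentable version of a given PTB token.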
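As a rough picture of what turning a single PTB token back into presentable text involves, the sketch below maps the usual bracket and quote escapes to their surface characters. The class `PtbTokenUnescapeSketch` and its table are assumptions for illustration; the library method may cover additional escapes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in only: this is not the ptbToken2Text implementation.
class PtbTokenUnescapeSketch {

    // Bracket and quote escapes commonly found in Penn Treebank text.
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();
    static {
        ESCAPES.put("-LRB-", "(");
        ESCAPES.put("-RRB-", ")");
        ESCAPES.put("-LSB-", "[");
        ESCAPES.put("-RSB-", "]");
        ESCAPES.put("-LCB-", "{");
        ESCAPES.put("-RCB-", "}");
        ESCAPES.put("``", "\"");
        ESCAPES.put("''", "\"");
    }

    /** Maps one escaped token to its surface form; any other token passes through unchanged. */
    static String unescapeToken(String token) {
        return ESCAPES.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"-LRB-", "see", "``", "-RRB-"}) {
            System.out.println(t + " -> " + unescapeToken(t));
        }
    }
}
```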
public static int ptb2Text(Reader ptbText, Writer w) throws IOException
Writes a presentable version of the given PTB-tokenized text.
Throws:
IOException
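public static String ptb2Text(List<String> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined with spaces and ptb2Text(String) is called on the output.
Parameters:
ptbWords - A list of String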
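The join-then-tidy approach for a list of PTB words can be sketched as follows. The class `PtbWordsToTextSketch` and its regular expressions are illustrative assumptions that cover only round brackets, common punctuation, and a few contractions; they are not the rules used by ptb2Text.

```java
import java.util.Arrays;
import java.util.List;

// Toy detokenizer: join PTB words with spaces, then undo a few PTB conventions.
class PtbWordsToTextSketch {

    /** Joins the words and removes spacing that tokenization introduced. */
    static String toText(List<String> ptbWords) {
        return String.join(" ", ptbWords)
            .replace("-LRB-", "(")                                 // undo bracket escapes
            .replace("-RRB-", ")")
            .replaceAll("\\( ", "(")                               // no space after an opening bracket
            .replaceAll(" \\)", ")")                               // no space before a closing bracket
            .replaceAll(" ([.,;:!?])", "$1")                       // reattach punctuation
            .replaceAll(" (n't|'s|'re|'ve|'ll|'d|'m)\\b", "$1");   // reattach contractions
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "I", "do", "n't", "like", "it", "-LRB-", "much", "-RRB-", ".");
        System.out.println(toText(words));
    }
}
```

Joining first and tidying afterwards keeps the sketch to a single pass of string rewrites, at the cost of not knowing which spaces were present in the original text.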
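public static String labelList2Text(List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined with spaces and ptb2Text(String) is called on the output. This method will take the word() values to prevent additional text from creeping in (e.g., POS tags).
Parameters:
ptbWords - A list of HasWord objects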
public static TokenizerFactory<Word> factory()
public static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeNLs, LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, String options)
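public static void main(String[] args) throws IOException
Reads files named as arguments and prints their tokens, by default one per line.
Usage:
java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
Parameters:
args - Command line arguments
Throws:
IOException - If there is any file I/O problem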