java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.process.PTBTokenizer<T>
public class PTBTokenizer<T extends HasWord>
Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Word objects in the Penn Treebank format. It can optionally return carriage returns as tokens.
Nested Class Summary | |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T extends HasWord> |
Field Summary |
---|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. |
Method Summary | ||
---|---|---|
static TokenizerFactory<Word> | factory() | |
static TokenizerFactory<Word> | factory(boolean tokenizeCRs) | |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeCRs, boolean invertible) | |
static TokenizerFactory<Word> | factory(boolean tokenizeCRs, boolean invertible, boolean suppressEscaping) | |
static <T extends HasWord> TokenizerFactory<T> | factory(boolean tokenizeCRs, LexedTokenFactory<T> factory) | |
protected T | getNext() | Internally fetches the next token. |
static String | labelList2Text(List<? extends HasWord> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static void | main(String[] args) | Reads files named as arguments and prints their tokens, by default one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r) | Constructs a new PTBTokenizer that treats carriage returns as normal whitespace. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r, boolean tokenizeCRs) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
static String | ptb2Text(List<String> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static int | ptb2Text(Reader ptbText, Writer w) | Returns a presentable version of the given PTB-tokenized text. |
static String | ptb2Text(String ptbText) | Returns a presentable version of the given PTB-tokenized text. |
static String | ptbToken2Text(String ptbText) | Returns a presentable version of a given PTB token. |
void | setSource(Reader r) | Sets the source of this Tokenizer to be the Reader r. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
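public PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
r - The Reader to read tokens from
tokenizeCRs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
tokenFactory - The LexedTokenFactory to use to create tokens from the text
Method Detail |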
---|
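public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
Parameters:
r - The Reader whose contents will be tokenized
Returns:
A PTBTokenizer that produces tokens of type Word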
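public static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
r - The Reader to read tokens from
tokenizeCRs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)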
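To make the effect of the tokenizeCRs flag concrete, here is a minimal, self-contained sketch. It is not the PTBTokenizer implementation: it only splits on whitespace, and the class name CrSketch and the placeholder CR_TOKEN are assumptions standing in for the real tokenizer and for the value of PTBLexer.cr, which this page does not show.

```java
import java.util.ArrayList;
import java.util.List;

class CrSketch {
    // Hypothetical placeholder; the real token text is whatever PTBLexer.cr holds.
    static final String CR_TOKEN = "<CR>";

    // Splits text on whitespace. When tokenizeCRs is true, each line break
    // is emitted as its own token instead of disappearing as whitespace.
    static List<String> tokenize(String text, boolean tokenizeCRs) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                current.append(c);
                continue;
            }
            // Any whitespace ends the token being built, if there is one.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Treat "\r\n" as a single line break, reported at the '\n'.
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                continue;
            }
            if (tokenizeCRs && (c == '\n' || c == '\r')) {
                tokens.add(CR_TOKEN);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello world\nSecond line";
        System.out.println(tokenize(text, false)); // [Hello, world, Second, line]
        System.out.println(tokenize(text, true));  // [Hello, world, <CR>, Second, line]
    }
}
```

With the flag off, the line break behaves like any other whitespace; with it on, downstream code can still see where lines ended.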
public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
r - The Reader to read tokens from
tokenizeCRs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If set to true, produces CoreLabels that have fields for the string before and after each token, and for the character offsets
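The invertible option is described as attaching the text before and after each token, plus character offsets. The following rough sketch shows why that information is enough to rebuild the original input. The Tok class and its field names are hypothetical and are not the CoreLabel API, and the whitespace-only splitting is a stand-in for real PTB tokenization.

```java
import java.util.ArrayList;
import java.util.List;

class InvertibleSketch {
    // Hypothetical token record; CoreLabel's real accessors are not shown on this page.
    static final class Tok {
        final String word;   // token text
        final String before; // whitespace immediately preceding the token
        final String after;  // whitespace immediately following the token
        final int begin;     // offset of the first character in the source
        final int end;       // offset one past the last character

        Tok(String word, String before, String after, int begin, int end) {
            this.word = word;
            this.before = before;
            this.after = after;
            this.begin = begin;
            this.end = end;
        }
    }

    // Whitespace tokenizer that records surrounding whitespace and offsets.
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            if (i == n) break; // only whitespace was left
            int begin = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            int end = i;
            int wsEnd = end;
            while (wsEnd < n && Character.isWhitespace(text.charAt(wsEnd))) wsEnd++;
            toks.add(new Tok(text.substring(begin, end),
                             text.substring(wsStart, begin),
                             text.substring(end, wsEnd),
                             begin, end));
        }
        return toks;
    }

    // Rebuilds the source text: leading whitespace, then each word plus what followed it.
    static String restore(List<Tok> toks) {
        if (toks.isEmpty()) return "";
        StringBuilder sb = new StringBuilder(toks.get(0).before);
        for (Tok t : toks) sb.append(t.word).append(t.after);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "  Hello,  world \n";
        List<Tok> toks = tokenize(text);
        System.out.println(restore(toks).equals(text)); // true
    }
}
```

Note that this sketch cannot restore an input consisting only of whitespace, since there is no token to carry it.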
protected T getNext()
Internally fetches the next token.
Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
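The inherited nextToken field together with hasNext, next, and peek suggests a one-token lookahead built on top of getNext(). A minimal sketch of that pattern follows. It assumes getNext() returns null at end of input, which this page does not state, and LookaheadSketch and its over helper are hypothetical names rather than the AbstractTokenizer source.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

abstract class LookaheadSketch<T> implements Iterator<T> {
    // Buffered token, or null when nothing is buffered.
    protected T nextToken;

    // Subclasses produce one token per call; null is assumed to mean "no more input".
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    // Returns the upcoming token without consuming it.
    public T peek() {
        if (!hasNext()) throw new NoSuchElementException();
        return nextToken;
    }

    // Example subclass: yields the whitespace-separated words of a string.
    static LookaheadSketch<String> over(String text) {
        final String trimmed = text.trim();
        final String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        return new LookaheadSketch<String>() {
            private int pos = 0;

            @Override
            protected String getNext() {
                return pos < parts.length ? parts[pos++] : null;
            }
        };
    }

    public static void main(String[] args) {
        LookaheadSketch<String> it = over("one two");
        System.out.println(it.peek()); // one
        System.out.println(it.next()); // one
        System.out.println(it.next()); // two
    }
}
```

Keeping the buffering in the base class means a concrete tokenizer only has to implement getNext().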
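public final void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.
Parameters:
r - The Reader to tokenize from
public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text.
public static String ptbToken2Text(String ptbText)
Returns a presentable version of a given PTB token.
public static int ptb2Text(Reader ptbText, Writer w) throws IOException
Returns a presentable version of the given PTB-tokenized text.
Throws:
IOException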
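public static String ptb2Text(List<String> ptbWords)
Returns a presentable version of the given PTB-tokenized words. Joins the words with spaces and calls ptb2Text(String) on the output.
Parameters:
ptbWords - A list of String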
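As an illustration of the list overload, the sketch below joins the words with single spaces (an assumption about the joining step) and then cleans up the result. The cleanup rules are a deliberately small stand-in, not the library's actual rules: they only restore the PTB bracket escapes -LRB- and -RRB- and tidy the spacing around common punctuation. DetokSketch and detok are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class DetokSketch {
    // Simplified cleanup of space-joined PTB tokens (illustrative rules only).
    static String detok(String ptbText) {
        // Restore the PTB escapes for round brackets.
        String s = ptbText.replace("-LRB-", "(").replace("-RRB-", ")");
        // Drop the space that tokenization put before closing punctuation.
        s = s.replaceAll(" ([,.;:!?)])", "$1");
        // Drop the space after an opening bracket.
        s = s.replaceAll("\\( ", "(");
        return s;
    }

    // List overload: join with spaces, then reuse the String version.
    static String detok(List<String> ptbWords) {
        return detok(String.join(" ", ptbWords));
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("Hello", ",", "world", "-LRB-", "again", "-RRB-", ".");
        System.out.println(detok(words)); // Hello, world (again).
    }
}
```

Delegating from the list version to the string version keeps the cleanup rules in one place.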
public static String labelList2Text(List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words. Joins the word() values with spaces and calls ptb2Text(String) on the output. This method takes the word() values to prevent additional text from creeping in (e.g., POS tags).
Parameters:
ptbWords - A list of HasWord objects
public static TokenizerFactory<Word> factory()
public static TokenizerFactory<Word> factory(boolean tokenizeCRs)
public static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeCRs, LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeCRs, boolean invertible)
public static TokenizerFactory<Word> factory(boolean tokenizeCRs, boolean invertible, boolean suppressEscaping)
public static void main(String[] args) throws IOException
java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
Parameters:
args - Command line arguments
Throws:
IOException - If any file I/O problem occurs