java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer<T>
      edu.stanford.nlp.process.PTBTokenizer<T>
public class PTBTokenizer<T>
Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Words in the Penn Treebank format. It can optionally return carriage returns as tokens.
Nested Class Summary | |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T> |
Field Summary |
---|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. |
Method Summary | | |
---|---|---|
static TokenizerFactory<Word> | factory() | |
static TokenizerFactory<Word> | factory(boolean tokenizeCRs) | |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeCRs, boolean invertible) | |
static TokenizerFactory<Word> | factory(boolean tokenizeCRs, boolean invertible, boolean suppressEscaping) | |
static <T> TokenizerFactory<T> | factory(boolean tokenizeCRs, LexedTokenFactory<T> factory) | |
protected T | getNext() | Internally fetches the next token. |
static void | main(String[] args) | Reads files named as arguments and prints their tokens one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r) | Constructs a new PTBTokenizer that treats carriage returns as normal whitespace. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r, boolean tokenizeCRs) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
static String | ptb2Text(List ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static int | ptb2Text(Reader ptbText, Writer w) | Returns a presentable version of the given PTB-tokenized text. |
static String | ptb2Text(String ptbText) | Returns a presentable version of the given PTB-tokenized text. |
void | setSource(Reader r) | Sets the source of this Tokenizer to be the Reader r. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
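public PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
Method Detail |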
---|
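public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
Parameters:
r - The Reader whose contents will be tokenized
Returns:
A PTBTokenizer that tokenizes a stream to objects of type Word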
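public static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as Words whose text is the value of PTBLexer.cr.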
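As a rough illustration of what the tokenizeCRs flag means, the following self-contained sketch splits text on whitespace and, when asked, emits each line break as its own token. It does not use the Stanford classes and does not apply Penn Treebank rules. The class name CrTokenSketch and the marker string CR_MARKER are hypothetical; this page associates the real marker value with PTBLexer.cr.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the Stanford implementation. */
final class CrTokenSketch {
    /** Made-up marker for a line-break token; an assumption for illustration. */
    static final String CR_MARKER = "*CR*";

    /** Splits on whitespace; line breaks become CR_MARKER tokens if tokenizeCRs is true. */
    static List<String> tokenize(String text, boolean tokenizeCRs) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                current.append(c);
                continue;
            }
            // Any whitespace character ends the token being built.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Treat "\r\n" as one line break: skip the '\r' and mark the '\n'.
            boolean crlf = c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n';
            if (tokenizeCRs && (c == '\n' || c == '\r') && !crlf) {
                tokens.add(CR_MARKER);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello world\nSecond line";
        System.out.println(tokenize(text, false)); // [Hello, world, Second, line]
        System.out.println(tokenize(text, true));  // [Hello, world, *CR*, Second, line]
    }
}
```

With the flag off, line breaks behave like any other whitespace, which mirrors the single-argument newPTBTokenizer(Reader) described above.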
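public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
invertible - If set to true, produces CoreLabels that have fields for the string before and after each token, and the character offsets.
protected T getNext()
Internally fetches the next token.
Specified by:
getNext in class AbstractTokenizer<T>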
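To illustrate why the invertible option of newPTBTokenizer records the string before and after each token together with character offsets, here is a minimal self-contained sketch. It is not the CoreLabel API: the InvertibleSketch and Token names and their fields are hypothetical, and tokens are split on whitespace only. The point is that with this extra information the original text can be rebuilt exactly from the token list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "invertible" tokens; not the CoreLabel API. */
final class InvertibleSketch {
    static final class Token {
        final String text;   // the token itself
        final String before; // whitespace between the previous token (or text start) and this one
        final String after;  // whitespace between this token and the next one (or text end)
        final int begin;     // offset of the token's first character
        final int end;       // offset one past the token's last character

        Token(String text, String before, String after, int begin, int end) {
            this.text = text;
            this.before = before;
            this.after = after;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Whitespace tokenizer that keeps surrounding whitespace and offsets. */
    static List<Token> tokenize(String s) {
        // First pass: find the [begin, end) span of every token.
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            if (i >= s.length()) break;
            int start = i;
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            spans.add(new int[] {start, i});
        }
        // Second pass: attach the whitespace on either side of each span.
        List<Token> tokens = new ArrayList<>();
        for (int k = 0; k < spans.size(); k++) {
            int begin = spans.get(k)[0];
            int end = spans.get(k)[1];
            int prevEnd = (k == 0) ? 0 : spans.get(k - 1)[1];
            int nextBegin = (k + 1 < spans.size()) ? spans.get(k + 1)[0] : s.length();
            tokens.add(new Token(s.substring(begin, end), s.substring(prevEnd, begin),
                    s.substring(end, nextBegin), begin, end));
        }
        return tokens;
    }

    /** Rebuilds the text: the first token's "before", then every token and its "after". */
    static String invert(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            Token t = tokens.get(k);
            if (k == 0) sb.append(t.before);
            sb.append(t.text).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "  Hello,  world \n";
        System.out.println(original.equals(invert(tokenize(original)))); // true
    }
}
```

In this sketch, text that contains no tokens at all cannot be recovered, because there is no token to carry the whitespace.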
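public void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.
Parameters:
r - The Reader to tokenize from
public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text.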
public static int ptb2Text(Reader ptbText, Writer w) throws IOException
Returns a presentable version of the given PTB-tokenized text.
Throws:
IOException
public static String ptb2Text(List ptbWords)
Returns a presentable version of the given PTB-tokenized words. This method joins the words with spaces and calls ptb2Text(String) on the output. It checks whether the elements in the list are subtypes of Word, and if so, it takes their word() values to prevent additional text from creeping in (e.g., POS tags). Otherwise the toString value is used.
Implementation note: At the moment, this can be called on a List of either String or Word. The typing should be cleaned up at some point.
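The element check described for ptb2Text(List) can be sketched without the library as follows. SimpleWord and TaggedWord are hypothetical stand-ins for Word and a subtype whose toString() carries a POS tag, and joining with a single space is an assumption. The sketch covers only the joining step; the real method then passes the joined string to ptb2Text(String).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the joining step; not the Stanford implementation. */
final class JoinWordsSketch {
    /** Stand-in for the library's Word type (an assumption for illustration). */
    static class SimpleWord {
        private final String word;
        SimpleWord(String word) { this.word = word; }
        String word() { return word; }
        @Override public String toString() { return word; }
    }

    /** Stand-in subtype whose toString() appends a POS tag. */
    static final class TaggedWord extends SimpleWord {
        private final String tag;
        TaggedWord(String word, String tag) { super(word); this.tag = tag; }
        @Override public String toString() { return word() + "/" + tag; }
    }

    /** Uses word() for SimpleWord subtypes and toString() for everything else. */
    static String join(List<?> items) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Object item : items) {
            if (!first) sb.append(' ');
            first = false;
            if (item instanceof SimpleWord) {
                sb.append(((SimpleWord) item).word()); // keeps tags out of the text
            } else {
                sb.append(item); // falls back to toString()
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> tagged = Arrays.<Object>asList(
                new TaggedWord("dogs", "NNS"), new TaggedWord("bark", "VBP"));
        System.out.println(join(tagged));                       // dogs bark
        System.out.println(join(Arrays.asList("dogs", "bark"))); // dogs bark
    }
}
```

Calling toString() on the tagged elements would have produced "dogs/NNS bark/VBP", which is the kind of extra text the documented check avoids.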
public static TokenizerFactory<Word> factory()
public static TokenizerFactory<Word> factory(boolean tokenizeCRs)
public static <T> TokenizerFactory<T> factory(boolean tokenizeCRs, LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeCRs, boolean invertible)
public static TokenizerFactory<Word> factory(boolean tokenizeCRs, boolean invertible, boolean suppressEscaping)
public static void main(String[] args) throws IOException
Reads files named as arguments and prints their tokens one per line.
Usage: java edu.stanford.nlp.process.PTBTokenizer [-charset charset] [-nl] filename+
Options:
Parameters:
args - Command line arguments
Throws:
IOException - If there is any file I/O problem