java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer<T>
      edu.stanford.nlp.process.PTBTokenizer<T>
public class PTBTokenizer<T>
Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Word objects in Penn Treebank format. It can optionally return carriage returns as tokens.
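To make the conventions concrete, here is a minimal sketch of Penn Treebank-style splitting written against the JDK only. It is not the PTBTokenizer lexer and covers only a few rules (clitics such as "n't" and "'s", stand-alone punctuation, round brackets mapped to -LRB- and -RRB-). The class name PtbStyleSketch and the CR_TOKEN placeholder string are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration of PTB-style tokenization; NOT the library's lexer. */
final class PtbStyleSketch {

    /** Placeholder text for a newline token (an assumption of this sketch). */
    static final String CR_TOKEN = "*CR*";

    // Alternatives are tried in order, so clitics are matched before plain words.
    private static final Pattern TOKEN = Pattern.compile(
          "n't\\b"                      // negative clitic, e.g. "ca" + "n't"
        + "|'(?:s|re|ve|ll|d|m)\\b"     // other clitics, e.g. "'s", "'ll"
        + "|[A-Za-z0-9]+?(?=n't\\b)"    // word stem that precedes "n't"
        + "|[A-Za-z0-9]+"               // ordinary word or number
        + "|\\r?\\n"                    // line break
        + "|\\S");                      // any other single non-space character

    /** Splits text into tokens; line breaks become CR_TOKEN only if tokenizeCRs is true. */
    static List<String> tokenize(String text, boolean tokenizeCRs) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group();
            if (t.endsWith("\n")) {
                if (tokenizeCRs) {
                    tokens.add(CR_TOKEN);
                }
                continue; // otherwise the line break is treated as whitespace
            }
            switch (t) {
                case "(": tokens.add("-LRB-"); break;
                case ")": tokens.add("-RRB-"); break;
                default:  tokens.add(t);
            }
        }
        return tokens;
    }
}
```

With this sketch, tokenizing "I can't go (yet)." with tokenizeCRs set to false yields the tokens I, ca, n't, go, -LRB-, yet, -RRB- and a final period. Quote conversion and the many other rules of the real lexer are deliberately left out.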
Nested Class Summary | |
---|---|
static class |
PTBTokenizer.PTBTokenizerFactory<T>
|
Field Summary | |
---|---|
static String |
AFTER_KEY
|
static String |
BEFORE_KEY
|
static String |
CURRENT_KEY
|
static String |
END_POSITION_KEY
|
static String |
START_POSITION_KEY
|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r,
boolean tokenizeCRs,
LexedTokenFactory<T> tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. |
Method Summary | ||
---|---|---|
static TokenizerFactory<Word> |
factory()
|
|
static TokenizerFactory<Word> |
factory(boolean tokenizeCRs)
|
|
static TokenizerFactory<FeatureLabel> |
factory(boolean tokenizeCRs,
boolean invertible)
|
|
static <T> TokenizerFactory<T> |
factory(boolean tokenizeCRs,
LexedTokenFactory<T> factory)
|
|
protected T |
getNext()
Internally fetches the next token. |
|
static void |
main(String[] args)
Reads files named as arguments and prints their tokens one per line. |
|
static PTBTokenizer<Word> |
newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace. |
|
static PTBTokenizer<Word> |
newPTBTokenizer(Reader r,
boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
|
static PTBTokenizer<FeatureLabel> |
newPTBTokenizer(Reader r,
boolean tokenizeCRs,
boolean invertible)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token and optionally produces invertible FeatureLabel tokens. |
|
static String |
ptb2Text(List ptbWords)
Returns a presentable version of the given PTB-tokenized words. |
|
static String |
ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text. |
|
void |
setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String BEFORE_KEY
public static final String CURRENT_KEY
public static final String AFTER_KEY
public static final String START_POSITION_KEY
public static final String END_POSITION_KEY
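The five keys above appear to name the pieces of information an invertible token carries: the text before it, the token itself, the text after it, and its start and end character offsets. The sketch below illustrates that idea with plain JDK code. The Tok class, the whitespace-only splitting, and the choice to share a whitespace run between one token's "after" and the next token's "before" are all assumptions of this example, not the library's FeatureLabel behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of "invertible" tokens: enough context is kept to rebuild the input. */
final class InvertibleSketch {

    /** Hypothetical token holder; field names mirror the *_KEY constants. */
    static final class Tok {
        final String before;   // whitespace preceding the token
        final String current;  // the token text itself
        final String after;    // whitespace following the token
        final int start;       // offset of the first character (inclusive)
        final int end;         // offset just past the last character (exclusive)

        Tok(String before, String current, String after, int start, int end) {
            this.before = before;
            this.current = current;
            this.after = after;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace while recording surrounding whitespace and offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            if (i == n) break; // only trailing whitespace was left
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            int end = i;
            int j = end; // look ahead without consuming, so the next token sees it as "before"
            while (j < n && Character.isWhitespace(text.charAt(j))) j++;
            out.add(new Tok(text.substring(wsStart, start), text.substring(start, end),
                            text.substring(end, j), start, end));
        }
        return out;
    }

    /** Rebuilds the original text from the first "before" plus every token and its "after". */
    static String reconstruct(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < toks.size(); k++) {
            Tok t = toks.get(k);
            if (k == 0) sb.append(t.before);
            sb.append(t.current).append(t.after);
        }
        return sb.toString();
    }
}
```

For any input containing at least one token, reconstruct(tokenize(text)) returns the original text, and text.substring(tok.start, tok.end) equals tok.current for every token.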
Constructor Detail |
---|
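public PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
Method Detail |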
---|
public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
public static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeCRs)
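Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as Words whose text is the value of PTBLexer.cr.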
public static PTBTokenizer<FeatureLabel> newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible)
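Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. Carriage returns come back as tokens whose text is the value of PTBLexer.cr.
Parameters:
invertible - If set to true, produces FeatureLabels that have fields for the string before and after the token, and for the character offsets.
protected T getNext()
Internally fetches the next token.
getNext in class AbstractTokenizer<T>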
public void setSource(Reader r)
public static String ptb2Text(String ptbText)
public static String ptb2Text(List ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined into a single string and ptb2Text(String) is called on the output. This method checks whether the elements in the list are subtypes of Word; if so, it takes their word() values to prevent additional text (e.g., POS tags) from creeping in. Otherwise, the toString value is used.
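The two-step behavior described above (pick a string for each element, join, then clean up the joined text) can be sketched with the JDK alone. This is a rough illustration rather than the library's ptb2Text: the HasWordSketch interface is a hypothetical stand-in for Word, joining with single spaces is an assumption, and only a handful of cleanup rules are shown.

```java
import java.util.List;

/** Simplified illustration of turning PTB-style tokens back into readable text. */
final class Ptb2TextSketch {

    /** Hypothetical stand-in for a token type that exposes its bare word. */
    interface HasWordSketch {
        String word();
    }

    /** Uses word() when available, otherwise toString(), then joins with spaces. */
    static String join(List<?> words) {
        StringBuilder sb = new StringBuilder();
        for (Object o : words) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(o instanceof HasWordSketch
                    ? ((HasWordSketch) o).word()
                    : String.valueOf(o));
        }
        return sb.toString();
    }

    /** Undoes a few PTB conventions in space-separated token text. */
    static String detokenize(String ptbText) {
        return ptbText
            .replace("-LRB- ", "(")                               // opening bracket hugs the next word
            .replace(" -RRB-", ")")                               // closing bracket hugs the previous word
            .replaceAll(" ([.,;:!?])", "$1")                      // punctuation attaches to the left
            .replaceAll(" (n't|'(?:s|re|ve|ll|d|m))\\b", "$1");   // clitics attach to the left
    }

    /** List variant: join first, then detokenize the joined string. */
    static String ptb2Text(List<?> words) {
        return detokenize(join(words));
    }
}
```

Under these rules, detokenize("I ca n't go -LRB- yet -RRB- .") gives "I can't go (yet).". Using word() rather than toString() matters when toString() would append extra material such as a POS tag.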
public static TokenizerFactory<Word> factory()
public static TokenizerFactory<Word> factory(boolean tokenizeCRs)
public static <T> TokenizerFactory<T> factory(boolean tokenizeCRs, LexedTokenFactory<T> factory)
public static TokenizerFactory<FeatureLabel> factory(boolean tokenizeCRs, boolean invertible)
public static void main(String[] args) throws IOException
java edu.stanford.nlp.process.PTBTokenizer [-charset charset] [-nl] filename+
Options:
Parameters:
args - Command line arguments
Throws:
IOException
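The usage line above, [-charset charset] [-nl] filename+, can be read as two optional flags followed by one or more file names. The following sketch parses that shape with plain JDK code. It is not the library's main method: the class name PtbArgsSketch, the "UTF-8" fallback, and the decision to reject an empty file list are assumptions, and the meaning of -nl is not interpreted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustration of parsing "[-charset charset] [-nl] filename+"; not the real main(). */
final class PtbArgsSketch {
    final String charset;      // value given after -charset, or the placeholder default
    final boolean nl;          // true if -nl was present
    final List<String> files;  // one or more file names

    private PtbArgsSketch(String charset, boolean nl, List<String> files) {
        this.charset = charset;
        this.nl = nl;
        this.files = Collections.unmodifiableList(files);
    }

    static PtbArgsSketch parse(String[] args) {
        String charset = "UTF-8"; // placeholder default, an assumption of this sketch
        boolean nl = false;
        List<String> files = new ArrayList<>();
        int i = 0;
        while (i < args.length) {
            String a = args[i];
            if (a.equals("-charset")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-charset requires a value");
                }
                charset = args[i + 1];
                i += 2;
            } else if (a.equals("-nl")) {
                nl = true;
                i++;
            } else {
                files.add(a); // anything else is treated as a file name
                i++;
            }
        }
        if (files.isEmpty()) {
            throw new IllegalArgumentException("at least one filename is required");
        }
        return new PtbArgsSketch(charset, nl, files);
    }
}
```

For example, parsing the arguments -charset ISO-8859-1 -nl a.txt b.txt produces charset "ISO-8859-1", nl set to true, and the two file names in order.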