java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer<T>
    edu.stanford.nlp.process.PTBTokenizer<T>
public class PTBTokenizer<T extends HasWord>
Fast, rule-based tokenizer implementation, initially written to conform to the Penn Treebank tokenization conventions, but now providing a range of tokenization options over a broader space of Unicode text. It reads raw text and outputs tokens of classes that implement edu.stanford.nlp.ling.HasWord (typically a Word or a CoreLabel). It can optionally return end-of-line as a token.
New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String) constructor; the other constructors are historical. You specify the type of result tokens with a LexedTokenFactory, and you control how tokens are treated through options, most of them boolean, given as a comma-separated String (e.g., "invertible,normalizeParentheses=true"). If the String is null or empty, you get the traditional PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you want no normalization, pass in the String "ptb3Escaping=false".
The known option names are:
Nested Class Summary | |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T extends HasWord>. This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader. |
Field Summary |
---|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r, LexedTokenFactory<T> tokenFactory, String options) | Constructs a new PTBTokenizer with a custom LexedTokenFactory. |
Method Summary | | |
---|---|---|
static TokenizerFactory<Word> | factory() | |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeNLs, boolean invertible) | |
static <T extends HasWord> TokenizerFactory<T> | factory(boolean tokenizeNLs, LexedTokenFactory<T> factory) | |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, String options) | Get a TokenizerFactory that does Penn Treebank tokenization. |
protected T | getNext() | Internally fetches the next token. |
static String | labelList2Text(List<? extends HasWord> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static void | main(String[] args) | Reads files named as arguments and prints their tokens, by default one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r) | Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace. |
static PTBTokenizer<Word> | newPTBTokenizer(Reader r, boolean tokenizeNLs) | Constructs a new PTBTokenizer that optionally returns newlines as their own token. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(Reader r, boolean tokenizeNLs, boolean invertible) | Constructs a new PTBTokenizer that makes CoreLabel tokens. |
static String | ptb2Text(List<String> ptbWords) | Returns a presentable version of the given PTB-tokenized words. |
static int | ptb2Text(Reader ptbText, Writer w) | Writes a presentable version of the given PTB-tokenized text. |
static String | ptb2Text(String ptbText) | Returns a presentable version of the given PTB-tokenized text. |
static String | ptbToken2Text(String ptbText) | Returns a presentable version of a given PTB token. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public PTBTokenizer(Reader r, LexedTokenFactory<T> tokenFactory, String options)
Constructs a new PTBTokenizer with a custom LexedTokenFactory.
Parameters:
r - The Reader to read tokens from.
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
options - Options to the lexer. See the extensive documentation in the class javadoc. The String may be null or empty, which means that all traditional PTB normalizations are done. You can pass in "ptb3Escaping=false" and have no normalizations done (that is, the behavior of the old suppressEscaping=true option).
Method Detail |
---|
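public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace.
Parameters:
r - The Reader whose contents will be tokenized
Returns:
A PTBTokenizer that produces Word tokens
public static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeNLs)
Constructs a new PTBTokenizer that optionally returns newlines as their own token. Newlines, when returned, appear as tokens whose text is the value of PTBLexer.NEWLINE_TOKEN.
Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeNLs, boolean invertible)
Constructs a new PTBTokenizer that makes CoreLabel tokens. Newlines, when returned, appear as tokens whose text is the value of PTBLexer.NEWLINE_TOKEN.
Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If true, produces CoreLabels that carry fields for the string before and after each token, and the character offsets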
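The idea behind the invertible parameter can be illustrated without the library. The following is only a sketch under stated assumptions: InvertibleSketch and Tok are hypothetical names, the tokenizer merely splits on whitespace, and Tok is not CoreLabel. It shows how storing the text before and after each token, together with character offsets, lets the original input be rebuilt exactly.

```java
import java.util.ArrayList;
import java.util.List;

public class InvertibleSketch {

    /** Hypothetical token type holding only the fields the parameter text mentions. */
    public static final class Tok {
        public final String word;   // token text
        public final String before; // whitespace immediately preceding the token
        public final String after;  // whitespace immediately following the token
        public final int begin;     // offset of the first character (inclusive)
        public final int end;       // offset just past the last character (exclusive)

        Tok(String word, String before, String after, int begin, int end) {
            this.word = word;
            this.before = before;
            this.after = after;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Whitespace-only tokenization; an assumption for illustration, not PTB rules. */
    public static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i == n) {
                break; // trailing whitespace was already stored as 'after' of the last token
            }
            String before = text.substring(wsStart, i);
            int begin = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int end = i;
            int j = i;
            while (j < n && Character.isWhitespace(text.charAt(j))) {
                j++;
            }
            // i stays at 'end' so the same whitespace becomes 'before' of the next token.
            toks.add(new Tok(text.substring(begin, end), before, text.substring(end, j), begin, end));
        }
        return toks;
    }

    /** Rebuilds the input from the tokens (requires at least one token). */
    public static String restore(List<Tok> toks) {
        if (toks.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(toks.get(0).before);
        for (Tok t : toks) {
            sb.append(t.word).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "  Hello   world \n";
        List<Tok> toks = tokenize(text);
        System.out.println(restore(toks).equals(text));
    }
}
```

Because every character of the input is stored either in a token or in the whitespace around it, concatenating the first token's before text with each token's word and after text reproduces the input. Input that yields no tokens at all cannot be restored this way.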
protected T getNext()
Internally fetches the next token.
Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text.
Parameters:
ptbText - A String in PTB3-escaped form
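To make "PTB3-escaped form" concrete, the sketch below maps the Penn Treebank bracket and quote escapes back to ordinary characters. It is an illustration only: PtbUnescapeSketch and its methods are hypothetical names, the table covers just a few well-known escapes, and the library's ptb2Text may do considerably more (for example, adjusting spacing around punctuation).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PtbUnescapeSketch {

    // A partial table of Penn Treebank escapes; not the library's own table.
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();
    static {
        ESCAPES.put("-LRB-", "(");
        ESCAPES.put("-RRB-", ")");
        ESCAPES.put("-LSB-", "[");
        ESCAPES.put("-RSB-", "]");
        ESCAPES.put("-LCB-", "{");
        ESCAPES.put("-RCB-", "}");
        ESCAPES.put("``", "\"");
        ESCAPES.put("''", "\"");
    }

    /** Returns the plain form of one escaped token, or the token unchanged. */
    public static String unescapeToken(String token) {
        return ESCAPES.getOrDefault(token, token);
    }

    /** Unescapes each whitespace-separated token; spacing is normalized to single spaces. */
    public static String unescapeText(String ptbText) {
        StringBuilder sb = new StringBuilder();
        for (String tok : ptbText.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(unescapeToken(tok));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeText("-LRB- see `` this '' -RRB-"));
    }
}
```

This sketch leaves a space around every token, so the result still looks tokenized; producing genuinely presentable text also requires re-attaching punctuation, which is omitted here.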
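public static String ptbToken2Text(String ptbText)
Returns a presentable version of a given PTB token.
public static int ptb2Text(Reader ptbText, Writer w) throws IOException
Writes a presentable version of the given PTB-tokenized text.
Throws:
IOException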
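public static String ptb2Text(List<String> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined and ptb2Text(String) is called on the output.
Parameters:
ptbWords - A list of String
public static String labelList2Text(List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined and ptb2Text(String) is called on the output. This method takes the word() values to prevent additional text from creeping in (e.g., POS tags).
Parameters:
ptbWords - A list of HasWord objects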
public static TokenizerFactory<Word> factory()
public static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeNLs, LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, String options)
Get a TokenizerFactory that does Penn Treebank tokenization.
Type Parameters:
T - The type of the tokens built by the LexedTokenFactory
Parameters:
factory - A LexedTokenFactory that determines what form of token is returned by the Tokenizer
options - A String specifying options (see the class javadoc for details)
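The format of the options String (comma-separated entries, each either a bare name or name=value, with null or empty meaning ptb3Escaping=true) can be sketched as follows. This is not the tokenizer's own parser: OptionStringSketch is a hypothetical name, and treating a bare name such as "invertible" as "true" is an assumption based on the options being described as mainly boolean.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionStringSketch {

    /** Parses "name" or "name=value" entries separated by commas into an ordered map. */
    public static Map<String, String> parse(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        if (options == null || options.trim().isEmpty()) {
            // Documented default: null or empty means traditional PTB3 normalization.
            result.put("ptb3Escaping", "true");
            return result;
        }
        for (String entry : options.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                // Assumption: a bare option name switches the option on.
                result.put(trimmed, "true");
            } else {
                result.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("invertible,normalizeParentheses=true"));
        System.out.println(parse(null));
        System.out.println(parse("ptb3Escaping=false"));
    }
}
```

The sketch does not validate option names, and it does not model how the tokenizer treats options that are absent from a non-empty String.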
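public static void main(String[] args) throws IOException
Reads files named as arguments and prints their tokens, by default one per line.
java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
Parameters:
args - Command line arguments
Throws:
IOException - If there is any file I/O problem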