public class PTBTokenizer<T extends HasWord> extends AbstractTokenizer<T>
New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String)
constructor. The other constructors are historical.
You specify the type of result tokens with a LexedTokenFactory, and you
control how tokens are treated through (mostly boolean) options given in a
comma-separated String (e.g., "invertible,normalizeParentheses=true").
If the String is null or empty, you get the traditional PTB3 normalization
behaviour (i.e., you get ptb3Escaping=true). If you want no normalization,
pass in the String "ptb3Escaping=false". The known option names are:
A single instance of a PTBTokenizer is not thread safe, as it uses a non-threadsafe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a PTBTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
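The thread-safety note above amounts to confining each tokenizer to a single thread. The following is a minimal sketch of that pattern using only the JDK. `ToyTokenizer` is a hypothetical stand-in written for this illustration (a whitespace splitter with a mutable cursor), not PTBTokenizer itself, and `tokenizeAll` is likewise an assumed helper name. The point is only that every task constructs its own instance and never shares it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadTokenizerSketch {

    /** Hypothetical stand-in for a stateful, non-thread-safe tokenizer. */
    static final class ToyTokenizer {
        private final String[] pieces;
        private int position = 0; // mutable cursor: the reason instances must not be shared

        ToyTokenizer(String text) {
            String trimmed = text.trim();
            this.pieces = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        boolean hasNext() {
            return position < pieces.length;
        }

        String next() {
            return pieces[position++];
        }
    }

    /** Tokenizes each document on a worker thread, creating a fresh tokenizer per task. */
    static List<List<String>> tokenizeAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<List<String>> task = () -> {
                    ToyTokenizer tokenizer = new ToyTokenizer(doc); // local to this task, never shared
                    List<String> tokens = new ArrayList<>();
                    while (tokenizer.hasNext()) {
                        tokens.add(tokenizer.next());
                    }
                    return tokens;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while tokenizing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Tokenization task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(Arrays.asList("a b c", "d e"), 2));
    }
}
```

With the real classes, the same shape would apply: construct a new PTBTokenizer (and, if needed, a new PTBTokenizerFactory) inside each task rather than sharing one across threads.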
Modifier and Type | Class and Description |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T extends HasWord>: This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader. |
Fields inherited from class AbstractTokenizer: nextToken
Constructor and Description |
---|
PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options): Constructs a new PTBTokenizer with a custom LexedTokenFactory. |
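The options argument of this constructor is a comma-separated string such as "invertible,normalizeParentheses=true". To make that format concrete, here is a small JDK-only sketch that splits such a string into name/value pairs. It is not PTBTokenizer's own parsing code: the class name `OptionsStringSketch`, the `parse` method, and the rule that a bare name counts as "true" are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionsStringSketch {

    /**
     * Splits a comma-separated options string into name/value pairs.
     * Assumption: a bare name (no "=value") means the option is switched on.
     */
    static Map<String, String> parse(String options) {
        Map<String, String> parsed = new LinkedHashMap<>();
        if (options == null || options.trim().isEmpty()) {
            return parsed; // null or empty: nothing set, the caller falls back to defaults
        }
        for (String item : options.split(",")) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                parsed.put(trimmed, "true");
            } else {
                String name = trimmed.substring(0, eq).trim();
                String value = trimmed.substring(eq + 1).trim();
                parsed.put(name, value);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse("invertible,normalizeParentheses=true"));
        System.out.println(parse("ptb3Escaping=false"));
    }
}
```

A null or empty string yields an empty map here, which mirrors the documented behaviour only in spirit: in the real tokenizer that case selects the traditional PTB3 normalizations.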
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<CoreLabel> | coreLabelFactory() |
static TokenizerFactory<CoreLabel> | coreLabelFactory(java.lang.String options) |
static TokenizerFactory<Word> | factory() |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeNLs, boolean invertible) |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, java.lang.String options): Get a TokenizerFactory that does Penn Treebank tokenization. |
static java.lang.String | getNewlineToken(): Returns the string literal inserted for newlines when the -tokenizeNLs option is set. |
protected T | getNext(): Internally fetches the next token. |
static java.lang.String | labelList2Text(java.util.List<? extends HasWord> ptbWords): Returns a presentable version of the given PTB-tokenized words. |
static void | main(java.lang.String[] args): Reads files given as arguments and prints their tokens, by default one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(java.io.Reader r): Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible): Constructs a new PTBTokenizer that makes CoreLabel tokens. |
static java.lang.String | ptb2Text(java.util.List<java.lang.String> ptbWords): Returns a presentable version of the given PTB-tokenized words. |
static int | ptb2Text(java.io.Reader ptbText, java.io.Writer w): Writes a presentable version of the given PTB-tokenized text. |
static java.lang.String | ptb2Text(java.lang.String ptbText): Returns a presentable version of the given PTB-tokenized text. |
static java.lang.String | ptbToken2Text(java.lang.String ptbText): Returns a presentable version of a given PTB token. |
Methods inherited from class AbstractTokenizer: hasNext, next, peek, remove, tokenize
public PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options)
r - The Reader to read tokens from.
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
options - Options to the lexer. See the extensive documentation in the class javadoc. The String may be null or empty, which means that all traditional PTB normalizations are done. You can pass in "ptb3Escaping=false" and have no normalizations done (that is, the behavior of the old suppressEscaping=true option).
public static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r)
r - The Reader whose contents will be tokenized
Returns a PTBTokenizer that produces tokens of type Word.
public static PTBTokenizer<CoreLabel> newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible)
Constructs a new PTBTokenizer that makes CoreLabel tokens. When tokenizeNLs is true, newlines are returned as separate tokens whose text is the value of PTBLexer.NEWLINE_TOKEN.
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If set to true, the tokenizer produces CoreLabels that have fields for the string before and after the token, and for the character offsets
protected T getNext()
Internally fetches the next token.
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static java.lang.String getNewlineToken()
public static java.lang.String ptb2Text(java.lang.String ptbText)
Note: If your tokens have maintained the OriginalTextAnnotation and the BeforeAnnotation and the AfterAnnotation, then rather than doing this you can actually precisely reconstruct the text they were made from!
ptbText - A String in PTB3-escaped form
public static java.lang.String ptbToken2Text(java.lang.String ptbText)
public static int ptb2Text(java.io.Reader ptbText, java.io.Writer w) throws java.io.IOException
java.io.IOException
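public static java.lang.String ptb2Text(java.util.List<java.lang.String> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined into a single space-separated string, and ptb2Text(String) is then called on the output.
ptbWords - A list of String
public static java.lang.String labelList2Text(java.util.List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined into a single space-separated string, and ptb2Text(String) is then called on the output. This method will take the word() values to prevent additional text from creeping in (e.g., POS tags).
ptbWords - A list of HasWord objects
public static TokenizerFactory<Word> factory()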
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
public static TokenizerFactory<CoreLabel> coreLabelFactory()
public static TokenizerFactory<CoreLabel> coreLabelFactory(java.lang.String options)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, java.lang.String options)
T - The type of the tokens built by the LexedTokenFactory
factory - A TokenFactory that determines what form of token is returned by the Tokenizer
options - A String specifying options (see the class javadoc for details)
public static void main(java.lang.String[] args) throws java.io.IOException
java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
args - Command line arguments
java.io.IOException - If there is any file I/O problem