public class PTBTokenizer<T extends HasWord> extends AbstractTokenizer<T>
New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String)
constructor. The other constructors are historical.
You specify the type of result tokens with a LexedTokenFactory, and you
control the treatment of tokens through mostly boolean options given in a
comma-separated String (e.g., "invertible,normalizeParentheses=true").
If the String is null or empty, you get the traditional
PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you
want no normalization, pass in the String
"ptb3Escaping=false". The known option names are:
& to an ampersand. Default is true.
A single instance of a PTBTokenizer is not thread safe, as it uses a non-threadsafe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a PTBTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
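Because a single tokenizer instance is not thread safe while multiple instances can be created safely, one possible arrangement is to build a fresh instance inside each task rather than sharing one. The sketch below shows that pattern with the standard library only. The names `PerTaskTokenizer`, `tokenizeAll`, and `whitespaceTokens` are hypothetical, and the whitespace splitter is merely a stand-in for real tokenization; in real use, the function passed in would construct its own PTBTokenizer on every call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PerTaskTokenizer {

    /**
     * Tokenizes each document on a worker thread. The tokenizeOne function is
     * expected to create a new tokenizer for every call, so no tokenizer
     * instance is ever shared between threads.
     */
    public static List<List<String>> tokenizeAll(List<String> docs,
                                                 Function<String, List<String>> tokenizeOne,
                                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<List<String>> task = () -> tokenizeOne.apply(doc);
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // results come back in input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical stand-in for real tokenization: splits on whitespace. */
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

A call such as `PerTaskTokenizer.tokenizeAll(docs, PerTaskTokenizer::whitespaceTokens, 4)` keeps all mutable tokenizer state confined to the task that created it.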
Modifier and Type | Class and Description |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory<T extends HasWord>: This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader. |
Fields inherited from class AbstractTokenizer: NEWLINE_TOKEN, nextToken
Constructor and Description |
---|
PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options): Constructs a new PTBTokenizer with a custom LexedTokenFactory. |
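The options argument of this constructor is the comma-separated string described in the class comment. As a minimal sketch of assembling such a string, the hypothetical helper below (the class `PtbOptions` and its `build` method are not part of the library) joins name=value pairs with commas. It assumes that writing `name=true` is equivalent to the bare `name` form, which the example "invertible,normalizeParentheses=true" suggests but does not state outright.

```java
import java.util.Map;
import java.util.StringJoiner;

public class PtbOptions {

    /**
     * Joins option flags into a comma-separated "name=value" string.
     * A null or empty map yields the empty string, which per the class
     * comment selects the traditional PTB3 normalization behaviour.
     */
    public static String build(Map<String, Boolean> flags) {
        if (flags == null || flags.isEmpty()) {
            return "";
        }
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Boolean> entry : flags.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }
}
```

Passing a `LinkedHashMap` keeps the options in insertion order, which makes the resulting string easier to compare in logs and tests.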
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<CoreLabel> | coreLabelFactory() |
static TokenizerFactory<CoreLabel> | coreLabelFactory(java.lang.String options) |
static TokenizerFactory<Word> | factory(): A historical factory method whose tokenizers return Word tokens. |
static TokenizerFactory<CoreLabel> | factory(boolean tokenizeNLs, boolean invertible) |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, java.lang.String options): Get a TokenizerFactory that does Penn Treebank tokenization. |
static java.lang.String | getNewlineToken(): Returns the string literal inserted for newlines when the -tokenizeNLs option is set. |
protected T | getNext(): Internally fetches the next token. |
static java.lang.String | labelList2Text(java.util.List<? extends HasWord> ptbWords): Returns a presentable version of the given PTB-tokenized words. |
static void | main(java.lang.String[] args): Reads files given as arguments and prints their tokens, by default one per line. |
static PTBTokenizer<Word> | newPTBTokenizer(java.io.Reader r): Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace. |
static PTBTokenizer<CoreLabel> | newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible): Constructs a new PTBTokenizer that makes CoreLabel tokens. |
static java.lang.String | ptb2Text(java.util.List<java.lang.String> ptbWords): Returns a presentable version of the given PTB-tokenized words. |
static long | ptb2Text(java.io.Reader ptbText, java.io.Writer w): Writes a presentable version of the given PTB-tokenized text. |
static java.lang.String | ptb2Text(java.lang.String ptbText): Returns a presentable version of the given PTB-tokenized text. |
static java.lang.String | ptbToken2Text(java.lang.String ptbText): Returns a presentable version of a given PTB token. |
Methods inherited from class AbstractTokenizer: hasNext, next, peek, remove, tokenize
public PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options)
r - The Reader to read tokens from.
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
options - Options to the lexer. See the extensive documentation in the class javadoc. The String may be null or empty, which means that all traditional PTB normalizations are done. You can pass in "ptb3Escaping=false" and have no normalizations done (that is, the behavior of the old suppressEscaping=true option).
public static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r)
r - The Reader whose contents will be tokenized
Returns: A PTBTokenizer that produces tokens of type Word
public static PTBTokenizer<CoreLabel> newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible)
When newlines are returned as tokens, their text is the value of AbstractTokenizer.NEWLINE_TOKEN.
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If set to true, produces CoreLabels that have fields for the string before and after, and the character offsets
protected T getNext()
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static java.lang.String getNewlineToken()
public static java.lang.String ptb2Text(java.lang.String ptbText)
Note: If your tokens have maintained the OriginalTextAnnotation, the BeforeAnnotation, and the AfterAnnotation, then rather than doing this you can precisely reconstruct the text they were made from.
ptbText - A String in PTB3-escaped form
public static java.lang.String ptbToken2Text(java.lang.String ptbText)
public static long ptb2Text(java.io.Reader ptbText, java.io.Writer w) throws java.io.IOException
Throws: java.io.IOException
public static java.lang.String ptb2Text(java.util.List<java.lang.String> ptbWords)
Joins the words with spaces and calls ptb2Text(String) on the output.
ptbWords - A list of String
public static java.lang.String labelList2Text(java.util.List<? extends HasWord> ptbWords)
Joins the words with spaces and calls ptb2Text(String) on the output.
ptbWords - A list of HasWord objects
public static TokenizerFactory<Word> factory()
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
public static TokenizerFactory<CoreLabel> coreLabelFactory()
public static TokenizerFactory<CoreLabel> coreLabelFactory(java.lang.String options)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, java.lang.String options)
T - The type of the tokens built by the LexedTokenFactory
factory - A LexedTokenFactory that determines what form of token is returned by the Tokenizer
options - A String specifying options (see the class javadoc for details)
public static void main(java.lang.String[] args) throws java.io.IOException
Usage: java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
A note on -preserveLines: Basically, if you use this option, your output file should have
the same number of lines as your input file. If not, there is a bug. But the truth of this statement
depends on how you count lines. Unicode includes "line separator" and "paragraph separator" characters,
and Unicode says that you should accept them. See, e.g., http://unicode.org/standard/reports/tr13/tr13-5.html
However, Unix and Linux utilities do not recognize them and count only the traditional \n|\r|\r\n, whereas PTBTokenizer does normalize line separation. Hence, if your input text contains, say, U+2028 Line Separator characters, the Unix wc utility will report more lines after tokenization than before, even though line breaks have been preserved according to Unicode. It may be useful to compare results with the Perl uniwc script from https://raw.githubusercontent.com/briandfoy/Unicode-Tussle/master/script/uniwc
If it reports the same number of input and output lines, then this difference is your problem, and in a certain Unicode sense, our tokenizer did indeed preserve the line count. If not, please send us a bug report. At present there is no way to disable this processing of Unicode separator characters. If you don't want this anomaly, you'll need to either delete these two characters or map them to conventional Unix newline characters, or to some other weirdo character.
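To check whether a line-count mismatch comes from these Unicode separators, it can help to count line breaks both ways and, if desired, map the two characters to a conventional newline before tokenizing. The following is a minimal standard-library sketch; the class `LineSeparatorCheck` and its methods are hypothetical names, not part of the tokenizer, and only U+2028 and U+2029 are treated as extra separators.

```java
public class LineSeparatorCheck {

    /**
     * Counts line breaks in the text. The traditional terminators \n, \r and
     * \r\n each count once. When unicodeAware is true, U+2028 (line separator)
     * and U+2029 (paragraph separator) are counted as well.
     */
    public static int countBreaks(String text, boolean unicodeAware) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat \r\n as a single line break.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                count++;
            } else if (c == '\n') {
                count++;
            } else if (unicodeAware && (c == '\u2028' || c == '\u2029')) {
                count++;
            }
        }
        return count;
    }

    /** Replaces U+2028 and U+2029 with a conventional Unix newline. */
    public static String mapSeparatorsToNewline(String text) {
        return text.replace('\u2028', '\n').replace('\u2029', '\n');
    }
}
```

If the two counts differ on the input, the text contains Unicode separators, and running `mapSeparatorsToNewline` on it first should make the traditional count agree with the Unicode-aware one.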
args - Command line arguments
Throws: java.io.IOException - If any file I/O problem occurs