edu.stanford.nlp.process
Class PTBTokenizer<T extends HasWord>
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.process.PTBTokenizer<T>
- All Implemented Interfaces:
- Tokenizer<T>, java.util.Iterator<T>
public class PTBTokenizer<T extends HasWord>
- extends AbstractTokenizer<T>
Fast, rule-based tokenizer implementation, initially written to
conform to the Penn Treebank tokenization conventions, but now providing
a range of tokenization options over a broader space of Unicode text.
It reads raw text and outputs
tokens of classes that implement edu.stanford.nlp.trees.HasWord
(typically a Word or a CoreLabel). It can
optionally return end-of-line as a token.
New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String)
constructor. The other constructors are historical.
You specify the type of result tokens with a
LexedTokenFactory, and can specify the treatment of tokens by mainly boolean
options given in a comma separated String options
(e.g., "invertible,normalizeParentheses=true").
If the String is null
or empty, you get the traditional
PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you
want no normalization, then you should pass in the String
"ptb3Escaping=false". The known option names are:
- invertible: Store enough information about the original form of the
token and the whitespace around it that a list of tokens can be
faithfully converted back to the original String. Valid only if the
LexedTokenFactory is an instance of CoreLabelTokenFactory. The
keys used in it are: TextAnnotation for the tokenized form,
OriginalTextAnnotation for the original string, BeforeAnnotation and
AfterAnnotation for the whitespace before and after a token, and
perhaps CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation to record
token begin and end character offsets, if they were specified to be recorded
in TokenFactory construction. (As with the String class, begin and end
offsets are such that end - begin gives the token length.)
- tokenizeNLs: Whether end-of-lines should become tokens (or just
be treated as part of whitespace)
- ptb3Escaping: Enable all traditional PTB3 token transforms
(like parentheses becoming -LRB-, -RRB-). This is a macro flag that
sets or clears all the options below.
- americanize: Whether to rewrite common British English spellings
as American English spellings
- normalizeSpace: Whether any spaces inside tokens (phone numbers, fractions)
get turned into U+00A0 (non-breaking space). It's dangerous to turn
this off for most of our Stanford NLP software, which assumes no
spaces in tokens.
- normalizeAmpersandEntity: Whether to map the XML entity &amp; to a plain
ampersand character
- normalizeCurrency: Whether to do some awful lossy currency mappings
to turn common currency characters into $, #, or "cents", reflecting
the fact that nothing else appears in the old PTB3 WSJ. (No Euro!)
- normalizeFractions: Whether to map certain common composed
fraction characters to spelled out letter forms like "1/2"
- normalizeParentheses: Whether to map round parentheses to -LRB-,
-RRB-, as in the Penn Treebank
- normalizeOtherBrackets: Whether to map other common bracket characters
to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
- asciiQuotes: Whether to map quote characters to the traditional ' and "
- latexQuotes: Whether to map quotes to ``, `, ', '', as in LaTeX
and the PTB3 WSJ (though this is now heavily frowned on in Unicode).
If true, this takes precedence over the setting of unicodeQuotes;
if both are false, no mapping is done.
- unicodeQuotes: Whether to map quotes to the range U+2018 to U+201D,
the preferred unicode encoding of single and double quotes.
- ptb3Ellipsis: Whether to map ellipses to three dots (...), the old PTB3 WSJ coding
of an ellipsis. If true, this takes precedence over the setting of
unicodeEllipsis; if both are false, no mapping is done.
- unicodeEllipsis: Whether to map dot and optional space sequences to
U+2026, the Unicode ellipsis character
- ptb3Dashes: Whether to turn various dash characters into "--",
the dominant encoding of dashes in the PTB3 WSJ
- escapeForwardSlashAsterisk: Whether to put a backslash escape in front
of / and * as the old PTB3 WSJ does for some reason (something to do
with Lisp readers??).
- untokenizable: What to do with untokenizable characters (ones not
known to the tokenizer). Six options combining whether to log a
warning for none, the first, or all, and whether to delete them or
to include them as single character tokens in the output: noneDelete,
firstDelete, allDelete, noneKeep, firstKeep, allKeep.
The default is "firstDelete".
- strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3
WSJ tokenization in two cases. Setting this improves compatibility
for those cases. They are: (i) When an acronym is followed by a
sentence end, such as "U.S." at the end of a sentence, the PTB3
has tokens of "U.S" and ".", while by default PTBTokenizer duplicates
the period returning tokens of "U.S." and ".", and (ii) PTBTokenizer
will return numbers with a whole number and a fractional part like
"5 7/8" as a single token (with a non-breaking space in the middle),
while the PTB3 separates them into two tokens "5" and "7/8".
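The options above are combined in the comma-separated options String passed to the recommended constructor. A minimal sketch (the demo class name and sample sentence are illustrative, and the Stanford CoreNLP jar is assumed to be on the classpath):

```java
import java.io.Reader;
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Illustrative demo class (not part of the library).
public class PTBOptionsDemo {
  public static void main(String[] args) {
    Reader r = new StringReader("The U.S. economy (so-called) grew 5 7/8%.");
    // "invertible" is only valid with a CoreLabelTokenFactory;
    // the options String is a comma-separated list of key[=value] flags.
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        r, new CoreLabelTokenFactory(), "invertible,normalizeParentheses=true");
    while (tokenizer.hasNext()) {
      CoreLabel token = tokenizer.next();
      System.out.println(token.word());
    }
  }
}
```

With normalizeParentheses=true, the round parentheses in the sample should come back as the tokens -LRB- and -RRB-.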
- Author:
- Tim Grow (his tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l)
- Teg Grenager (grenager@stanford.edu)
- Jenny Finkel (integrating in invertible PTB tokenizer)
- Christopher Manning (redid API, added many options, maintenance)
Constructor Summary
PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options)
Constructs a new PTBTokenizer with a custom LexedTokenFactory.
Method Summary
static TokenizerFactory<Word> factory()
static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeNLs, LexedTokenFactory<T> factory)
static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, java.lang.String options)
Get a TokenizerFactory that does Penn Treebank tokenization.
protected T getNext()
Internally fetches the next token.
static java.lang.String labelList2Text(java.util.List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words.
static void main(java.lang.String[] args)
Reads files named as arguments and prints their tokens, by default one per line.
static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r)
Constructs a new PTBTokenizer that returns Word tokens and treats carriage returns as normal whitespace.
static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs)
Constructs a new PTBTokenizer that optionally returns newlines as their own token.
static PTBTokenizer<CoreLabel> newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible)
Constructs a new PTBTokenizer that makes CoreLabel tokens.
static java.lang.String ptb2Text(java.util.List<java.lang.String> ptbWords)
Returns a presentable version of the given PTB-tokenized words.
static int ptb2Text(java.io.Reader ptbText, java.io.Writer w)
Writes a presentable version of the given PTB-tokenized text.
static java.lang.String ptb2Text(java.lang.String ptbText)
Returns a presentable version of the given PTB-tokenized text.
static java.lang.String ptbToken2Text(java.lang.String ptbText)
Returns a presentable version of a given PTB token.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
PTBTokenizer
public PTBTokenizer(java.io.Reader r,
LexedTokenFactory<T> tokenFactory,
java.lang.String options)
- Constructs a new PTBTokenizer with a custom LexedTokenFactory.
Many options for tokenization and what is returned can be set via
the options String. See the class documentation for details on
the options String. This is the new recommended constructor!
- Parameters:
- r - The Reader to read tokens from.
- tokenFactory - The LexedTokenFactory to use to create tokens from the text.
- options - Options to the lexer. See the extensive documentation
in the class javadoc. The String may be null or empty,
which means that all traditional PTB normalizations are
done. You can pass in "ptb3Escaping=false" and have no
normalizations done (that is, the behavior of the old
suppressEscaping=true option).
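One use of this constructor with the invertible option is reconstructing the exact input text from the token stream via the BeforeAnnotation, AfterAnnotation, and OriginalTextAnnotation fields. A minimal sketch (the demo class name and sample sentence are illustrative; assumes the Stanford CoreNLP jar on the classpath):

```java
import java.io.Reader;
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Illustrative demo class (not part of the library).
public class InvertibleDemo {
  public static void main(String[] args) {
    String original = "Dogs  (big ones) can't run.";
    Reader r = new StringReader(original);
    PTBTokenizer<CoreLabel> tokenizer =
        new PTBTokenizer<>(r, new CoreLabelTokenFactory(), "invertible");
    StringBuilder sb = new StringBuilder();
    boolean first = true;
    while (tokenizer.hasNext()) {
      CoreLabel token = tokenizer.next();
      if (first) {
        sb.append(token.before());  // whitespace before the first token
        first = false;
      }
      // originalText() is the unnormalized form; after() is trailing whitespace.
      sb.append(token.originalText()).append(token.after());
    }
    // sb.toString() should now equal the original input String.
    System.out.println(sb);
  }
}
```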
newPTBTokenizer
public static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r)
- Constructs a new PTBTokenizer that returns Word tokens and which treats
carriage returns as normal whitespace.
- Parameters:
- r - The Reader whose contents will be tokenized
- Returns:
- A PTBTokenizer that tokenizes a stream to objects of type
Word
newPTBTokenizer
public static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r,
boolean tokenizeNLs)
- Constructs a new PTBTokenizer that optionally returns newlines
as their own token. NLs come back as Words whose text is
the value of PTBLexer.NEWLINE_TOKEN.
- Parameters:
- r - The Reader to read tokens from
- tokenizeNLs - Whether to return newlines as separate tokens
(otherwise they normally disappear as whitespace)
- Returns:
- A PTBTokenizer which returns Word tokens
newPTBTokenizer
public static PTBTokenizer<CoreLabel> newPTBTokenizer(java.io.Reader r,
boolean tokenizeNLs,
boolean invertible)
- Constructs a new PTBTokenizer that makes CoreLabel tokens.
It optionally returns newlines as their own token. Newlines come back
as CoreLabels whose text is the value of PTBLexer.NEWLINE_TOKEN.
- Parameters:
- r - The Reader to read tokens from
- tokenizeNLs - Whether to return newlines as separate tokens
(otherwise they normally disappear as whitespace)
- invertible - If set to true, will produce CoreLabels which
have fields for the string before and after the token, and the
character offsets
- Returns:
- A PTBTokenizer which returns CoreLabel objects
getNext
protected T getNext()
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<T extends HasWord>
- Returns:
- the next token in the token stream, or null if none exists.
ptb2Text
public static java.lang.String ptb2Text(java.lang.String ptbText)
- Returns a presentable version of the given PTB-tokenized text.
PTB tokenization splits up punctuation and does various other things
that makes simply joining the tokens with spaces look bad. So join
the tokens with space and run it through this method to produce nice
looking text. It's not perfect, but it works pretty well.
- Parameters:
ptbText
- A String in PTB3-escaped form
- Returns:
- An approximation to the original String
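As a brief sketch (the demo class name and input String are illustrative; assumes the Stanford CoreNLP jar on the classpath):

```java
import edu.stanford.nlp.process.PTBTokenizer;

// Illustrative demo class (not part of the library).
public class Ptb2TextDemo {
  public static void main(String[] args) {
    // Input is a space-joined sequence of PTB3-escaped tokens.
    String pretty = PTBTokenizer.ptb2Text("-LRB- Well , that was n't cheap ! -RRB-");
    // Should read roughly like "(Well, that wasn't cheap!)",
    // rather than the ugly plain space join.
    System.out.println(pretty);
  }
}
```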
ptbToken2Text
public static java.lang.String ptbToken2Text(java.lang.String ptbText)
- Returns a presentable version of a given PTB token. For instance,
it transforms -LRB- into (.
ptb2Text
public static int ptb2Text(java.io.Reader ptbText,
java.io.Writer w)
throws java.io.IOException
- Writes a presentable version of the given PTB-tokenized text.
PTB tokenization splits up punctuation and does various other things
that makes simply joining the tokens with spaces look bad. So join
the tokens with space and run it through this method to produce nice
looking text. It's not perfect, but it works pretty well.
- Throws:
java.io.IOException
ptb2Text
public static java.lang.String ptb2Text(java.util.List<java.lang.String> ptbWords)
- Returns a presentable version of the given PTB-tokenized words.
Pass in a List of Strings and this method will join the words with
spaces and call ptb2Text(String) on the output.
- Parameters:
ptbWords
- A list of String
- Returns:
- A presentable version of the given PTB-tokenized words
labelList2Text
public static java.lang.String labelList2Text(java.util.List<? extends HasWord> ptbWords)
- Returns a presentable version of the given PTB-tokenized words.
Pass in a List of Words or a Document and this method will join the
words with spaces and call ptb2Text(String) on the output. This method
will take the word() values to prevent additional text from creeping
in (e.g., POS tags).
- Parameters:
ptbWords
- A list of HasWord objects
- Returns:
- A presentable version of the given PTB-tokenized words
factory
public static TokenizerFactory<Word> factory()
factory
public static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeNLs,
LexedTokenFactory<T> factory)
factory
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs,
boolean invertible)
factory
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory,
java.lang.String options)
- Get a TokenizerFactory that does Penn Treebank tokenization.
This is now the recommended factory method to use.
- Type Parameters:
- T - The type of the tokens built by the LexedTokenFactory
- Parameters:
- factory - A TokenFactory that determines what form of token is returned by the Tokenizer
- options - A String specifying options (see the class javadoc for details)
- Returns:
- A TokenizerFactory that does Penn Treebank tokenization
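A hedged sketch of using this factory method (the demo class name and sample text are illustrative; import paths assume the standard CoreNLP distribution):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

// Illustrative demo class (not part of the library).
public class FactoryDemo {
  public static void main(String[] args) {
    // ptb3Escaping=false disables all the traditional PTB3 normalizations.
    TokenizerFactory<CoreLabel> tf =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "ptb3Escaping=false");
    Tokenizer<CoreLabel> tokenizer =
        tf.getTokenizer(new StringReader("Unnormalized (raw) text."));
    List<CoreLabel> tokens = tokenizer.tokenize();
    for (CoreLabel t : tokens) {
      System.out.println(t.word());
    }
  }
}
```

A factory is preferable to a bare constructor when the same configuration must tokenize many Readers, e.g. one per document in a corpus.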
main
public static void main(java.lang.String[] args)
throws java.io.IOException
- Reads files named as arguments and prints their tokens, by default
one per line. This is useful either for testing or to run
standalone to turn a corpus into a one-token-per-line file of tokens.
This main method assumes that input files are in UTF-8 encoding,
unless another encoding is specified via the -encoding option.
Usage:
java edu.stanford.nlp.process.PTBTokenizer [options] filename+
Options:
- -options options Set various tokenization options
(see the documentation in the class javadoc)
- -preserveLines Produce space-separated tokens that preserve the
original line breaks, rather than one token per line
- -encoding encoding Specifies a character encoding
- -lowerCase Lowercase all tokens (on tokenization)
- -parseInside regex Names an XML-style tag or a regular expression
over such element names. The tokenizer will only tokenize inside elements
that match this name. (This is done by regex matching, not an XML
parser, but it works well for simple XML documents, or other SGML-style
documents, such as Linguistic Data Consortium releases, which adopt
the convention that a line of a file is either XML markup or
character data but never both.)
- -ioFileList file* The remaining command-line arguments are treated as
filenames that themselves contain lists of pairs of input-output
filenames (2 column, whitespace separated).
- -dump Print the whole of each CoreLabel, not just the value (word)
- -untok Heuristically untokenize tokenized text
- -h Print usage info
- Parameters:
args
- Command line arguments
- Throws:
java.io.IOException
- If any file I/O problem
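A typical invocation might look like the following (the jar name and filenames are illustrative):

```shell
# Tokenize two files to stdout, one token per line, reading them as Latin-1.
# Assumes the Stanford CoreNLP jar is on the classpath.
java -cp stanford-corenlp.jar edu.stanford.nlp.process.PTBTokenizer \
  -encoding ISO-8859-1 -options "tokenizeNLs,americanize=false" \
  file1.txt file2.txt > tokens.txt
```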
Stanford NLP Group