edu.stanford.nlp.process
Class PTBTokenizer<T extends HasWord>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<T>
      extended by edu.stanford.nlp.process.PTBTokenizer<T>
All Implemented Interfaces:
Tokenizer<T>, Iterator<T>

public class PTBTokenizer<T extends HasWord>
extends AbstractTokenizer<T>

Fast, rule-based tokenizer implementation, initially written to conform to the Penn Treebank tokenization conventions, but now providing a range of tokenization options over a broader space of Unicode text. It reads raw text and outputs tokens of classes that implement edu.stanford.nlp.trees.HasWord (typically a Word or a CoreLabel). It can optionally return end-of-line as a token.

New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String) constructor. The other constructors are historical. You specify the type of result tokens with a LexedTokenFactory, and you can control the treatment of tokens via mainly boolean options given in a comma-separated options String (e.g., "invertible,normalizeParentheses=true"). If the String is null or empty, you get the traditional PTB3 normalization behaviour (i.e., ptb3Escaping=true). If you want no normalization, pass in the String "ptb3Escaping=false". The known option names are:

  1. invertible: Store enough information about the original form of the token and the whitespace around it that a list of tokens can be faithfully converted back to the original String. Valid only if the LexedTokenFactory is an instance of CoreLabelTokenFactory. The keys used are: TextAnnotation for the tokenized form, OriginalTextAnnotation for the original string, BeforeAnnotation and AfterAnnotation for the whitespace before and after a token, and perhaps CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation to record token begin and end character offsets, if they were specified to be recorded in TokenFactory construction. (As with the String class, offsets are arranged so that end - begin gives the token length.)
  2. tokenizeNLs: Whether end-of-lines should become tokens (or just be treated as part of whitespace)
  3. ptb3Escaping: Enable all traditional PTB3 token transforms (like parentheses becoming -LRB-, -RRB-). This is a macro flag that sets or clears all the options below.
  4. americanize: Whether to rewrite common British English spellings as American English spellings
  5. normalizeSpace: Whether any spaces inside tokens (e.g., in phone numbers or fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens.
  6. normalizeAmpersandEntity: Whether to map the XML entity &amp; to an ampersand character (&)
  7. normalizeCurrency: Whether to do some awful lossy currency mappings to turn common currency characters into $, #, or "cents", reflecting the fact that nothing else appears in the old PTB3 WSJ. (No Euro!)
  8. normalizeFractions: Whether to map certain common composed fraction characters to spelled out letter forms like "1/2"
  9. normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
  10. normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
  11. asciiQuotes: Whether to map quote characters to the traditional ASCII ' and "
  12. latexQuotes: Whether to map quotes to ``, `, ', '', as in LaTeX and the PTB3 WSJ (though this is now heavily frowned on in Unicode). If true, this takes precedence over the setting of unicodeQuotes; if both are false, no mapping is done.
  13. unicodeQuotes: Whether to map quotes to the range U+2018 to U+201D, the preferred unicode encoding of single and double quotes.
  14. ptb3Ellipsis: Whether to map ellipses to three dots (...), the old PTB3 WSJ coding of an ellipsis. If true, this takes precedence over the setting of unicodeEllipsis; if both are false, no mapping is done.
  15. unicodeEllipsis: Whether to map dot and optional space sequences to U+2026, the Unicode ellipsis character
  16. ptb3Dashes: Whether to turn various dash characters into "--", the dominant encoding of dashes in the PTB3 WSJ
  17. escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
  18. untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
  19. strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this improves compatibility for those cases. They are: (i) When an acronym is followed by a sentence end, such as "U.S." at the end of a sentence, the PTB3 has tokens of "U.S" and ".", while by default PTBTokenizer duplicates the period returning tokens of "U.S." and ".", and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token (with a non-breaking space in the middle), while the PTB3 separates them into two tokens "5" and "7/8".
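The recommended constructor and options String can be combined as follows. This is a minimal sketch, assuming Stanford CoreNLP is on the classpath; the input text is a made-up example.

```java
import java.io.Reader;
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeExample {
    public static void main(String[] args) {
        // Any Reader works; a StringReader keeps the example self-contained.
        Reader r = new StringReader("The \"strange\" U.S. deficit (yes!) grew.");
        // invertible is only valid with a CoreLabelTokenFactory.
        PTBTokenizer<CoreLabel> tokenizer =
            new PTBTokenizer<>(r, new CoreLabelTokenFactory(),
                               "invertible,normalizeParentheses=true");
        while (tokenizer.hasNext()) {
            CoreLabel token = tokenizer.next();
            // word() is the (possibly normalized) token text;
            // with normalizeParentheses=true, "(" comes back as -LRB-.
            System.out.println(token.word());
        }
    }
}
```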

Author:
Tim Grow (his tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l), Teg Grenager (grenager@stanford.edu), Jenny Finkel (integrating in invertible PTB tokenizer), Christopher Manning (redid API, added many options, maintenance)

Nested Class Summary
static class PTBTokenizer.PTBTokenizerFactory<T extends HasWord>
          This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader.
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
PTBTokenizer(Reader r, LexedTokenFactory<T> tokenFactory, String options)
          Constructs a new PTBTokenizer with a custom LexedTokenFactory.
 
Method Summary
static TokenizerFactory<Word> factory()
           
static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs, boolean invertible)
           
static
<T extends HasWord>
TokenizerFactory<T>
factory(boolean tokenizeNLs, LexedTokenFactory<T> factory)
           
static
<T extends HasWord>
TokenizerFactory<T>
factory(LexedTokenFactory<T> factory, String options)
          Get a TokenizerFactory that does Penn Treebank tokenization.
protected  T getNext()
          Internally fetches the next token.
static String labelList2Text(List<? extends HasWord> ptbWords)
          Returns a presentable version of the given PTB-tokenized words.
static void main(String[] args)
          Reads files named as arguments and prints their tokens, by default one per line.
static PTBTokenizer<Word> newPTBTokenizer(Reader r)
          Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace.
static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeNLs)
          Constructs a new PTBTokenizer that optionally returns newlines as their own token.
static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeNLs, boolean invertible)
          Constructs a new PTBTokenizer that makes CoreLabel tokens.
static String ptb2Text(List<String> ptbWords)
          Returns a presentable version of the given PTB-tokenized words.
static int ptb2Text(Reader ptbText, Writer w)
          Writes a presentable version of the given PTB-tokenized text.
static String ptb2Text(String ptbText)
          Returns a presentable version of the given PTB-tokenized text.
static String ptbToken2Text(String ptbText)
          Returns a presentable version of a given PTB token.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PTBTokenizer

public PTBTokenizer(Reader r,
                    LexedTokenFactory<T> tokenFactory,
                    String options)
Constructs a new PTBTokenizer with a custom LexedTokenFactory. Many options for tokenization and what is returned can be set via the options String. See the class documentation for details on the options String. This is the new recommended constructor!

Parameters:
r - The Reader to read tokens from.
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
options - Options to the lexer. See the extensive documentation in the class javadoc. The String may be null or empty, which means that all traditional PTB normalizations are done. You can pass in "ptb3Escaping=false" and have no normalizations done (that is, the behavior of the old suppressEscaping=true option).
Method Detail

newPTBTokenizer

public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace.

Parameters:
r - The Reader whose contents will be tokenized
Returns:
A PTBTokenizer that tokenizes a stream to objects of type Word

newPTBTokenizer

public static PTBTokenizer<Word> newPTBTokenizer(Reader r,
                                                 boolean tokenizeNLs)
Constructs a new PTBTokenizer that optionally returns newlines as their own token. NLs come back as Words whose text is the value of PTBLexer.NEWLINE_TOKEN.

Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
Returns:
A PTBTokenizer which returns Word tokens

newPTBTokenizer

public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r,
                                                      boolean tokenizeNLs,
                                                      boolean invertible)
Constructs a new PTBTokenizer that makes CoreLabel tokens. It optionally returns newlines as their own token. Newlines come back as CoreLabels whose text is the value of PTBLexer.NEWLINE_TOKEN.

Parameters:
r - The Reader to read tokens from
tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
invertible - If true, produces CoreLabels with fields for the text before and after each token, and for character offsets
Returns:
A PTBTokenizer which returns CoreLabel objects
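With invertible set, the CoreLabels carry enough annotations to reconstruct the original text. The sketch below (assuming Stanford CoreNLP on the classpath) rebuilds the input by concatenating each token's BeforeAnnotation and OriginalTextAnnotation.

```java
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.PTBTokenizer;

public class InvertibleExample {
    public static void main(String[] args) {
        String text = "Don't panic  (yet).";
        PTBTokenizer<CoreLabel> tok =
            PTBTokenizer.newPTBTokenizer(new StringReader(text), false, true);
        StringBuilder rebuilt = new StringBuilder();
        while (tok.hasNext()) {
            CoreLabel t = tok.next();
            // BeforeAnnotation is the whitespace preceding the token;
            // OriginalTextAnnotation is the token's unnormalized surface form.
            rebuilt.append(t.get(CoreAnnotations.BeforeAnnotation.class));
            rebuilt.append(t.get(CoreAnnotations.OriginalTextAnnotation.class));
        }
        // Note: trailing whitespace lives in the last token's AfterAnnotation,
        // which this sketch does not append.
        System.out.println(rebuilt);
    }
}
```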

getNext

protected T getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
Returns:
the next token in the token stream, or null if none exists.

ptb2Text

public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.

Parameters:
ptbText - A String in PTB3-escaped form
Returns:
An approximation to the original String
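For instance, a PTB-escaped string can be detokenized like this (a sketch assuming Stanford CoreNLP on the classpath; the input string is a made-up example):

```java
import edu.stanford.nlp.process.PTBTokenizer;

public class Ptb2TextExample {
    public static void main(String[] args) {
        // Space-joined, PTB3-escaped tokens: escaped brackets and quotes.
        String ptb = "She said , `` Hello -LRB- again -RRB- ! ''";
        // ptb2Text undoes the escaping and fixes spacing around punctuation.
        System.out.println(PTBTokenizer.ptb2Text(ptb));
    }
}
```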

ptbToken2Text

public static String ptbToken2Text(String ptbText)
Returns a presentable version of a given PTB token. For instance, it transforms -LRB- into (.


ptb2Text

public static int ptb2Text(Reader ptbText,
                           Writer w)
                    throws IOException
Writes a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.

Throws:
IOException

ptb2Text

public static String ptb2Text(List<String> ptbWords)
Returns a presentable version of the given PTB-tokenized words. Pass in a List of Strings and this method will join the words with spaces and call ptb2Text(String) on the output.

Parameters:
ptbWords - A list of String
Returns:
A presentable version of the given PTB-tokenized words

labelList2Text

public static String labelList2Text(List<? extends HasWord> ptbWords)
Returns a presentable version of the given PTB-tokenized words. Pass in a List of Words or a Document and this method will join the words with spaces and call ptb2Text(String) on the output. This method will take the word() values to prevent additional text from creeping in (e.g., POS tags).

Parameters:
ptbWords - A list of HasWord objects
Returns:
A presentable version of the given PTB-tokenized words

factory

public static TokenizerFactory<Word> factory()

factory

public static <T extends HasWord> TokenizerFactory<T> factory(boolean tokenizeNLs,
                                                              LexedTokenFactory<T> factory)

factory

public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs,
                                                  boolean invertible)

factory

public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory,
                                                              String options)
Get a TokenizerFactory that does Penn Treebank tokenization. This is now the recommended factory method to use.

Type Parameters:
T - The type of the tokens built by the LexedTokenFactory
Parameters:
factory - A TokenFactory that determines what form of token is returned by the Tokenizer
options - A String specifying options (see the class javadoc for details)
Returns:
A TokenizerFactory that does Penn Treebank tokenization
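A TokenizerFactory is the natural way to tokenize many documents with one configuration. A minimal sketch, assuming Stanford CoreNLP on the classpath:

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

public class FactoryExample {
    public static void main(String[] args) {
        // Build the factory once; options apply to every tokenizer it vends.
        TokenizerFactory<CoreLabel> tf =
            PTBTokenizer.factory(new CoreLabelTokenFactory(),
                                 "untokenizable=noneKeep");
        // Vend a tokenizer per document (here, per String).
        Tokenizer<CoreLabel> tok = tf.getTokenizer(new StringReader("A test."));
        List<CoreLabel> tokens = tok.tokenize();
        for (CoreLabel t : tokens) {
            System.out.println(t.word());
        }
    }
}
```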

main

public static void main(String[] args)
                 throws IOException
Reads files named as arguments and prints their tokens, by default one per line. This is useful either for testing or to run standalone to turn a corpus into a one-token-per-line file of tokens. This main method assumes that input files are in UTF-8 encoding unless another encoding is specified.

Usage: java edu.stanford.nlp.process.PTBTokenizer [options] filename+

Options:

Parameters:
args - Command line arguments
Throws:
IOException - If any file I/O problem


Stanford NLP Group