PTBTokenizer (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.process.AbstractTokenizer<T>
- - edu.stanford.nlp.process.PTBTokenizer<T>

All Implemented Interfaces:

Tokenizer<T>, java.util.Iterator<T>
```
public class PTBTokenizer<T extends HasWord>
extends AbstractTokenizer<T>
```
A fast, rule-based tokenizer implementation that produces Penn Treebank style tokenization of English text. It was initially written to conform to Penn Treebank tokenization conventions over ASCII text, but now provides a range of tokenization options over a broader space of Unicode text. It reads raw text and outputs tokens of classes that implement edu.stanford.nlp.ling.HasWord (typically a Word or a CoreLabel). It can optionally return end-of-line as a token.
New code is encouraged to use the PTBTokenizer(Reader,LexedTokenFactory,String) constructor; the other constructors are historical. You specify the type of result tokens with a LexedTokenFactory, and control the treatment of tokens through a comma-separated String of options, most of them boolean (e.g., "invertible,normalizeParentheses=true"). If the String is null or empty, you get the traditional PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you want no normalization, pass in the String "ptb3Escaping=false". The known option names are:
1. invertible: Store enough information about the original form of the token and the whitespace around it that a list of tokens can be faithfully converted back to the original String. Valid only if the LexedTokenFactory is an instance of CoreLabelTokenFactory. The keys used in it are: TextAnnotation for the tokenized form, OriginalTextAnnotation for the original string, BeforeAnnotation and AfterAnnotation for the whitespace before and after a token, and perhaps CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation to record token begin/after end character offsets, if they were specified to be recorded in TokenFactory construction. (Like the String class, begin and end are done so end - begin gives the token length.) Default is false.
2. tokenizeNLs: Whether end-of-lines should become tokens (or just be treated as part of whitespace). Default is false.
3. tokenizePerLine: Run the tokenizer separately on each line of a file. This has the following consequences: (i) A token (currently only SGML tokens) cannot span multiple lines of the original input, and (ii) The tokenizer will not examine or wait for input from the next line before making tokenization decisions on this line. The latter property affects whether periods after acronyms are treated as end-of-sentence markers. Setting this to true is necessary to stop the tokenizer blocking and waiting for input after a newline is seen when the previous line ends with an abbreviation.
4. ptb3Escaping: Enable all traditional PTB3 token transforms (like parentheses becoming -LRB-, -RRB-). This is a macro flag that sets or clears all the options below. Note that because properties are set in a Map, if you specify both this flag and flags it sets, the resulting behaviour is non-deterministic (sorry!). (Default setting of the various properties below that this flag controls is equivalent to it being set to true.)
5. ud: [From CoreNLP 4.0] Enable options that make tokenization like what is used in UD v2. This is a macro flag that sets various of the options below. It ignores a value for this key. Note that because properties are set in a Map, if you specify both this flag and flags it sets, the resulting behaviour is non-deterministic (sorry!).
6. americanize: Whether to rewrite common British English spellings as American English spellings. (This is useful if your training material uses American English spelling, such as the Penn Treebank.) Default is true.
7. normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens. Default is true.
8. normalizeAmpersandEntity: Whether to map the XML entity &amp; to an ampersand (&). Default is true.
9. normalizeFractions: Whether to map certain common composed fraction characters to spelled out letter forms like "1/2". Default is true.
10. normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank. Default is true.
11. normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank. Default is true.
12. quotes: [From CoreNLP 4.0] Select a style of mapping quotes. An enum with possible values (case insensitive): latex, unicode, ascii, not_cp1252, original. "ascii" maps all quote characters to the traditional ' and ". "latex" maps quotes to ``, `, ', '', as in LaTeX and the PTB3 WSJ (though this is now heavily frowned on in Unicode). "unicode" maps quotes to the range U+2018 to U+201D, the preferred Unicode encoding of single and double quotes. "original" leaves all quotes as they were. "not_cp1252" only remaps invalid cp1252 quotes to Unicode. The default is "not_cp1252".
13. ellipses: [From CoreNLP 4.0] Select a style for mapping ellipses (3 dots). An enum with possible values (case insensitive): unicode, ptb3, not_cp1252, original. "ptb3" maps ellipses to three dots (...), the old PTB3 WSJ coding of an ellipsis. "unicode" maps three dot and space three dot sequences to U+2026, the Unicode ellipsis character. "not_cp1252" only remaps invalid cp1252 ellipses to unicode. "original" leaves all ellipses as they were. The default is "not_cp1252".
14. dashes: [From CoreNLP 4.0] Select a style for mapping dashes. An enum with possible values (case insensitive): unicode, ptb3, not_cp1252, original. "ptb3" maps dashes to "--", the most prevalent old PTB3 WSJ coding of a dash (though some are just "-" HYPHEN-MINUS). "unicode" maps "-", "--", and "---" HYPHEN-MINUS sequences and CP1252 dashes to Unicode en and em dashes. "not_cp1252" only remaps invalid cp1252 dashes to unicode. "original" leaves all dashes as they were. The default is "not_cp1252".
15. splitAssimilations: Whether to split assimilated forms: true tokenizes "gonna" as the two tokens "gon" and "na", false keeps "gonna" as a single token. Default is true.
16. escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??). Default is false. This flag is no longer set by ptb3Escaping.
17. normalizeCurrency: Whether to do some awful lossy currency mappings to turn common currency characters into $, #, or "cents", reflecting the fact that nothing else appears in the old PTB3 WSJ. (No Euro!) Default is false. (Note: The default was true through CoreNLP v3.8.0, but we're gradually inching our way towards the modern world!) This flag is no longer set by ptb3Escaping.
18. untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
19. strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this improves compatibility for those cases. They are: (i) When an acronym is followed by a sentence end, such as "Corp." at the end of a sentence, the PTB3 has tokens of "Corp" and ".", while by default PTBTokenizer duplicates the period, returning tokens of "Corp." and "."; and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token, with a non-breaking space in the middle, while the PTB3 separates them into two tokens "5" and "7/8". (Exception: for only "U.S." the treebank does have the two tokens "U.S." and "." like our default; strictTreebank3 now does that too.) The default is false.
20. strictAcronym: Controls only the acronym portion of strictTreebank3.
21. strictFraction: Controls only the fraction portion of strictTreebank3.
22. splitHyphenated: whether or not to tokenize segments of hyphenated words separately ("school" "-" "aged", "frog" "-" "lipped"), keeping the exceptions in Supplementary Guidelines for ETTB 2.0 by Justin Mott, Colin Warner, Ann Bies, Ann Taylor and CLEAR guidelines (Bracketing Biomedical Text) by Colin Warner et al. (2012). Default is false, which maintains old treebank tokenizer behavior.
23. splitForwardSlash: [From CoreNLP 4.0] Whether to tokenize segments of slashed tokens separately ("Asian" "/" "Indian", "and" "/" "or"). Default is false.
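As a rough illustration of the options String format described above, the following sketch assembles a comma-separated String from a map of option names to values. The class `PtbOptionsSketch` and its `build` method are hypothetical helpers written for this example and are not part of CoreNLP; the sketch assumes that a flag may be given either as a bare name or as `name=value`, as in the "invertible,normalizeParentheses=true" example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of CoreNLP): builds a comma-separated options String.
final class PtbOptionsSketch {

    // Emits "key=value" for each entry; a null value emits the bare key, as in "invertible".
    static String build(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add(e.getValue() == null ? e.getKey() : e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output String is predictable.
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("invertible", null);
        opts.put("normalizeParentheses", "true");
        opts.put("untokenizable", "noneKeep");
        System.out.println(build(opts));
        // prints: invertible,normalizeParentheses=true,untokenizable=noneKeep
    }
}
```

The resulting String would be passed as the options argument of the recommended constructor or factory method. Given the note on macro flags, it is probably wise not to combine ptb3Escaping or ud with any of the individual flags they control in the same String.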
A single instance of a PTBTokenizer is not thread safe, as it uses a non-threadsafe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a PTBTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
Author:

Tim Grow (his tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l), Teg Grenager (grenager@stanford.edu), Jenny Finkel (integrating in invertible PTB tokenizer), Christopher Manning (redid API, added many options, maintenance)

Nested Class Summary

Nested Classes
| Modifier and Type | Class and Description |
| --- | --- |
| `static class` | `PTBTokenizer.PTBTokenizerFactory<T extends HasWord>` This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader. |

Field Summary
- Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
  NEWLINE_TOKEN, nextToken

Constructor Summary

Constructors
| Constructor and Description |
| --- |
| `PTBTokenizer(java.io.Reader r, LexedTokenFactory<T> tokenFactory, java.lang.String options)` Constructs a new PTBTokenizer with a custom LexedTokenFactory. |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `static TokenizerFactory<CoreLabel>` | `coreLabelFactory()` |
| `static TokenizerFactory<CoreLabel>` | `coreLabelFactory(java.lang.String options)` |
| `static TokenizerFactory<Word>` | `factory()` This is a historical factory method for tokenizers that return Word tokens. |
| `static TokenizerFactory<CoreLabel>` | `factory(boolean tokenizeNLs, boolean invertible)` |
| `static <T extends HasWord> TokenizerFactory<T>` | `factory(LexedTokenFactory<T> factory, java.lang.String options)` Get a TokenizerFactory that does Penn Treebank tokenization. |
| `static java.lang.String` | `getNewlineToken()` Returns the string literal inserted for newlines when the -tokenizeNLs option is set. |
| `protected T` | `getNext()` Internally fetches the next token. |
| `static java.lang.String` | `labelList2Text(java.util.List<? extends HasWord> ptbWords)` Returns a presentable version of the given PTB-tokenized words. |
| `static void` | `main(java.lang.String[] args)` Reads files given as arguments and prints their tokens, by default as one per line. |
| `static PTBTokenizer<Word>` | `newPTBTokenizer(java.io.Reader r)` Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace. |
| `static PTBTokenizer<CoreLabel>` | `newPTBTokenizer(java.io.Reader r, boolean tokenizeNLs, boolean invertible)` Constructs a new PTBTokenizer that makes CoreLabel tokens. |
| `static java.lang.String` | `ptb2Text(java.util.List<java.lang.String> ptbWords)` Returns a presentable version of the given PTB-tokenized words. |
| `static long` | `ptb2Text(java.io.Reader ptbText, java.io.Writer w)` Writes a presentable version of the given PTB-tokenized text. |
| `static java.lang.String` | `ptb2Text(java.lang.String ptbText)` Returns a presentable version of the given PTB-tokenized text. |
| `static java.lang.String` | `ptbToken2Text(java.lang.String ptbText)` Returns a presentable version of a given PTB token. |

Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.Iterator
forEachRemaining

- Constructor Detail
  - PTBTokenizer
```
public PTBTokenizer(java.io.Reader r,
                    LexedTokenFactory<T> tokenFactory,
                    java.lang.String options)
```
    Constructs a new PTBTokenizer with a custom LexedTokenFactory. Many options for tokenization and what is returned can be set via the options String. See the class documentation for details on the options String. This is the new recommended constructor!
    
    Parameters:
    
    r - The Reader to read tokens from.
    
    tokenFactory - The LexedTokenFactory to use to create tokens from the text.
    
    options - Options to the lexer. See the extensive documentation in the class javadoc. The String may be null or empty, which means that all traditional PTB normalizations are done. You can pass in "ptb3Escaping=false" and have no normalizations done (that is, the behavior of the old suppressEscaping=true option).
- Method Detail
  - newPTBTokenizer
```
public static PTBTokenizer<Word> newPTBTokenizer(java.io.Reader r)
```
    Constructs a new PTBTokenizer that returns Word tokens and which treats carriage returns as normal whitespace.
    
    Parameters:
    
    r - The Reader whose contents will be tokenized
    
    Returns:
    
    A PTBTokenizer that tokenizes a stream to objects of type Word
  - newPTBTokenizer
```
public static PTBTokenizer<CoreLabel> newPTBTokenizer(java.io.Reader r,
                                                      boolean tokenizeNLs,
                                                      boolean invertible)
```
    Constructs a new PTBTokenizer that makes CoreLabel tokens. It optionally returns carriage returns as their own token. CRs come back as CoreLabels whose text is the value of AbstractTokenizer.NEWLINE_TOKEN.
    
    Parameters:
    
    r - The Reader to read tokens from
    
    tokenizeNLs - Whether to return newlines as separate tokens (otherwise they normally disappear as whitespace)
    
    invertible - if set to true, then will produce CoreLabels which will have fields for the string before and after, and the character offsets
    
    Returns:
    
    A PTBTokenizer which returns CoreLabel objects
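    The character offsets recorded for invertible tokens follow the String convention described in the class documentation: begin is inclusive and end is exclusive, so end - begin is the token length. The sketch below illustrates only that convention with plain Java strings; `OffsetSketch` is a hypothetical stand-in and does not use any CoreNLP class.

```java
// Hypothetical stand-in (not a CoreNLP class) illustrating half-open character offsets.
final class OffsetSketch {

    // Returns the text covered by offsets [begin, end), exactly as String.substring does.
    static String tokenText(String source, int begin, int end) {
        return source.substring(begin, end);
    }

    public static void main(String[] args) {
        String source = "Hello, world!";
        int begin = 7;  // index of 'w'
        int end = 12;   // index just past 'd'
        String token = tokenText(source, begin, end);
        System.out.println(token + " length=" + (end - begin));
        // prints: world length=5
    }
}
```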
  - getNext
```
protected T getNext()
```
    Internally fetches the next token.
    
    Specified by:
    
    getNext in class AbstractTokenizer<T extends HasWord>
    
    Returns:
    
    the next token in the token stream, or null if none exists.
  - getNewlineToken
```
public static java.lang.String getNewlineToken()
```
    Returns the string literal inserted for newlines when the -tokenizeNLs option is set.
    
    Returns:
    
    string literal inserted for "\n".
  - ptb2Text
```
public static java.lang.String ptb2Text(java.lang.String ptbText)
```
    Returns a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.
    Note: If your tokens have maintained the OriginalTextAnnotation and the BeforeAnnotation and the AfterAnnotation, then rather than doing this you can actually precisely reconstruct the text they were made from!
    
    Parameters:
    
    ptbText - A String in PTB3-escaped form
    
    Returns:
    
    An approximation to the original String
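    The note about precise reconstruction can be sketched as follows. This is an illustration under stated assumptions, not CoreNLP code: the nested class `Tok` is a hypothetical stand-in holding the three values an invertible token keeps (the whitespace before, the original text, the whitespace after), and the sketch assumes that the whitespace between two adjacent tokens is available as the "after" value of the first one, so only the first token's "before" value is needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not CoreNLP code) of rebuilding text from invertible token fields.
final class InvertibleSketch {

    // Stand-in for the before / original text / after values kept by an invertible token.
    static final class Tok {
        final String before;
        final String original;
        final String after;

        Tok(String before, String original, String after) {
            this.before = before;
            this.original = original;
            this.after = after;
        }
    }

    // Leading whitespace of the first token, then each token's original text and trailing whitespace.
    static String reconstruct(List<Tok> tokens) {
        if (tokens.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(tokens.get(0).before);
        for (Tok t : tokens) {
            sb.append(t.original).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok(" ", "(", ""),
            new Tok("", "Hi", " "),
            new Tok(" ", "there", ""),
            new Tok("", ")", "\n"));
        System.out.print(reconstruct(tokens)); // prints " (Hi there)" followed by a newline
    }
}
```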
  - ptbToken2Text
```
public static java.lang.String ptbToken2Text(java.lang.String ptbText)
```
    Returns a presentable version of a given PTB token. For instance, it transforms -LRB- into (.
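    To make the kind of mapping concrete, here is a minimal lookup-table sketch covering the conventional PTB bracket escapes. It is not the library's implementation, which may handle more cases; `BracketUnescapeSketch` and its `unescape` method are hypothetical names used only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the library's implementation) of undoing PTB bracket escapes.
final class BracketUnescapeSketch {

    private static final Map<String, String> BRACKETS = new HashMap<>();

    static {
        BRACKETS.put("-LRB-", "(");
        BRACKETS.put("-RRB-", ")");
        BRACKETS.put("-LSB-", "[");
        BRACKETS.put("-RSB-", "]");
        BRACKETS.put("-LCB-", "{");
        BRACKETS.put("-RCB-", "}");
    }

    // Returns the bracket character for a PTB escape, or the token unchanged otherwise.
    static String unescape(String token) {
        return BRACKETS.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        System.out.println(unescape("-LRB-")); // prints: (
        System.out.println(unescape("dog"));   // prints: dog
    }
}
```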
  - ptb2Text
```
public static long ptb2Text(java.io.Reader ptbText,
                            java.io.Writer w)
                     throws java.io.IOException
```
    Writes a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.
    
    Throws:
    
    java.io.IOException
  - ptb2Text
```
public static java.lang.String ptb2Text(java.util.List<java.lang.String> ptbWords)
```
    Returns a presentable version of the given PTB-tokenized words. Pass in a List of Strings and this method will join the words with spaces and call ptb2Text(String) on the output.
    
    Parameters:
    
    ptbWords - A list of String
    
    Returns:
    
    A presentable version of the given PTB-tokenized words
  - labelList2Text
```
public static java.lang.String labelList2Text(java.util.List<? extends HasWord> ptbWords)
```
    Returns a presentable version of the given PTB-tokenized words. Pass in a List of Words or a Document and this method will take the word() values (to prevent additional text from creeping in, e.g., POS tags), and call ptb2Text(String) on the output.
    
    Parameters:
    
    ptbWords - A list of HasWord objects
    
    Returns:
    
    A presentable version of the given PTB-tokenized words
  - factory
```
public static TokenizerFactory<Word> factory()
```
    This is a historical factory method for tokenizers that return Word tokens. Note that Word tokens don't support the extra fields needed to make an invertible tokenizer.
    
    Returns:
    
    A PTBTokenizerFactory that vends Word tokens.
  - factory
```
public static TokenizerFactory<CoreLabel> factory(boolean tokenizeNLs,
                                                  boolean invertible)
```
    Returns:
    
    A PTBTokenizerFactory that vends CoreLabel tokens.
  - coreLabelFactory
```
public static TokenizerFactory<CoreLabel> coreLabelFactory()
```
    Returns:
    
    A PTBTokenizerFactory that vends CoreLabel tokens with default tokenization.
  - coreLabelFactory
```
public static TokenizerFactory<CoreLabel> coreLabelFactory(java.lang.String options)
```
    Returns:
    
    A PTBTokenizerFactory that vends CoreLabel tokens, with tokenization configured by the given options String.
  - factory
```
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory,
                                                              java.lang.String options)
```
    Get a TokenizerFactory that does Penn Treebank tokenization. This is now the recommended factory method to use.
    
    Type Parameters:
    
    T - The type of the tokens built by the LexedTokenFactory
    
    Parameters:
    
    factory - A TokenFactory that determines what form of token is returned by the Tokenizer
    
    options - A String specifying options (see the class javadoc for details)
    
    Returns:
    
    A TokenizerFactory that does Penn Treebank tokenization
  - main
```
public static void main(java.lang.String[] args)
                 throws java.io.IOException
```
    Reads files given as arguments and prints their tokens, by default one per line. This is useful either for testing or to run standalone to turn a corpus into a one-token-per-line file of tokens. This main method assumes that the input file is in UTF-8 encoding, unless an encoding is specified.
    Usage: java edu.stanford.nlp.process.PTBTokenizer [options] filename+
    Options:
    - -options options Set various tokenization options (see the documentation in the class javadoc).
    - -preserveLines Produce space-separated tokens, except when the original had a line break, not one-token-per-line.
    - -oneLinePerElement Print the tokens of an element space-separated on one line. An "element" is either a file or one of the elements matched by the parseInside regex.
    - -blankLineAfterFiles Put a blank line after each input file (including after just a single input file but not at the end of input from stdin).
    - -filter regex Delete any token that matches() (in its entirety) the given regex.
    - -encoding encoding Specifies a character encoding. If you do not specify one, the default is utf-8 (not the platform default).
    - -lowerCase Lowercase all tokens (on tokenization).
    - -parseInside regex Names an XML-style element or a regular expression over such elements. The tokenizer will only tokenize inside elements that match this regex. (This is done by regex matching, not an XML parser, but works well for simple XML documents, or other SGML-style documents, such as Linguistic Data Consortium releases, which adopt the convention that a line of a file is either XML markup or character data but never both.)
    - -ioFileList file* The remaining command-line arguments are treated as filenames that themselves contain lists of pairs of input-output filenames (2 column, whitespace separated). Alternatively, if there is only one filename per line, the output filename is the input filename with ".tok" appended.
    - -fileList file* The remaining command-line arguments are treated as filenames that contain filenames, one per line. The output of tokenization is sent to stdout.
    - -dump Print the whole of each CoreLabel, not just the value (word).
    - -untok Heuristically untokenize tokenized text.
    - -h, -help Print usage info.
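    The "in its entirety" wording for -filter corresponds to the difference between Matcher.matches() and Matcher.find() in java.util.regex. The sketch below shows that semantics on a plain list of strings; `TokenFilterSketch` is a hypothetical helper for illustration, not the code used by main.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch (not the code in main) of whole-token regex filtering.
final class TokenFilterSketch {

    // Drops every token that the regex matches in its entirety; partial matches are kept.
    static List<String> filter(List<String> tokens, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!pattern.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("3", "3rd", "cats", "42");
        System.out.println(filter(tokens, "\\d+"));
        // prints: [3rd, cats]
    }
}
```

    Here "3rd" survives because the pattern matches only part of it, whereas a find()-style test would have removed it.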
    A note on -preserveLines: Basically, if you use this option, your output file should have the same number of lines as your input file. If not, there is a bug. But the truth of this statement depends on how you count lines…. Unicode includes "line separator" and "paragraph separator" characters and Unicode says that you should accept them. See e.g., http://unicode.org/standard/reports/tr13/tr13-5.html
    However, Unix, Linux utilities, etc. don't recognize them and count only the traditional \n|\r|\r\n. And PTBTokenizer does normalize line separation. Hence, if your input text contains, say U+2028 Line Separator characters, the Unix wc utility will report more lines after tokenization than before, even though line breaks have been preserved, according to Unicode. It may be useful to compare results with the Perl uniwc script from https://raw.githubusercontent.com/briandfoy/Unicode-Tussle/master/script/uniwc
    If it reports the same number of input and output lines, then this difference is your problem, and in a certain Unicode sense, our tokenizer did indeed preserve the line count. If not, please send us a bug report. At present there is no way to disable this processing of Unicode separator characters. If you don't want this anomaly, you'll need to either delete these two characters or map them to conventional Unix newline characters. Or to some other weirdo character.
    Parameters:
    
    args - Command line arguments
    
    Throws:
    
    java.io.IOException - If any file I/O problem
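    The line-count discrepancy discussed in the note on -preserveLines can be reproduced with a small counter. This sketch is independent of PTBTokenizer: `LineCountSketch` is a hypothetical helper that counts the traditional line breaks (\n, \r, \r\n) and, optionally, the Unicode U+2028 Line Separator and U+2029 Paragraph Separator as well.

```java
// Hypothetical helper (independent of PTBTokenizer) for comparing two ways of counting lines.
final class LineCountSketch {

    // Counts \n, \r and \r\n as one break each; if unicodeAware, also counts U+2028 and U+2029.
    static int countLineBreaks(String text, boolean unicodeAware) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                count++;
                // Treat a following '\n' as part of the same break.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else if (c == '\n') {
                count++;
            } else if (unicodeAware && (c == '\u2028' || c == '\u2029')) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\u2028two\nthree\n";
        System.out.println(countLineBreaks(text, false)); // prints: 2
        System.out.println(countLineBreaks(text, true));  // prints: 3
    }
}
```

    If the Unicode-aware counts of the input and output agree while the traditional counts differ, the difference is most likely due to separator characters of this kind.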
