The Stanford Natural Language Processing Group

Stanford Tokenizer

About

A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name, though over time the tokenizer has added quite a few options and a fair amount of Unicode compatibility. In general it will work well over any Unicode text that does not require word segmentation (as writing systems that do not put spaces between words do) or more exotic language-particular rules (as writing systems that use : or ? as a character inside words do). In 2017 it was upgraded to support non-Basic Multilingual Plane Unicode, in particular to support emoji. 😍

We also have corresponding tokenizers FrenchTokenizer and SpanishTokenizer for French and Spanish. We use the Stanford Word Segmenter for languages like Chinese and Arabic. An ancillary tool, DocumentPreprocessor, uses this tokenization to provide the ability to split text into sentences. PTBTokenizer mainly targets formal English writing rather than SMS-speak.

PTBTokenizer is an efficient, fast, deterministic tokenizer. (For the more technically inclined, it is implemented as a finite automaton, produced by JFlex.) While deterministic, it uses some quite good heuristics, so it can usually decide when single quotes are parts of words, when periods do and don't imply sentence boundaries, etc. Sentence splitting is a deterministic consequence of tokenization: a sentence ends when a sentence-ending character (., !, or ?) is found which is not grouped with other characters into a token (such as for an abbreviation or number), though the sentence may still include a few tokens that can follow a sentence-ending character as part of the same sentence (such as quotes and brackets).
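The sentence-splitting rule described above can be illustrated with a small JDK-only sketch that works on an already tokenized list. It is not the JFlex-based implementation: the class name SentenceSplitSketch, the method split, and the particular set of "follower" tokens are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the Stanford distribution.
class SentenceSplitSketch {

  // A sentence ends only when one of these stands alone as a token.
  private static final Set<String> ENDERS =
      new HashSet<>(Arrays.asList(".", "!", "?"));

  // Tokens allowed to trail a sentence ender inside the same sentence.
  // This list is an assumption of the sketch, not the tokenizer's own list.
  private static final Set<String> FOLLOWERS =
      new HashSet<>(Arrays.asList("''", "'", "\"", ")", "]", "}", "-RRB-"));

  static List<List<String>> split(List<String> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    boolean ended = false; // true once the current sentence has seen an ender
    for (String token : tokens) {
      // The first ordinary token after an ender starts a new sentence.
      if (ended && !FOLLOWERS.contains(token) && !ENDERS.contains(token)) {
        sentences.add(current);
        current = new ArrayList<>();
        ended = false;
      }
      current.add(token);
      if (ENDERS.contains(token)) {
        ended = true;
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "Heinz", "Ltd.", "sold", "it", ".",
        "It", "'s", "done", "!", "''",
        "A", "lawyer", "sued", ".");
    for (List<String> sentence : split(tokens)) {
      System.out.println(String.join(" ", sentence));
    }
  }
}
```

Because "Ltd." is a single token, its period is grouped with other characters and does not end the sentence, whereas the closing quotes after "!" stay with the sentence they follow. The sample input therefore splits into three sentences.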

PTBTokenizer has been developed by Christopher Manning, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer.

Obtaining

The Stanford Tokenizer is not distributed separately but is included in several of our software downloads, including the Stanford Parser, Stanford Part-of-Speech Tagger, Stanford Named Entity Recognizer, and Stanford CoreNLP. Choose a tool, download it, and you're ready to go. See these software packages for details on software licenses.

Usage

The tokenizer requires Java (now, Java 8). As well as API access, the program includes an easy-to-use command-line interface, PTBTokenizer. For the examples below, we assume you have set up your CLASSPATH to find PTBTokenizer, for example with a command like the following (the details depend on your operating system and shell):

export CLASSPATH=stanford-parser.jar

You can also specify this on each command line by adding -cp stanford-parser.jar after java.

Command-line usage

The basic operation is to convert a plain text file into a sequence of tokens, which are printed out one per line. Here is an example (on Unix):

$ cat >sample.txt
"Oh, no," she's saying, "our $400 blender can't handle something this hard!"
$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
``
Oh
,
no
,
''
she
's
saying
,
``
our
$
400
blender
ca
n't
handle
something
this
hard
!
''
PTBTokenizer tokenized 23 tokens at 370.97 tokens per second.

Here, we gave a filename argument which contained the text. PTBTokenizer can also read from a gzip-compressed file or a URL, or it can run as a filter, reading from stdin. There are a bunch of other things it can do, using command-line flags.

Command-line/main method flags

-encoding charset The character set encoding. By default, it assumes utf-8, but you can tell it to use another character encoding.
-preserveLines Keep the input line breaks rather than changing things to one token per line.
-oneLinePerElement Print the tokens of an element space-separated on one line. An "element" is either an XML element matched by the parseInside regular expression or else is the entire contents of a file, if there is no such expression specified.
-parseInside regex Only tokenize information inside the SGML/XML elements which match the regex. This is regex-based matching of SGML/XML, and so isn't perfect, but works perfectly well with the simple SGML/XML of LDC corpora such as English Gigaword (for which the regex you'll probably want is "HEADLINE|P").
-filter regex Delete any token that matches() (in its entirety) the given regex.
-lowerCase Lowercase tokens (using English conventions) prior to printing out.
-dump Print out everything about each token. (Find out how we really represent tokens!)
-options optionString Lets you set a bunch of options that affect tokenization; see below.
-ioFileList file+ Treat the files on the command line as files that themselves contain lists of files to process. These files should be formatted in two tab-separated columns of input files and corresponding output files.
-fileList file+ The remaining command-line arguments are treated as filenames that themselves contain filenames, one per line. The output of tokenization is sent to stdout.

-untok Makes a best effort attempt at undoing PTB tokenization. Slightly less perfect than the tokenization but not bad. It doesn't join tokens over newlines, though.

$ java edu.stanford.nlp.process.PTBTokenizer -preserveLines < sample.txt | java edu.stanford.nlp.process.PTBTokenizer -untok > roundtrip.txt
$ diff sample.txt roundtrip.txt
$

-help or -h Print some usage information.
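
The whole-token matching used by -filter can be easy to misread, so here is a minimal JDK-only sketch of that semantics. The class TokenFilterSketch and its filter method are hypothetical names for illustration; the sketch only mirrors the documented behavior (a token is dropped when the regex matches it in its entirety) and is not the tokenizer's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the Stanford distribution.
class TokenFilterSketch {

  // Keep only the tokens that the regex does NOT match in their entirety.
  // Matcher.matches() requires the whole token to match, unlike find().
  static List<String> filter(List<String> tokens, String regex) {
    Pattern pattern = Pattern.compile(regex);
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      if (!pattern.matcher(token).matches()) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("our", "$", "400", "blender", "ca", "n't");
    // "\\d+" matches the whole token "400", so that token is removed.
    System.out.println(filter(tokens, "\\d+"));
    // "n" occurs inside "blender" and "n't" but matches no whole token,
    // so nothing is removed.
    System.out.println(filter(tokens, "n"));
  }
}
```

To delete every token that merely contains a pattern, the regex has to cover the whole token, for example ".*n.*" rather than "n".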

Sentence segmentation

The output of PTBTokenizer can be post-processed to divide a text into sentences. One way to get sentence-split output from the command line is to call edu.stanford.nlp.process.DocumentPreprocessor. The other is to use the sentence splitter in CoreNLP. For example, using DocumentPreprocessor:

$ cat >sample.txt
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through
three rounds of last month's U.S. Open. H.J. Heinz Company said it
completed the sale of its Ore-Ida frozen-food business catering to the
service industry to McCain Foods Ltd. for about $500 million.
It's the first group action of its kind in Britain and one of
only a handful of lawsuits against tobacco companies outside the
U.S. A Paris lawyer last year sued France's Seita SA on behalf of
two cancer-stricken smokers. Japan Tobacco Inc. faces a suit from
five smokers who accuse the government-owned company of hooking
them on an addictive product.
$
$ java edu.stanford.nlp.process.DocumentPreprocessor sample.txt 
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through
three rounds of last month 's U.S. Open .
H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food
business catering to the service industry to McCain Foods Ltd. for about
$ 500 million .
It 's the first group action of its kind in Britain and one of only a
handful of lawsuits against tobacco companies outside the U.S. .
A Paris lawyer last year sued France 's Seita SA on behalf of two
cancer-stricken smokers .
Japan Tobacco Inc. faces a suit from five smokers who accuse the
government-owned company of hooking them on an addictive product .
Read in 5 sentences.

API usage

There are various ways to call the code, but here's a simple example to get started with, showing how to use either PTBTokenizer directly or DocumentPreprocessor.

import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizerDemo {

  public static void main(String[] args) throws IOException {
    for (String arg : args) {
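      // Option #1: by sentence. DocumentPreprocessor tokenizes the file
      // and yields one list of words per sentence.
      DocumentPreprocessor dp = new DocumentPreprocessor(arg);
      for (List<HasWord> sentence : dp) {
        System.out.println(sentence);
      }
      // Option #2: by token. The final argument is the tokenizer options
      // string (see Options); an empty string sets no options.
      PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader(arg),
              new CoreLabelTokenFactory(), "");
      while (ptbt.hasNext()) {
        CoreLabel label = ptbt.next();
        System.out.println(label);
      }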
    }
  }
}

Options

There are a number of options that affect how tokenization is performed. These can be specified on the command line, with the flag -options (or -tokenizerOptions in tools like the Stanford Parser) or in the constructor to PTBTokenizer or the factory methods in PTBTokenizerFactory. Here are the current options. They are specified as a single string, with options separated by commas, and values given in option=value syntax, for instance "americanize=false,unicodeQuotes=true,unicodeEllipsis=true".
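
To make the option-string syntax concrete, the following JDK-only sketch breaks such a string into name/value pairs. It is an illustration of the documented format, not the tokenizer's actual parser: the class OptionStringSketch and method parse are hypothetical, and treating a bare option name (no =value) as "true" is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Stanford distribution.
class OptionStringSketch {

  // Split "a=b,c=d" into an insertion-ordered map of option names to values.
  // Assumption of this sketch: a bare name without "=value" means "true".
  static Map<String, String> parse(String options) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String item : options.split(",")) {
      String trimmed = item.trim();
      if (trimmed.isEmpty()) {
        continue; // ignore empty input and stray commas
      }
      int eq = trimmed.indexOf('=');
      if (eq < 0) {
        parsed.put(trimmed, "true");
      } else {
        parsed.put(trimmed.substring(0, eq).trim(),
                   trimmed.substring(eq + 1).trim());
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    System.out.println(
        parse("americanize=false,unicodeQuotes=true,unicodeEllipsis=true"));
  }
}
```

Run on the example string above, this yields three entries: americanize mapped to false, and unicodeQuotes and unicodeEllipsis each mapped to true.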

invertible: Store enough information about the original form of the token and the whitespace around it that a list of tokens can be faithfully converted back to the original String. Valid only if the LexedTokenFactory is an instance of CoreLabelTokenFactory. The keys used are: TextAnnotation for the tokenized form, OriginalTextAnnotation for the original string, BeforeAnnotation and AfterAnnotation for the whitespace before and after a token, and perhaps BeginPositionAnnotation and EndPositionAnnotation to record token begin/after end character offsets, if they were specified to be recorded in TokenFactory construction. (Like the String class, begin and end are done so end - begin gives the token length.)
tokenizeNLs: Whether end-of-lines should become tokens (or just be treated as part of whitespace).
tokenizePerLine: Run the tokenizer separately on each line of a file. This has the following consequences: (i) A token (currently only SGML tokens) cannot span multiple lines of the original input, and (ii) The tokenizer will not examine/wait for input from the next line before making tokenization decisions on this line. The latter property affects whether periods after acronyms are treated as end-of-sentence markers. Use this option for strictly line-oriented processing: Having this true is necessary to stop the tokenizer blocking and waiting for input after a newline is seen when the previous line ends with an abbreviation.
ptb3Escaping: Enable all traditional PTB3 token transforms (like parentheses becoming -LRB-, -RRB-). This is a macro flag that sets or clears all the options below.
americanize: Whether to rewrite common British English spellings as American English spellings
normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens.
normalizeAmpersandEntity: Whether to map the XML entity &amp; to an ampersand.
normalizeCurrency: Whether to do some awful lossy currency mappings to turn common currency characters into $, #, or "cents", reflecting the fact that nothing else appears in the old PTB3 WSJ. (No Euro!)
normalizeFractions: Whether to map certain common composed fraction characters to spelled out letter forms like "1/2"
normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
keepAssimilations: true to tokenize "gonna", false to tokenize "gon na". Default is true.
dashes: How to handle dashes. dashes=PTB will turn dashes into "--", the dominant encoding of dashes in the PTB3 WSJ. The other values are UNICODE, NOT_CP1252, and ORIGINAL.
ellipses: Whether to map dot and optional space sequences to U+2026, the Unicode ellipsis character. Same options as dashes.
quotes: Whether to map to ``, `, ', '' for quotes, as in LaTeX and the PTB3 WSJ (though this is now heavily frowned on in Unicode). UNICODE maps quotes to U+2018 to U+201D, the preferred Unicode encoding of single and double quotes.
escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this improves compatibility for those cases. They are: (i) When an acronym or abbreviation is followed by a sentence end, such as "Corp." at the end of a sentence, the PTB3 has tokens of "Corp" and ".", while by default PTBTokenizer duplicates the period, returning tokens of "Corp." and ".", and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token, with a non-breaking space in the middle, while the PTB3 separates them into two tokens "5" and "7/8". (Exception: for only "U.S." the treebank does have the two tokens "U.S." and "." like our default; strictTreebank3 now does that too.) The default is false.
splitHyphenated: Whether or not to tokenize segments of hyphenated words separately ("school" "-" "aged", "frog" "-" "lipped"), keeping together the exceptions in Supplementary Guidelines for ETTB 2.0 by Justin Mott, Colin Warner, Ann Bies, Ann Taylor and CLEAR guidelines (Bracketing Biomedical Text) by Colin Warner et al. (2012). Default is currently false, which maintains old treebank tokenizer behavior. (This default will likely change in a future release.)
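
The idea behind the invertible option in the list above can be sketched without the library. The code below uses a hypothetical Token class whose fields stand in for the annotations the option names (original text, whitespace before and after, begin and end offsets), and it splits on whitespace only rather than doing real tokenization. It also assumes that neighboring tokens share their whitespace, so the original is rebuilt from the first token's "before" plus each token's original text and "after"; treat that as an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the Stanford distribution.
class InvertibleSketch {

  // Stand-in for a token carrying the fields an invertible tokenizer keeps.
  static final class Token {
    final String before;       // whitespace preceding the token
    final String originalText; // the token exactly as it appeared
    final String after;        // whitespace following the token
    final int begin;           // offset of the first character
    final int end;             // offset one past the last character

    Token(String before, String originalText, String after, int begin, int end) {
      this.before = before;
      this.originalText = originalText;
      this.after = after;
      this.begin = begin;
      this.end = end;
    }
  }

  // Whitespace-only splitting that records offsets and surrounding whitespace.
  static List<Token> tokenize(String text) {
    List<Token> tokens = new ArrayList<>();
    int n = text.length();
    int i = 0;
    while (i < n) {
      int wsStart = i;
      while (i < n && Character.isWhitespace(text.charAt(i))) i++;
      String before = text.substring(wsStart, i);
      if (i == n) break; // trailing whitespace is already the last token's "after"
      int begin = i;
      while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
      int end = i;
      int j = end;
      while (j < n && Character.isWhitespace(text.charAt(j))) j++;
      // i stays at end, so the next token re-reads this whitespace as "before".
      tokens.add(new Token(before, text.substring(begin, end),
                           text.substring(end, j), begin, end));
    }
    return tokens;
  }

  // Rebuild the input from the stored fields (needs at least one token).
  static String restore(List<Token> tokens) {
    StringBuilder sb = new StringBuilder();
    if (!tokens.isEmpty()) sb.append(tokens.get(0).before);
    for (Token t : tokens) sb.append(t.originalText).append(t.after);
    return sb.toString();
  }

  public static void main(String[] args) {
    String text = " It's  $5 7/8,\tOK. ";
    List<Token> tokens = tokenize(text);
    System.out.println(restore(tokens).equals(text));
    Token first = tokens.get(0);
    System.out.println(first.end - first.begin == first.originalText.length());
  }
}
```

As with String.substring, the end offset is exclusive, which is why end - begin equals the token length.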

Questions

For asking questions, see our support page.

Performance: Speed

PTBTokenizer is a fast compiled finite automaton. This has some disadvantages, limiting the extent to which behavior can be changed at runtime, but means that it is very fast. Here are some statistics measured on a MacBook Pro (15 inch, 2016) with a 2.7 GHz Intel Core i7 processor (4 cores, 256 KB L2 cache per core, 8 MB L3 cache) running Java 9 and Stanford NLP v3.9.1; the statistics involving disk used an SSD. The documents used were NYT newswire from LDC English Gigaword 5.

PTBTokenizer Configuration	Tokens/second	Ave. time per Gigaword document
Tokenizing document Strings in memory	4.51 million	0.18 ms.
Tokenizing from disk to disk	3.15 million	0.25 ms.

For comparison, we tried to directly time the speed of the SpaCy tokenizer v.2.0.11 under Python v.3.5.4. (Note: this is SpaCy v2, not v1. We believe the figures in their speed benchmarks are still reporting numbers from SpaCy v1, which was apparently much faster than v2). Here are the timings we got:

SpaCy Configuration	Tokens/second	Ave. time per Gigaword document
Tokenizing document Strings in memory	180 thousand	4.7 ms.
Tokenizing from disk to disk	125 thousand	6.5 ms.

Indeed, we find that, using the stanfordcorenlp Python wrapper, you can tokenize with CoreNLP in Python in about 70% of the time that SpaCy v2 takes, even though a lot of the speed difference necessarily goes away while marshalling data into JSON, sending it via HTTP and then reassembling it from JSON.
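
As a quick cross-check of the two tables, the rates and per-document times can be combined with simple arithmetic. The sketch below uses only the rounded figures printed above, so its results are approximate; the class and method names are hypothetical.

```java
// Hypothetical illustration; uses only the rounded figures from the tables.
class ThroughputSketch {

  // tokens/second multiplied by seconds/document gives tokens per document.
  static double tokensPerDocument(double tokensPerSecond, double millisPerDocument) {
    return tokensPerSecond * millisPerDocument / 1000.0;
  }

  // How many times faster the first rate is than the second.
  static double speedup(double fasterTokensPerSecond, double slowerTokensPerSecond) {
    return fasterTokensPerSecond / slowerTokensPerSecond;
  }

  public static void main(String[] args) {
    // In-memory rows: PTBTokenizer 4.51 million tokens/s at 0.18 ms/document,
    // SpaCy 180 thousand tokens/s at 4.7 ms/document.
    System.out.printf("PTBTokenizer tokens/doc: %.0f%n", tokensPerDocument(4.51e6, 0.18));
    System.out.printf("SpaCy tokens/doc:        %.0f%n", tokensPerDocument(180e3, 4.7));
    System.out.printf("In-memory speedup:       %.1f%n", speedup(4.51e6, 180e3));
    System.out.printf("Disk-to-disk speedup:    %.1f%n", speedup(3.15e6, 125e3));
  }
}
```

Both tables imply documents of roughly 800 tokens each (about 812 and 846 from the in-memory rows), and a raw tokenizer speed ratio of about 25 to 1 in both configurations.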
