Stanford Tokenizer


About

A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name. Over time the tokenizer has gained quite a few options and a fair amount of Unicode compatibility, so in general it works well on text in the Unicode Basic Multilingual Plane, provided the text does not require word segmentation (as needed for writing systems that do not put spaces between words) or more exotic language-particular rules (as needed for writing systems that use : or ? as a character inside words). An ancillary tool uses this tokenization to split text into sentences. PTBTokenizer mainly targets formal English writing rather than SMS-speak.

PTBTokenizer is an efficient, fast, deterministic tokenizer. (For the more technically inclined, it is implemented as a finite automaton, produced by JFlex.) On a typical 2010 computer, it will tokenize text at a rate of about 200,000 tokens per second. While deterministic, it uses some quite good heuristics, so it can usually decide when single quotes are parts of words, when periods do and don't imply sentence boundaries, etc. Sentence splitting is a deterministic consequence of tokenization: a sentence ends when a sentence-ending character (., !, or ?) is found that is not grouped with other characters into a token (such as for an abbreviation or number), though the sentence may still include a few tokens that can follow a sentence-ending character (such as quotes and brackets).
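The sentence-splitting rule described above can be sketched over a list of tokens that has already been produced. This is a minimal illustration only, not the code the tokenizer or its sentence splitter uses: the class name SentenceSplitSketch and the particular set of tokens allowed to trail a sentence end are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the splitting rule; not part of the Stanford code.
class SentenceSplitSketch {

  // A sentence can end only at a token that is exactly one of these.
  // "Inc." or "3.5" never match, because the period is inside a larger token.
  private static final Set<String> ENDERS =
      new HashSet<String>(Arrays.asList(".", "!", "?"));

  // Tokens that may follow the ender and still belong to the same sentence.
  // This particular set is an assumption chosen for illustration.
  private static final Set<String> FOLLOWERS =
      new HashSet<String>(Arrays.asList("''", "\"", ")", "]"));

  static List<List<String>> split(List<String> tokens) {
    List<List<String>> sentences = new ArrayList<List<String>>();
    List<String> current = new ArrayList<String>();
    boolean ended = false;  // true once the current sentence has seen an ender
    for (String tok : tokens) {
      // After an ender, the first token that is neither a follower nor
      // another ender starts a new sentence.
      if (ended && !FOLLOWERS.contains(tok) && !ENDERS.contains(tok)) {
        sentences.add(current);
        current = new ArrayList<String>();
        ended = false;
      }
      current.add(tok);
      if (ENDERS.contains(tok)) {
        ended = true;
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "``", "Stop", "!", "''", "He", "works", "at", "Heinz", "Inc.", "now", ".");
    for (List<String> sentence : split(tokens)) {
      StringBuilder line = new StringBuilder();
      for (String tok : sentence) {
        if (line.length() > 0) line.append(' ');
        line.append(tok);
      }
      System.out.println(line);
    }
  }
}
```

Because the decision depends only on the token sequence, the same input always yields the same sentences, which is the sense in which sentence splitting is a deterministic consequence of tokenization.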

PTBTokenizer has been developed by Christopher Manning, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer.

Obtaining

The Stanford Tokenizer is not distributed separately but is included in several of our software downloads, including the Stanford Parser, Stanford Part-of-Speech Tagger, Stanford Named Entity Recognizer, and Stanford CoreNLP. Choose a tool that has been updated recently, download it, and you're ready to go. See these software packages for details on software licenses.

Usage

The tokenizer requires Java (JDK1.5+). As well as API access, the program includes an easy-to-use command-line interface, PTBTokenizer. For the examples below, we assume you have set up your CLASSPATH to find PTBTokenizer, for example with a command like the following (the details depend on your operating system and shell):

export CLASSPATH=stanford-parser.jar
You can also specify this on each command line by adding -cp stanford-parser.jar after java.

Command-line usage

The basic operation is to convert a plain text file into a sequence of tokens, which are printed out one per line. Here is an example (on Unix):

$ cat >sample.txt
"Oh, no," she's saying, "our $400 blender can't handle something this hard!"
$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
``
Oh
,
no
,
''
she
's
saying
,
``
our
$
400
blender
ca
n't
handle
something
this
hard
!
''
PTBTokenizer tokenized 23 tokens at 370.97 tokens per second.

Here, we gave a filename argument which contained the text. PTBTokenizer can also read from a gzip-compressed file or a URL, or it can run as a filter, reading from stdin. It can do a number of other things as well, controlled by command-line flags.

The output of PTBTokenizer can be post-processed to divide a text into sentences. One way to get that output from the command line is by calling edu.stanford.nlp.process.DocumentPreprocessor. For example:

$ cat >sample.txt
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through
three rounds of last month's U.S. Open. H.J. Heinz Company said it
completed the sale of its Ore-Ida frozen-food business catering to the
service industry to McCain Foods Ltd. for about $500 million.
It's the first group action of its kind in Britain and one of
only a handful of lawsuits against tobacco companies outside the
U.S. A Paris lawyer last year sued France's Seita SA on behalf of
two cancer-stricken smokers. Japan Tobacco Inc. faces a suit from
five smokers who accuse the government-owned company of hooking
them on an addictive product.
$
$ java edu.stanford.nlp.process.DocumentPreprocessor sample.txt 
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through
three rounds of last month 's U.S. Open .
H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food
business catering to the service industry to McCain Foods Ltd. for about
$ 500 million .
It 's the first group action of its kind in Britain and one of only a
handful of lawsuits against tobacco companies outside the U.S. .
A Paris lawyer last year sued France 's Seita SA on behalf of two
cancer-stricken smokers .
Japan Tobacco Inc. faces a suit from five smokers who accuse the
government-owned company of hooking them on an addictive product .
Read in 5 sentences.

API usage

There are various ways to call the code, but here's a simple example to get started with using either PTBTokenizer directly or calling DocumentPreprocessor.

import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizerDemo {

  public static void main(String[] args) throws IOException {
    for (String arg : args) {
      // option #1: By sentence.
      DocumentPreprocessor dp = new DocumentPreprocessor(arg);
      for (List<HasWord> sentence : dp) {
        System.out.println(sentence);
      }
      // option #2: By token. The empty string means no tokenizer options.
      PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(new FileReader(arg),
              new CoreLabelTokenFactory(), "");
      while (ptbt.hasNext()) {
        CoreLabel label = ptbt.next();
        System.out.println(label);
      }
    }
  }
}

Options

There are a number of options that affect how tokenization is performed. These can be specified on the command line with the flag -options (or -tokenizerOptions in tools like the Stanford Parser), in the constructor to PTBTokenizer, or in the factory methods in PTBTokenizerFactory. Options are given as a single string, with options separated by commas and values written in option=value syntax, for instance "americanize=false,unicodeQuotes=true,unicodeEllipsis=true".
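To make the option string syntax concrete, here is a small standalone sketch that splits such a string into option names and values and joins them back together. It is not the tokenizer's own option parser: the class TokenizerOptionsSketch and its methods are hypothetical, and since the text above only describes option=value items, the sketch rejects anything else.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the "a=b,c=d" option syntax only.
class TokenizerOptionsSketch {

  // Split a comma-separated option string into an ordered name -> value map.
  static Map<String, String> parse(String options) {
    Map<String, String> result = new LinkedHashMap<String, String>();
    for (String item : options.split(",")) {
      String trimmed = item.trim();
      if (trimmed.isEmpty()) {
        continue;  // tolerate an empty string or stray commas
      }
      int eq = trimmed.indexOf('=');
      if (eq <= 0) {
        // Assumption of this sketch: every item must look like option=value.
        throw new IllegalArgumentException("Expected option=value, got: " + trimmed);
      }
      result.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
    }
    return result;
  }

  // Join a map back into the single-string form.
  static String format(Map<String, String> options) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : options.entrySet()) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(e.getKey()).append('=').append(e.getValue());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> opts =
        parse("americanize=false,unicodeQuotes=true,unicodeEllipsis=true");
    opts.put("americanize", "true");  // change one setting
    System.out.println(format(opts));
  }
}
```

A string built this way can be passed wherever an options string is accepted, such as the third constructor argument of PTBTokenizer in TokenizerDemo, where the demo passes the empty string.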

Mailing Lists

We have 3 mailing lists for the Stanford Tokenizer, all of which are shared with other JavaNLP tools (with the exception of the parser). Each address is at @lists.stanford.edu:

  1. java-nlp-user This is the best list to post to in order to ask questions, make announcements, or for discussion among JavaNLP users. You have to subscribe to be able to use it. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
  2. java-nlp-announce This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
  3. java-nlp-support This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, please join and use java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.