A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name. Over time the tokenizer has added quite a few options and a fair amount of Unicode compatibility, so in general it will work well over text encoded in the Unicode Basic Multilingual Plane, provided the text does not require word segmentation (as is needed for writing systems that do not put spaces between words) or more exotic language-particular rules (as are needed for writing systems that use : or ? as a character inside words). An ancillary tool uses this tokenization to split text into sentences. PTBTokenizer mainly targets formal English writing rather than SMS-speak.
PTBTokenizer is an efficient, fast, deterministic tokenizer. (For the more technically inclined, it is implemented as a finite automaton, produced by JFlex.) On a typical 2010 computer, it will tokenize text at a rate of about 200,000 tokens per second. While deterministic, it uses some quite good heuristics, so it can usually decide when single quotes are parts of words, when periods do and don't imply sentence boundaries, etc. Sentence splitting is a deterministic consequence of tokenization: a sentence ends when a sentence-ending character (., !, or ?) is found that is not grouped with other characters into a token (such as for an abbreviation or number), though the sentence may still include a few tokens that can follow a sentence-ending character (such as quotes and brackets).
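The sentence-splitting rule described above can be sketched in a few lines of plain Java. This is only an illustration over an already tokenized list, not the code used by the Stanford tools: the class name SentenceSplitSketch and the exact contents of the FOLLOWERS set are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical class, not part of the Stanford tools.
class SentenceSplitSketch {

    // A sentence ends at one of these only when it stands alone as a token.
    private static final Set<String> ENDERS =
            new HashSet<String>(Arrays.asList(".", "!", "?"));

    // Assumed set of tokens allowed to trail a sentence ender
    // (closing quotes and brackets).
    private static final Set<String> FOLLOWERS =
            new HashSet<String>(Arrays.asList("''", "\"", "'", ")", "]", "}"));

    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        boolean ended = false; // true once the current sentence has seen an ender
        for (String token : tokens) {
            if (ended && !FOLLOWERS.contains(token)) {
                // The previous sentence is complete; start a new one.
                sentences.add(current);
                current = new ArrayList<String>();
                ended = false;
            }
            current.add(token);
            if (ENDERS.contains(token)) {
                ended = true;
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // "H.J." and "Ltd." are single tokens, so their periods do not end a sentence.
        List<String> tokens = Arrays.asList(
                "``", "It", "ca", "n't", "handle", "this", "!", "''",
                "H.J.", "Heinz", "Company", "sold", "it", "to", "McCain",
                "Foods", "Ltd.", "for", "$", "500", "million", ".");
        for (List<String> sentence : split(tokens)) {
            System.out.println(sentence);
        }
    }
}
```

Because the decision depends only on the token sequence, the same tokens always produce the same sentence boundaries, which is the sense in which splitting is a deterministic consequence of tokenization.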
PTBTokenizer has been developed by Christopher Manning, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer.
The Stanford Tokenizer is not distributed separately but is included in several of our software downloads, including the Stanford Parser, Stanford Part-of-Speech Tagger, Stanford Named Entity Recognizer, and Stanford CoreNLP. Choose a tool that has been updated recently, download it, and you're ready to go. See these software packages for details on software licenses.
The tokenizer requires Java (JDK1.5+). As well as API
access, the program includes an easy-to-use
command-line interface, PTBTokenizer. For the examples
below, we assume you have set up your CLASSPATH to find
PTBTokenizer, for example with a command like the following
(the details depend on your operating system and shell):
export CLASSPATH=stanford-parser.jar
You can also specify this on each command line by adding
-cp stanford-parser.jar after java.
The basic operation is to convert a plain text file into a sequence of tokens, which are printed out one per line. Here is an example (on Unix):
$ cat >sample.txt
"Oh, no," she's saying, "our $400 blender can't handle something this hard!"
$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
``
Oh
,
no
,
''
she
's
saying
,
``
our
$
400
blender
ca
n't
handle
something
this
hard
!
''
PTBTokenizer tokenized 23 tokens at 370.97 tokens per second.
Here, we gave a filename argument which contained the text. PTBTokenizer can also read from a gzip-compressed file or a URL, or it can run as a filter, reading from stdin. Command-line flags enable a number of other behaviors. For example, tokenizing with -preserveLines and then untokenizing with -untok reproduces the sample file exactly:
$ java edu.stanford.nlp.process.PTBTokenizer -preserveLines < sample.txt | java edu.stanford.nlp.process.PTBTokenizer -untok > roundtrip.txt
$ diff sample.txt roundtrip.txt
$
The output of PTBTokenizer can be post-processed to divide a text into
sentences. One way to get that output from the command line is
by calling edu.stanford.nlp.process.DocumentPreprocessor.
For example:
$ cat >sample.txt
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open. H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million. It's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. A Paris lawyer last year sued France's Seita SA on behalf of two cancer-stricken smokers. Japan Tobacco Inc. faces a suit from five smokers who accuse the government-owned company of hooking them on an addictive product.
$
$ java edu.stanford.nlp.process.DocumentPreprocessor sample.txt
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month 's U.S. Open .
H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $ 500 million .
It 's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. .
A Paris lawyer last year sued France 's Seita SA on behalf of two cancer-stricken smokers .
Japan Tobacco Inc. faces a suit from five smokers who accuse the government-owned company of hooking them on an addictive product .
Read in 5 sentences.
There are various ways to call the code, but here's a simple example to
get started with using either PTBTokenizer directly or
calling DocumentPreprocessor.
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizerDemo {

  public static void main(String[] args) throws IOException {
    // Each command-line argument is treated as the path of a text file.
    for (String arg : args) {
      // option #1: By sentence. Each sentence is a list of tokens.
      DocumentPreprocessor dp = new DocumentPreprocessor(arg);
      for (List<HasWord> sentence : dp) {
        System.out.println(sentence);
      }
      // option #2: By token. The empty string means no tokenizer options.
      PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(new FileReader(arg),
          new CoreLabelTokenFactory(), "");
      for (CoreLabel label; ptbt.hasNext(); ) {
        label = ptbt.next();
        System.out.println(label);
      }
    }
  }
}
There are a number of options that affect how tokenization is
performed. These can be specified on the command line, with the flag
-options (or -tokenizerOptions in tools like the
Stanford Parser) or in the constructor to PTBTokenizer or
the factory methods in PTBTokenizerFactory. Here are the
current options. They are specified as a single string, with options
separated by commas, and values given in option=value syntax, for
instance
"americanize=false,unicodeQuotes=true,unicodeEllipsis=true".
We have 3 mailing lists for the Stanford Tokenizer, all of which are shared
with other JavaNLP tools (with the exclusion of the parser). Each address is
at @lists.stanford.edu:
java-nlp-user This is the best list to post to in order
to ask questions, make announcements, or for discussion among JavaNLP
users. You have to subscribe to be able to use it.
Join the list via this webpage or by emailing
java-nlp-user-join@lists.stanford.edu. (Leave the
subject and message body empty.) You can also
look at
the list archives.
java-nlp-announce This list will be used only to announce
new versions of Stanford JavaNLP tools, so it will be very low volume (expect 1-3
messages a year). Join the list via this webpage or by emailing
java-nlp-announce-join@lists.stanford.edu. (Leave the
subject and message body empty.)
java-nlp-support This list goes only to the software
maintainers. It's a good address for licensing questions, etc. For
general use and support questions, please join and use
java-nlp-user.
You cannot join java-nlp-support, but you can mail questions to
java-nlp-support@lists.stanford.edu.