edu.stanford.nlp.process
Class PTBTokenizer<T>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<T>
      extended by edu.stanford.nlp.process.PTBTokenizer<T>
All Implemented Interfaces:
Tokenizer<T>, Iterator<T>

public class PTBTokenizer<T>
extends AbstractTokenizer<T>

Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Words in the Penn Treebank format. It can optionally return carriage returns as tokens.

Author:
Tim Grow, Teg Grenager (grenager@stanford.edu), Christopher Manning, Jenny Finkel (integrating in invertible PTB tokenizer)

Nested Class Summary
static class PTBTokenizer.PTBTokenizerFactory<T>
           
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory<T> tokenFactory)
          Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory.
 
Method Summary
static TokenizerFactory<Word> factory()
           
static TokenizerFactory<Word> factory(boolean tokenizeCRs)
           
static TokenizerFactory<CoreLabel> factory(boolean tokenizeCRs, boolean invertible)
           
static TokenizerFactory<Word> factory(boolean tokenizeCRs, boolean invertible, boolean suppressEscaping)
           
static
<T> TokenizerFactory<T>
factory(boolean tokenizeCRs, LexedTokenFactory<T> factory)
           
protected  T getNext()
          Internally fetches the next token.
static void main(String[] args)
          Reads the files named as arguments and prints their tokens one per line.
static PTBTokenizer<Word> newPTBTokenizer(Reader r)
          Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
static PTBTokenizer<Word> newPTBTokenizer(Reader r, boolean tokenizeCRs)
          Constructs a new PTBTokenizer that optionally returns carriage returns as their own token.
static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r, boolean tokenizeCRs, boolean invertible)
          Constructs a new PTBTokenizer that optionally returns carriage returns as their own token.
static String ptb2Text(List ptbWords)
          Returns a presentable version of the given PTB-tokenized words.
static int ptb2Text(Reader ptbText, Writer w)
          Returns a presentable version of the given PTB-tokenized text.
static String ptb2Text(String ptbText)
          Returns a presentable version of the given PTB-tokenized text.
 void setSource(Reader r)
          Sets the source of this Tokenizer to be the Reader r.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PTBTokenizer

public PTBTokenizer(Reader r,
                    boolean tokenizeCRs,
                    LexedTokenFactory<T> tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. CRs come back as Words whose text is the value of PTBLexer.cr.

Parameters:
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
Method Detail

newPTBTokenizer

public static PTBTokenizer<Word> newPTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.

Parameters:
r - The Reader whose contents will be tokenized
Returns:
a PTBTokenizer that tokenizes a stream to objects of type Word

newPTBTokenizer

public static PTBTokenizer<Word> newPTBTokenizer(Reader r,
                                                 boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. CRs come back as Words whose text is the value of PTBLexer.cr.
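
The effect of the tokenizeCRs flag can be illustrated with a minimal, self-contained sketch. This is not the PTB lexer: it splits on whitespace only, and the class name NewlineTokenSketch and the placeholder string "<NEWLINE>" are assumptions made for illustration (the real token text is the value of PTBLexer.cr, which is not shown on this page).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative whitespace splitter; it does not apply Penn Treebank rules. */
class NewlineTokenSketch {
    /** Hypothetical placeholder; the real tokenizer uses the value of PTBLexer.cr. */
    static final String NEWLINE_TOKEN = "<NEWLINE>";

    /** Splits on whitespace; when tokenizeCRs is true, each '\n' also yields its own token. */
    static List<String> tokenize(String text, boolean tokenizeCRs) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the token being built, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (tokenizeCRs && c == '\n') {
                    tokens.add(NEWLINE_TOKEN);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With tokenizeCRs set to false the line break behaves like any other whitespace; with true it appears in the token stream, which is useful when line boundaries carry meaning (for example, one sentence per line).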


newPTBTokenizer

public static PTBTokenizer<CoreLabel> newPTBTokenizer(Reader r,
                                                      boolean tokenizeCRs,
                                                      boolean invertible)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. CRs come back as Words whose text is the value of PTBLexer.cr.

Parameters:
invertible - if set to true, the tokenizer produces CoreLabels that carry fields for the strings before and after each token, and for the token's character offsets
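
To show why the before/after strings and character offsets make tokenization invertible, here is a minimal sketch. It does not use CoreLabel; the class InvertibleTokenSketch, its Token type, and its field names are hypothetical, and tokens are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: whitespace tokens that remember enough to rebuild the input. */
class InvertibleTokenSketch {
    static final class Token {
        final String word;    // token text
        final String before;  // whitespace immediately preceding the token
        final String after;   // whitespace immediately following the token
        final int begin;      // offset of the first character (inclusive)
        final int end;        // offset just past the last character (exclusive)

        Token(String word, String before, String after, int begin, int end) {
            this.word = word;
            this.before = before;
            this.after = after;
            this.begin = begin;
            this.end = end;
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            String before = text.substring(wsStart, i);
            if (i == n) break; // trailing whitespace is already the previous token's "after"
            int begin = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            int end = i;
            int j = end;
            while (j < n && Character.isWhitespace(text.charAt(j))) j++;
            tokens.add(new Token(text.substring(begin, end), before, text.substring(end, j), begin, end));
            // i stays at "end" so the same whitespace becomes the next token's "before".
        }
        return tokens;
    }

    /** Rebuilds the input from the first token's "before" plus every word and "after". */
    static String restore(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            Token t = tokens.get(k);
            if (k == 0) sb.append(t.before);
            sb.append(t.word).append(t.after);
        }
        return sb.toString();
    }
}
```

In this sketch an input consisting only of whitespace produces no tokens and therefore cannot be restored; a real invertible tokenizer needs its own policy for that case.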

getNext

protected T getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<T>
Returns:
the next token in the token stream, or null if none exists.
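
The contract "return null when no token remains" pairs naturally with a one-token lookahead buffer, such as the inherited nextToken field. The sketch below is an assumption about how such a base class could build hasNext, next, and peek on top of getNext; it is not the source of AbstractTokenizer, and both class names are hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative base class: derives the iterator methods from a null-terminated getNext(). */
abstract class LookaheadTokenizerSketch<T> implements Iterator<T> {
    /** Buffered token; null means nothing is buffered. */
    protected T nextToken;

    /** Returns the next token, or null if none exists. */
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    /** Returns the upcoming token without consuming it. */
    public T peek() {
        if (!hasNext()) throw new NoSuchElementException();
        return nextToken;
    }
}

/** Toy subclass that serves tokens from an array. */
class ArrayTokenizerSketch extends LookaheadTokenizerSketch<String> {
    private final String[] words;
    private int index = 0;

    ArrayTokenizerSketch(String... words) {
        this.words = words;
    }

    @Override
    protected String getNext() {
        return index < words.length ? words[index++] : null;
    }
}
```

Because null signals the end of the stream, a design like this cannot represent a null token; subclasses only need to implement getNext.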

setSource

public void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.

Parameters:
r - The Reader to tokenize from

ptb2Text

public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.
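
To make the idea concrete, the following sketch undoes a handful of well-known PTB conventions (-LRB-/-RRB- brackets, `` and '' quotes, split clitics such as n't and 's, and detached punctuation). It is a simplified illustration under those assumptions, not the rule set used by ptb2Text, and the class name DetokenizeSketch is hypothetical.

```java
/** Illustrative detokenizer covering only a few PTB conventions. */
class DetokenizeSketch {
    static String detokenize(String ptbText) {
        StringBuilder out = new StringBuilder();
        boolean glueNext = true; // true at the start and right after an opening bracket or quote
        for (String tok : ptbText.trim().split("\\s+")) {
            if (tok.isEmpty()) continue; // only happens for empty input
            String text = tok;
            boolean opens = false;       // next token attaches directly to this one
            boolean attachLeft = false;  // this token attaches directly to the previous one
            switch (tok) {
                case "-LRB-": text = "("; opens = true; break;
                case "-RRB-": text = ")"; attachLeft = true; break;
                case "``":    text = "\""; opens = true; break;
                case "''":    text = "\""; attachLeft = true; break;
                default:
                    // Sentence punctuation and clitics ("n't", "'s", "'re", ...) join the previous word.
                    attachLeft = tok.matches("[.,;:!?]") || tok.equals("n't") || tok.startsWith("'");
            }
            if (!glueNext && !attachLeft) {
                out.append(' ');
            }
            out.append(text);
            glueNext = opens;
        }
        return out.toString();
    }
}
```

For example, the tokenized string He did n't say `` hello '' -LRB- twice -RRB- . becomes He didn't say "hello" (twice). under these rules. Cases such as single-quote quotations are not handled.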


ptb2Text

public static int ptb2Text(Reader ptbText,
                           Writer w)
                    throws IOException
Returns a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.

Throws:
IOException

ptb2Text

public static String ptb2Text(List ptbWords)
Returns a presentable version of the given PTB-tokenized words. Pass in a List of Words or Strings, or a Document, and this method will join the words with spaces and call ptb2Text(String) on the output. This method checks whether the elements in the list are subtypes of Word, and if so, it takes their word() values to prevent additional text from creeping in (e.g., POS tags). Otherwise the toString value is used. Implementation note: At the moment, this can be called on a List of either String or Word. The typing should be cleaned up at some point.
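
The element check described above can be sketched as follows. SketchWord is a hypothetical stand-in for a Word subtype whose toString includes extra annotation, and JoinWordsSketch is likewise an illustrative name; the real method would pass the joined string on to ptb2Text(String).

```java
import java.util.List;

/** Illustrative join step: prefer word() over toString() for word-like elements. */
class JoinWordsSketch {
    /** Hypothetical word type whose toString() appends a POS tag. */
    static final class SketchWord {
        private final String word;
        private final String tag;

        SketchWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        String word() {
            return word;
        }

        @Override
        public String toString() {
            return word + "/" + tag;
        }
    }

    /** Joins the items with single spaces, using word() when an item is a SketchWord. */
    static String join(List<?> items) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Object item : items) {
            if (!first) {
                sb.append(' ');
            }
            first = false;
            if (item instanceof SketchWord) {
                sb.append(((SketchWord) item).word()); // keeps the tag out of the text
            } else {
                sb.append(item); // falls back to toString()
            }
        }
        return sb.toString();
    }
}
```

Using word() rather than toString() is what keeps annotations such as "/NNS" from leaking into the joined text.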


factory

public static TokenizerFactory<Word> factory()

factory

public static TokenizerFactory<Word> factory(boolean tokenizeCRs)

factory

public static <T> TokenizerFactory<T> factory(boolean tokenizeCRs,
                                              LexedTokenFactory<T> factory)

factory

public static TokenizerFactory<CoreLabel> factory(boolean tokenizeCRs,
                                                  boolean invertible)

factory

public static TokenizerFactory<Word> factory(boolean tokenizeCRs,
                                             boolean invertible,
                                             boolean suppressEscaping)

main

public static void main(String[] args)
                 throws IOException
Reads the files named as arguments and prints their tokens one per line. This is mainly a testing aid, but it can also be quite useful standalone to turn a corpus into a one-token-per-line file of tokens. This main method assumes that the input file is in UTF-8 encoding unless a charset is specified with the -charset option.

Usage: java edu.stanford.nlp.process.PTBTokenizer [-charset charset] [-nl] filename+

Options:

Parameters:
args - Command line arguments
Throws:
IOException - If there is any file I/O problem


Stanford NLP Group