edu.stanford.nlp.process
Class PTBTokenizer

java.lang.Object
  |
  +--edu.stanford.nlp.process.AbstractTokenizer
        |
        +--edu.stanford.nlp.process.PTBTokenizer
All Implemented Interfaces:
Iterator, Tokenizer

public class PTBTokenizer
extends AbstractTokenizer

Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. It is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and returns tokens as edu.stanford.nlp.trees.Word objects in Penn Treebank format. It can optionally return carriage returns as tokens.

Author:
Teg Grenager (grenager@stanford.edu)

Constructor Summary
PTBTokenizer()
          Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
PTBTokenizer(boolean tokenizeCRs)
          Constructs a new PTBTokenizer that optionally returns carriage returns as their own tokens.
PTBTokenizer(Reader r)
          Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
PTBTokenizer(Reader r, boolean tokenizeCRs)
          Constructs a new PTBTokenizer that optionally returns carriage returns as their own tokens.
 
Method Summary
 boolean hasNext()
          Returns true if this Tokenizer has more elements.
static void main(String[] args)
          Reads a file from the argument and prints its tokens one per line.
 Object next()
          Returns the next Word token, or null if there is none.
 void setSource(Reader r)
          Sets the source of this Tokenizer to be the Reader r.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
pushBack, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PTBTokenizer

public PTBTokenizer()
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace. No source is specified, so hasNext() will return false.


PTBTokenizer

public PTBTokenizer(boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own tokens. When tokenizeCRs is true, each carriage return is returned as a Word whose text is PTBLexer#cr.


PTBTokenizer

public PTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.


PTBTokenizer

public PTBTokenizer(Reader r,
                    boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own tokens. When tokenizeCRs is true, each carriage return is returned as a Word whose text is PTBLexer#cr.

Method Detail

hasNext

public boolean hasNext()
Returns true if this Tokenizer has more elements.

Specified by:
hasNext in interface Tokenizer
Specified by:
hasNext in class AbstractTokenizer

next

public Object next()
Returns the next Word token, or null if there is none.

Specified by:
next in interface Tokenizer
Specified by:
next in class AbstractTokenizer

main

public static void main(String[] args)
                 throws IOException
Reads the file named by the command-line argument and prints its tokens one per line. This is mainly a testing aid, but it is also useful on its own for converting a corpus into a file with one token per line.

Usage: java edu.stanford.nlp.process.PTBTokenizer filename

Parameters:
args - Command line arguments
Throws:
IOException

setSource

public void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.

Specified by:
setSource in interface Tokenizer
Specified by:
setSource in class AbstractTokenizer
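
The hasNext(), next(), and setSource(Reader) contracts described above suggest a simple drain loop on the caller's side. The sketch below illustrates that loop only. Because PTBTokenizer itself requires the Stanford NLP library on the classpath, the example uses StandInTokenizer, a hypothetical class that copies just the three documented method signatures and splits on whitespace. It does not implement Penn Treebank conventions, and the names StandInTokenizer, TokenizerLoopSketch, and drain are assumptions made for illustration, not part of the library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Caller-side loop: keep calling next() while hasNext() is true.
class TokenizerLoopSketch {
    static List<String> drain(StandInTokenizer tok) {
        List<String> out = new ArrayList<>();
        while (tok.hasNext()) {
            out.add(String.valueOf(tok.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        StandInTokenizer tok = new StandInTokenizer(); // no source yet
        System.out.println(tok.hasNext());             // prints false
        tok.setSource(new StringReader("The quick brown fox"));
        for (String token : drain(tok)) {
            System.out.println(token);                 // one token per line
        }
    }
}

// HYPOTHETICAL stand-in, not the real PTBTokenizer: it mirrors only the
// documented signatures and tokenizes on whitespace.
class StandInTokenizer {
    private Reader source;    // null until setSource is called
    private String lookahead; // token read ahead by hasNext()

    StandInTokenizer() {}

    StandInTokenizer(Reader r) {
        setSource(r);
    }

    public void setSource(Reader r) {
        source = r;
        lookahead = null;
    }

    public boolean hasNext() {
        if (lookahead == null) {
            lookahead = readToken();
        }
        return lookahead != null;
    }

    // Mirrors the documented contract: the next token, or null if none.
    public Object next() {
        if (!hasNext()) {
            return null;
        }
        String token = lookahead;
        lookahead = null;
        return token;
    }

    private String readToken() {
        if (source == null) {
            return null;
        }
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = source.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // whitespace ends the current token
                    }
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the real class, the same loop would presumably apply, with next() returning Word objects rather than the plain strings used here.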


Stanford NLP Group