edu.stanford.nlp.process
Class InvertiblePTBTokenizer

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<FeatureLabel>
      extended by edu.stanford.nlp.process.InvertiblePTBTokenizer
All Implemented Interfaces:
Tokenizer<FeatureLabel>, Iterator<FeatureLabel>

public class InvertiblePTBTokenizer
extends AbstractTokenizer<FeatureLabel>

Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as FeatureLabels in Penn Treebank format.
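The field names below (BEFORE_KEY, CURRENT_KEY, AFTER_KEY, START_POSITION_KEY, END_POSITION_KEY) suggest that each token keeps its original text, the whitespace around it, and its character offsets, so that the input can be rebuilt from the token stream. The following is a minimal sketch of that idea only. It assumes this reading of the keys, splits on whitespace rather than applying Penn Treebank rules, and uses a hypothetical class InvertibleSketch with placeholder key values and plain maps in place of FeatureLabel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not the Stanford class: shows how per-token
// "before", "current" and "after" entries allow the input to be rebuilt.
class InvertibleSketch {
    // Placeholder key values; the real constants' values are not stated here.
    static final String BEFORE_KEY = "before";
    static final String CURRENT_KEY = "current";
    static final String AFTER_KEY = "after";
    static final String START_POSITION_KEY = "start";
    static final String END_POSITION_KEY = "end";

    // Splits on whitespace only and records the context of every token.
    static List<Map<String, String>> tokenize(String text) {
        List<Map<String, String>> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        // Whitespace preceding the first token.
        while (i < n && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        String before = text.substring(0, i);
        while (i < n) {
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int end = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String after = text.substring(end, i);
            Map<String, String> token = new HashMap<>();
            token.put(BEFORE_KEY, before);
            token.put(CURRENT_KEY, text.substring(start, end));
            token.put(AFTER_KEY, after);
            token.put(START_POSITION_KEY, Integer.toString(start));
            token.put(END_POSITION_KEY, Integer.toString(end));
            tokens.add(token);
            before = after; // whitespace is shared between neighbouring tokens
        }
        return tokens;
    }

    // Rebuilds the input from the first token's "before" text plus every
    // token's "current" and "after" text. Input without any token is not
    // recoverable in this sketch.
    static String invert(List<Map<String, String>> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            Map<String, String> token = tokens.get(k);
            if (k == 0) {
                sb.append(token.get(BEFORE_KEY));
            }
            sb.append(token.get(CURRENT_KEY)).append(token.get(AFTER_KEY));
        }
        return sb.toString();
    }
}
```

Calling InvertibleSketch.invert on the result of InvertibleSketch.tokenize returns the original string whenever the input contains at least one token, which is the property the word "invertible" is taken to mean here.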

Author:
Jenny Finkel (jrfinkel@stanford.edu)

Nested Class Summary
static class InvertiblePTBTokenizer.InvertiblePTBTokenizerFactory
           
 
Field Summary
static String AFTER_KEY
           
static String BEFORE_KEY
           
static String CURRENT_KEY
           
static String END_POSITION_KEY
           
static String START_POSITION_KEY
           
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
InvertiblePTBTokenizer(Reader r)
          Constructs a new InvertiblePTBTokenizer that treats carriage returns as normal whitespace.
InvertiblePTBTokenizer(Reader r, boolean tokenizeCRs)
          Constructs a new InvertiblePTBTokenizer.
 
Method Summary
static TokenizerFactory factory()
           
static TokenizerFactory factory(boolean tokenizeCRs)
           
protected  FeatureLabel getNext()
          Internally fetches the next token.
static void main(String[] args)
          Reads a file from the argument and prints its tokens one per line.
 void setSource(Reader r)
          Sets the source of this Tokenizer to be the Reader r.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BEFORE_KEY

public static final String BEFORE_KEY
See Also:
Constant Field Values

CURRENT_KEY

public static final String CURRENT_KEY
See Also:
Constant Field Values

AFTER_KEY

public static final String AFTER_KEY
See Also:
Constant Field Values

START_POSITION_KEY

public static final String START_POSITION_KEY
See Also:
Constant Field Values

END_POSITION_KEY

public static final String END_POSITION_KEY
See Also:
Constant Field Values
Constructor Detail

InvertiblePTBTokenizer

public InvertiblePTBTokenizer(Reader r)
Constructs a new InvertiblePTBTokenizer that treats carriage returns as normal whitespace.


InvertiblePTBTokenizer

public InvertiblePTBTokenizer(Reader r,
                              boolean tokenizeCRs)
Constructs a new InvertiblePTBTokenizer.

Method Detail

getNext

protected FeatureLabel getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<FeatureLabel>
Returns:
the next token in the token stream, or null if none exists.
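
The inherited methods hasNext, next, peek and tokenize, together with the inherited nextToken field, are consistent with a one-token lookahead buffer that is filled by getNext. The sketch below shows one way such a base class could be written. The implementation of AbstractTokenizer is not shown on this page, so SketchTokenizer and WordTokenizer are hypothetical names and the buffering logic is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: subclasses supply getNext(), which returns
// null once the token stream is exhausted.
abstract class SketchTokenizer<T> implements Iterator<T> {
    protected T nextToken; // lookahead buffer, null when empty

    protected abstract T getNext();

    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    // Returns the next token without consuming it.
    public T peek() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Drains all remaining tokens into a list.
    public List<T> tokenize() {
        List<T> result = new ArrayList<>();
        while (hasNext()) {
            result.add(next());
        }
        return result;
    }
}

// Hypothetical subclass that yields whitespace-separated words.
class WordTokenizer extends SketchTokenizer<String> {
    private final String[] words;
    private int pos = 0;

    WordTokenizer(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    protected String getNext() {
        return pos < words.length ? words[pos++] : null;
    }
}
```

With this design a subclass only has to implement getNext, and returning null is the single signal that the stream has ended.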

main

public static void main(String[] args)
                 throws IOException
Reads the file named by the argument and prints its tokens one per line. This is mainly a testing aid, but it is also useful standalone for converting a corpus into a one-token-per-line file.

Usage: java edu.stanford.nlp.process.InvertiblePTBTokenizer filename

Parameters:
args - Command line arguments
Throws:
IOException
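
The one-token-per-line output described above can be illustrated with a small stand-in. This sketch does not call InvertiblePTBTokenizer: it assumes plain whitespace splitting via java.util.Scanner, and the class OneTokenPerLineSketch and its method oneTokenPerLine are hypothetical names used only for illustration.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

// Hypothetical stand-in for the behaviour of main: one token per output line.
class OneTokenPerLineSketch {
    // Reads everything from r and returns its whitespace-separated tokens,
    // each followed by a newline. Closing the Scanner also closes r.
    static String oneTokenPerLine(Reader r) {
        StringBuilder out = new StringBuilder();
        try (Scanner scanner = new Scanner(r)) {
            while (scanner.hasNext()) {
                out.append(scanner.next()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // With a file name argument, read that file; otherwise use sample text.
        Reader source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("A short sample sentence .");
        System.out.print(oneTokenPerLine(source));
    }
}
```

A Penn Treebank tokenizer would additionally split punctuation and clitics from words, which this whitespace-based sketch leaves attached.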

setSource

public void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.


factory

public static TokenizerFactory factory()

factory

public static TokenizerFactory factory(boolean tokenizeCRs)


Stanford NLP Group