edu.stanford.nlp.process
Class InvertiblePTBTokenizer
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<FeatureLabel>
edu.stanford.nlp.process.InvertiblePTBTokenizer
- All Implemented Interfaces:
- Tokenizer<FeatureLabel>, Iterator<FeatureLabel>
public class InvertiblePTBTokenizer
- extends AbstractTokenizer<FeatureLabel>
Tokenizer implementation that conforms to the Penn Treebank tokenization
conventions.
This tokenizer is a Java implementation of Professor Chris Manning's Flex
tokenizer, pgtt-treebank.l. It reads raw text and outputs
tokens as edu.stanford.nlp.trees.MapLabels in the Penn Treebank format.
- Author:
- Jenny Finkel (jrfinkel@stanford.edu)
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
BEFORE_KEY
public static final String BEFORE_KEY
- See Also:
- Constant Field Values
CURRENT_KEY
public static final String CURRENT_KEY
- See Also:
- Constant Field Values
AFTER_KEY
public static final String AFTER_KEY
- See Also:
- Constant Field Values
START_POSITION_KEY
public static final String START_POSITION_KEY
- See Also:
- Constant Field Values
END_POSITION_KEY
public static final String END_POSITION_KEY
- See Also:
- Constant Field Values
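The five key constants above are listed without descriptions. The sketch below illustrates one plausible reading, inferred only from the key names: each token carries the whitespace before it, its own text, the whitespace after it, and its character offsets, so the original input can be rebuilt exactly. Everything here is an assumption for illustration: the class InvertibleSketch, the key string values, and the whitespace-only splitting (which is not Penn Treebank tokenization) are hypothetical and are not taken from this library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of edu.stanford.nlp.
class InvertibleSketch {
    // Assumed key values; the real constant values are not shown on this page.
    static final String BEFORE_KEY = "before";
    static final String CURRENT_KEY = "current";
    static final String AFTER_KEY = "after";
    static final String START_POSITION_KEY = "start";
    static final String END_POSITION_KEY = "end";

    // Splits on whitespace and records, per token, the surrounding
    // whitespace and the character offsets into the original text.
    static List<Map<String, String>> tokenize(String text) {
        List<Map<String, String>> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        // Consume leading whitespace; it becomes the first token's "before".
        while (i < n && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        String before = text.substring(0, i);
        while (i < n) {
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int end = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String after = text.substring(end, i);
            Map<String, String> token = new LinkedHashMap<>();
            token.put(BEFORE_KEY, before);
            token.put(CURRENT_KEY, text.substring(start, end));
            token.put(AFTER_KEY, after);
            token.put(START_POSITION_KEY, Integer.toString(start));
            token.put(END_POSITION_KEY, Integer.toString(end));
            tokens.add(token);
            // The whitespace after this token is also the next token's "before".
            before = after;
        }
        return tokens;
    }

    // Rebuilds the original text: the first token's "before", then each
    // token's text followed by its "after". Whitespace-only input yields
    // no tokens in this sketch and therefore cannot be restored.
    static String invert(List<Map<String, String>> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            Map<String, String> token = tokens.get(k);
            if (k == 0) {
                sb.append(token.get(BEFORE_KEY));
            }
            sb.append(token.get(CURRENT_KEY)).append(token.get(AFTER_KEY));
        }
        return sb.toString();
    }
}
```

Under these assumptions, InvertibleSketch.invert(InvertibleSketch.tokenize(text)) returns the original string for any input containing at least one non-whitespace character, which is the property the word "invertible" suggests.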
InvertiblePTBTokenizer
public InvertiblePTBTokenizer(Reader r)
- Constructs a new InvertiblePTBTokenizer that treats carriage returns as normal whitespace.
InvertiblePTBTokenizer
public InvertiblePTBTokenizer(Reader r,
boolean tokenizeCRs)
- Constructs a new InvertiblePTBTokenizer.
getNext
protected FeatureLabel getNext()
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<FeatureLabel>
- Returns:
- the next token in the token stream, or null if none exists.
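The contract above (return the next token, or null when none remains) pairs naturally with the Iterator interface this class implements. The following is a minimal sketch of how a base class could adapt such a getNext() method into hasNext() and next() using one token of lookahead. It is an assumption about the general pattern, not the actual source of AbstractTokenizer; LookaheadSketch and WordSketch are hypothetical names.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical base class: adapts a null-terminated getNext() to Iterator.
abstract class LookaheadSketch<T> implements Iterator<T> {
    private T nextToken;     // one-token lookahead buffer
    private boolean primed;  // true once the buffer has been filled

    // Subclasses return the next token, or null when the stream is exhausted.
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (!primed) {
            nextToken = getNext();
            primed = true;
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = nextToken;
        nextToken = getNext();  // refill the lookahead buffer
        return result;
    }
}

// Hypothetical subclass that yields whitespace-separated words.
class WordSketch extends LookaheadSketch<String> {
    private final String[] words;
    private int pos = 0;

    WordSketch(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    protected String getNext() {
        return pos < words.length ? words[pos++] : null;
    }
}
```

With this design a subclass only has to implement getNext(); the null return value is the single end-of-stream signal, so tokens themselves must never be null.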
main
public static void main(String[] args)
throws IOException
- Reads the file named by the argument and prints its tokens one per line.
This is mainly a testing aid, but it is also useful standalone for
turning a corpus into a file with one token per line.
Usage:
java edu.stanford.nlp.process.InvertiblePTBTokenizer filename
- Parameters:
args
- Command line arguments
- Throws:
IOException
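To make the "one token per line" output concrete, here is a small stand-alone sketch of that conversion. It is not the body of main: it uses plain whitespace splitting instead of Penn Treebank rules, and the class OnePerLineSketch and method onePerLine are hypothetical names introduced only for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not part of edu.stanford.nlp.
class OnePerLineSketch {
    // Reads all text from the Reader and returns it with one
    // whitespace-separated token per line.
    static String onePerLine(Reader r) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader br = new BufferedReader(r)) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    // Blank lines produce a single empty string; skip it.
                    if (!token.isEmpty()) {
                        out.append(token).append('\n');
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(onePerLine(new StringReader("The cat sat.\n\nIt purred.\n")));
    }
}
```

A real Penn Treebank tokenizer would also split punctuation such as the final period into its own token; this sketch leaves it attached to the preceding word.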
setSource
public void setSource(Reader r)
- Sets the source of this Tokenizer to be the Reader r.
factory
public static TokenizerFactory factory()
factory
public static TokenizerFactory factory(boolean tokenizeCRs)
Stanford NLP Group