edu.stanford.nlp.process
Class WhitespaceTokenizer

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<Word>
      extended by edu.stanford.nlp.process.WhitespaceTokenizer
All Implemented Interfaces:
Tokenizer<Word>, java.util.Iterator<Word>

public class WhitespaceTokenizer
extends AbstractTokenizer<Word>

A WhitespaceTokenizer is a tokenizer that splits on and discards only whitespace characters. This implementation returns Word objects. A parameter controls whether an end-of-line (EOL) is returned as a token or treated as whitespace. If EOL is a token, the class returns it as a Word with String value "\n".
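
The behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation (which is JFlex-based) and it returns plain String values rather than Word objects; the class name WhitespaceSplitSketch and its tokenize helper are hypothetical names used only for illustration. It also assumes that only the line feed character counts as EOL, which may differ from what the real class does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT part of edu.stanford.nlp.process.
class WhitespaceSplitSketch {

    // Splits the input on whitespace and discards it. When eolIsSignificant
    // is true, each line feed is emitted as a separate "\n" token
    // (assumption: only '\n' is treated as EOL in this sketch).
    static List<String> tokenize(Reader r, boolean eolIsSignificant) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                char ch = (char) c;
                if (Character.isWhitespace(ch)) {
                    // Whitespace ends the token being built, if any.
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    if (ch == '\n' && eolIsSignificant) {
                        tokens.add("\n");
                    }
                } else {
                    current.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush the final token when the input does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick  fox\njumps\tover";
        // EOL treated as whitespace: five word tokens.
        System.out.println(tokenize(new StringReader(text), false).size());
        // EOL significant: the same five words plus one "\n" token.
        System.out.println(tokenize(new StringReader(text), true).size());
    }
}
```

With the real class, the same choice is made through the eolIsSignificant argument of the two-argument constructor or of factory(boolean).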

Implementation note: This class was rewritten in April 2006, replacing the old StreamTokenizer-based implementation with a Unicode-compliant JFlex-based version.

Author:
Joseph Smarr (jsmarr@stanford.edu), Teg Grenager (grenager@stanford.edu), Roger Levy, Christopher Manning

Nested Class Summary
static class WhitespaceTokenizer.WhitespaceTokenizerFactory
          A factory which vends WhitespaceTokenizers.
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
WhitespaceTokenizer(java.io.Reader r)
          Constructs a new WhitespaceTokenizer
WhitespaceTokenizer(java.io.Reader r, boolean eolIsSignificant)
          Constructs a new WhitespaceTokenizer
 
Method Summary
static TokenizerFactory<Word> factory()
           
static TokenizerFactory<Word> factory(boolean eolIsSignificant)
           
protected  Word getNext()
          Internally fetches the next token.
static void main(java.lang.String[] args)
          Reads a file from the argument and prints its tokens one per line.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WhitespaceTokenizer

public WhitespaceTokenizer(java.io.Reader r)
Constructs a new WhitespaceTokenizer

Parameters:
r - The Reader that is its source.

WhitespaceTokenizer

public WhitespaceTokenizer(java.io.Reader r,
                           boolean eolIsSignificant)
Constructs a new WhitespaceTokenizer

Parameters:
r - The Reader that is its source.
eolIsSignificant - Whether EOL tokens should be returned.

Method Detail

getNext

protected Word getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<Word>
Returns:
the next token in the token stream, or null if none exists.

factory

public static TokenizerFactory<Word> factory()

factory

public static TokenizerFactory<Word> factory(boolean eolIsSignificant)

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Reads a file from the argument and prints its tokens one per line. This is mainly a testing aid, but it can also be useful standalone for turning a corpus into a one-token-per-line file.

Usage: java edu.stanford.nlp.process.WhitespaceTokenizer filename

Parameters:
args - Command line arguments
Throws:
java.io.IOException - If the file cannot be opened or read.
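
As a rough illustration of the one-token-per-line conversion that main performs, the sketch below uses only java.util.Scanner, whose default delimiter is whitespace. It is a stand-in under that assumption, not the code of main itself; OneTokenPerLineSketch and oneTokenPerLine are hypothetical names, and the sketch always treats EOL as whitespace.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

// Hypothetical stand-in, NOT part of edu.stanford.nlp.process.
class OneTokenPerLineSketch {

    // Reads whitespace-separated tokens and returns them one per line.
    static String oneTokenPerLine(Reader in) {
        StringBuilder out = new StringBuilder();
        // Scanner splits on runs of whitespace by default.
        try (Scanner sc = new Scanner(in)) {
            while (sc.hasNext()) {
                out.append(sc.next()).append('\n');
            }
        }
        return out.toString();
    }

    // Usage (assumed): java OneTokenPerLineSketch filename
    public static void main(String[] args) throws IOException {
        try (Reader r = Files.newBufferedReader(Paths.get(args[0]))) {
            System.out.print(oneTokenPerLine(r));
        }
    }
}
```

Redirecting standard output to a file gives the one-token-per-line corpus described above.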


Stanford NLP Group