WhitespaceTokenizer (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.process.AbstractTokenizer<T>
- - edu.stanford.nlp.process.WhitespaceTokenizer<T>

All Implemented Interfaces:

Tokenizer<T>, java.util.Iterator<T>
```
public class WhitespaceTokenizer<T extends HasWord>
extends AbstractTokenizer<T>
```
A WhitespaceTokenizer is a tokenizer that splits on and discards only whitespace characters. This implementation can return Word, CoreLabel, or other LexedToken objects. A parameter controls whether each end of line (EOL) is returned as a token or treated as ordinary whitespace. If EOL is a token, the class returns it as a Word with String value "\n".

Implementation note: This class was rewritten in April 2006 to replace the old StreamTokenizer-based implementation with a Unicode-compliant JFlex-based version. The tokenizer treats as whitespace almost exactly the same characters that the Java method Character.isWhitespace does. That is, whitespace is any Unicode SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR character other than the non-breaking space characters, or one of the control characters U+0009-U+000D or U+001C-U+001F. The one addition is that U+0085 is also allowed as a line-ending character, for compatibility with certain IBM systems. To include "spaces" inside tokens, represent them as the non-breaking space character U+00A0.
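
The whitespace definition above can be approximated in plain Java. The following is a minimal sketch, not the JFlex lexer the class actually uses: the class name `WhitespaceSketch` and the method `isTokenizerWhitespace` are hypothetical, and the only behavior modeled is "whatever `Character.isWhitespace` accepts, plus U+0085".

```java
final class WhitespaceSketch {

    /** Code point U+0085 (NEXT LINE), the one extra line-ending character described above. */
    static final int NEXT_LINE = 0x0085;

    /**
     * Approximates the tokenizer's notion of whitespace: every code point accepted by
     * Character.isWhitespace, plus U+0085. This is an assumption-based sketch, not the
     * library's own lexer rules.
     */
    static boolean isTokenizerWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint) || codePoint == NEXT_LINE;
    }

    public static void main(String[] args) {
        // Space, tab, U+001C, U+0085, non-breaking space, LINE SEPARATOR, ideographic space.
        int[] samples = {0x0020, 0x0009, 0x001C, 0x0085, 0x00A0, 0x2028, 0x3000};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isTokenizerWhitespace(cp));
        }
    }
}
```

Because U+00A0 is not whitespace under this predicate, a token such as `New` + U+00A0 + `York` would stay in one piece, which is why the non-breaking space is the recommended way to keep "spaces" inside tokens.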

Author:

Joseph Smarr (jsmarr@stanford.edu), Teg Grenager (grenager@stanford.edu), Roger Levy, Christopher Manning

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`WhitespaceTokenizer.WhitespaceTokenizerFactory<T extends HasWord>` A factory which vends WhitespaceTokenizers.

Field Summary
- Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
  NEWLINE_TOKEN, nextToken

Constructor Summary

Constructors
Constructor and Description
`WhitespaceTokenizer(LexedTokenFactory factory, java.io.Reader r, boolean eolIsSignificant)` Constructs a new WhitespaceTokenizer.

Method Summary

Modifier and Type	Method and Description
`static TokenizerFactory<Word>`	`factory()`
`static TokenizerFactory<Word>`	`factory(boolean eolIsSignificant)`
`protected T`	`getNext()` Internally fetches the next token.
`static void`	`main(java.lang.String[] args)` Reads a file from the argument and prints its tokens one per line.
`static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel>`	`newCoreLabelTokenizerFactory()`
`static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel>`	`newCoreLabelTokenizerFactory(java.lang.String options)`
`static WhitespaceTokenizer<CoreLabel>`	`newCoreLabelWhitespaceTokenizer(java.io.Reader r)`
`static WhitespaceTokenizer<CoreLabel>`	`newCoreLabelWhitespaceTokenizer(java.io.Reader r, boolean tokenizeNLs)`
`static WhitespaceTokenizer<Word>`	`newWordWhitespaceTokenizer(java.io.Reader r)`
`static WhitespaceTokenizer<Word>`	`newWordWhitespaceTokenizer(java.io.Reader r, boolean eolIsSignificant)`

Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.Iterator
forEachRemaining

Constructor Detail
- WhitespaceTokenizer
```
public WhitespaceTokenizer(LexedTokenFactory factory,
                           java.io.Reader r,
                           boolean eolIsSignificant)
```
  Constructs a new WhitespaceTokenizer.
  
  Parameters:
  
  factory - The LexedTokenFactory used to create the returned tokens.
  
  r - The Reader that is its source.
  
  eolIsSignificant - Whether EOL tokens should be returned.
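
The effect of `eolIsSignificant` can be illustrated without the library. The sketch below is a stand-alone approximation under stated assumptions: the class `EolSketch` and its helpers are hypothetical, tokens are plain strings rather than Word or CoreLabel objects, and the set of line-ending characters (CR, LF, CRLF counted once, U+0085) is an assumption rather than something the documentation above spells out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class EolSketch {

    /** Hypothetical helper: the line-ending characters this sketch recognises. */
    private static boolean isEol(int c) {
        return c == '\n' || c == '\r' || c == 0x0085;
    }

    /**
     * Splits the text on whitespace. When eolIsSignificant is true, each line ending
     * is also emitted as the token "\n"; otherwise line endings are discarded like
     * any other whitespace.
     */
    static List<String> tokenize(Reader r, boolean eolIsSignificant) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prev = -1;
        try {
            for (int c = r.read(); c != -1; c = r.read()) {
                if (Character.isWhitespace(c) || isEol(c)) {
                    // Whitespace ends the token being built, if any.
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    // Treat CR LF as a single line ending.
                    boolean secondHalfOfCrLf = (c == '\n' && prev == '\r');
                    if (eolIsSignificant && isEol(c) && !secondHalfOfCrLf) {
                        tokens.add("\n");
                    }
                } else {
                    current.append((char) c);
                }
                prev = c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "one two\nthree";
        System.out.println(tokenize(new StringReader(text), false));
        System.out.println(tokenize(new StringReader(text), true));
    }
}
```

With `eolIsSignificant` set to false the line break simply separates `two` from `three`; with it set to true an extra `"\n"` token appears between them, which is useful when downstream code needs to recover line or sentence boundaries.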

Method Detail

newCoreLabelTokenizerFactory

public static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> newCoreLabelTokenizerFactory(java.lang.String options)

newCoreLabelTokenizerFactory

public static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> newCoreLabelTokenizerFactory()

getNext
```
protected T getNext()
```
Internally fetches the next token.

Specified by:

getNext in class AbstractTokenizer<T extends HasWord>

Returns:

the next token in the token stream, or null if none exists.
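
The contract above (return the next token, or null when the stream is exhausted) is what lets a base class provide `hasNext`, `next`, and `peek` on top of a single `getNext` method. The following sketch shows one way such a lookahead could be layered. It is an illustration only: `LookaheadTokenizerSketch` and `ArrayTokenizerSketch` are hypothetical names, and the code is not taken from AbstractTokenizer.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

abstract class LookaheadTokenizerSketch<T> implements Iterator<T> {

    /** Buffered lookahead token; null means nothing is buffered. */
    private T nextToken;

    /** Subclasses return the next token, or null if none exists. */
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    /** Returns the next token without consuming it. */
    public T peek() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }
}

/** Hypothetical subclass that serves tokens from a fixed array. */
final class ArrayTokenizerSketch extends LookaheadTokenizerSketch<String> {
    private final String[] items;
    private int pos;

    ArrayTokenizerSketch(String... items) {
        this.items = items;
    }

    @Override
    protected String getNext() {
        return pos < items.length ? items[pos++] : null;
    }
}
```

A design consequence of using null as the end-of-stream signal is that null can never be a legitimate token value in a tokenizer built this way.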

newCoreLabelWhitespaceTokenizer

public static WhitespaceTokenizer<CoreLabel> newCoreLabelWhitespaceTokenizer(java.io.Reader r)

newCoreLabelWhitespaceTokenizer

public static WhitespaceTokenizer<CoreLabel> newCoreLabelWhitespaceTokenizer(java.io.Reader r,
                                                                             boolean tokenizeNLs)

newWordWhitespaceTokenizer

public static WhitespaceTokenizer<Word> newWordWhitespaceTokenizer(java.io.Reader r)

newWordWhitespaceTokenizer

public static WhitespaceTokenizer<Word> newWordWhitespaceTokenizer(java.io.Reader r,
                                                                   boolean eolIsSignificant)

factory

public static TokenizerFactory<Word> factory()

factory

public static TokenizerFactory<Word> factory(boolean eolIsSignificant)

main
```
public static void main(java.lang.String[] args)
                 throws java.io.IOException
```
Reads a file from the argument and prints its tokens one per line. This is mainly a testing aid, but it can also be quite useful standalone for turning a corpus into a file with one token per line. Usage: `java edu.stanford.nlp.process.WhitespaceTokenizer filename`

Parameters:

args - Command line arguments

Throws:

java.io.IOException - If the file cannot be opened or another I/O error occurs.
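
The one-token-per-line conversion described above can be sketched with the standard library alone. This version swaps in `java.util.Scanner`, whose default delimiter follows `Character.isWhitespace`, in place of WhitespaceTokenizer itself, so unlike the real class it does not treat U+0085 as a separator. The class name `OneTokenPerLineSketch` and the method `printTokens` are hypothetical.

```java
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Scanner;

final class OneTokenPerLineSketch {

    /** Writes each whitespace-delimited token from the reader on its own line. */
    static void printTokens(Reader in, PrintWriter out) {
        // Scanner's default delimiter is one or more Character.isWhitespace characters.
        Scanner scanner = new Scanner(in);
        while (scanner.hasNext()) {
            out.print(scanner.next());
            out.print('\n');
        }
        out.flush();
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        printTokens(new StringReader("A small  corpus,\nsplit on\twhitespace."),
                    new PrintWriter(buffer));
        System.out.print(buffer);
    }
}
```

Passing a `FileReader` in place of the `StringReader` would give a rough stand-in for the command-line usage shown above, with punctuation left attached to the neighboring word, since only whitespace separates tokens.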
