public class WhitespaceTokenizer<T extends HasWord> extends AbstractTokenizer<T>
Whitespace is defined in terms of the Java Character.isWhitespace method. That is, a whitespace character
is a Unicode SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR other than the non-breaking space
characters, or one of the control characters U+0009-U+000D or U+001C-U+001F. The one addition is
to also allow U+0085 as a line ending character, for compatibility with certain IBM systems.
To include "spaces" in tokens, represent them as the non-breaking space character U+00A0.

Modifier and Type | Class and Description |
---|---|
static class | WhitespaceTokenizer.WhitespaceTokenizerFactory<T extends HasWord> A factory which vends WhitespaceTokenizers. |
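
The whitespace rule in the class description can be checked against the JDK directly. The following is a minimal sketch and not part of the library: the class name WhitespaceCheck and the helper isTokenSeparator are hypothetical, and the helper simply combines Character.isWhitespace with the extra U+0085 case described above.

```java
final class WhitespaceCheck {

    /**
     * Hypothetical helper: true if the code point would separate tokens under
     * the rule in the class description (Character.isWhitespace, plus U+0085).
     */
    static boolean isTokenSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || codePoint == 0x0085;
    }

    public static void main(String[] args) {
        // Ordinary space, tab, a control separator, LINE SEPARATOR, NEL,
        // and the three non-breaking space characters.
        int[] samples = {0x0020, 0x0009, 0x001C, 0x2028, 0x0085, 0x00A0, 0x2007, 0x202F};
        for (int cp : samples) {
            System.out.printf("U+%04X separator=%b%n", cp, isTokenSeparator(cp));
        }
    }
}
```

Under this rule U+00A0 is not a separator, which is why it can be used to keep a "space" inside a single token.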
Fields inherited from class AbstractTokenizer: NEWLINE_TOKEN, nextToken
Constructor and Description |
---|
WhitespaceTokenizer(LexedTokenFactory factory, java.io.Reader r, boolean eolIsSignificant) Constructs a new WhitespaceTokenizer. |
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<Word> | factory() |
static TokenizerFactory<Word> | factory(boolean eolIsSignificant) |
protected T | getNext() Internally fetches the next token. |
static void | main(java.lang.String[] args) Reads a file from the argument and prints its tokens one per line. |
static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> | newCoreLabelTokenizerFactory() |
static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> | newCoreLabelTokenizerFactory(java.lang.String options) |
static WhitespaceTokenizer<CoreLabel> | newCoreLabelWhitespaceTokenizer(java.io.Reader r) |
static WhitespaceTokenizer<CoreLabel> | newCoreLabelWhitespaceTokenizer(java.io.Reader r, boolean tokenizeNLs) |
static WhitespaceTokenizer<Word> | newWordWhitespaceTokenizer(java.io.Reader r) |
static WhitespaceTokenizer<Word> | newWordWhitespaceTokenizer(java.io.Reader r, boolean eolIsSignificant) |
Methods inherited from class AbstractTokenizer: hasNext, next, peek, remove, tokenize
public WhitespaceTokenizer(LexedTokenFactory factory, java.io.Reader r, boolean eolIsSignificant)
Parameters:
r - The Reader that is its source.
eolIsSignificant - Whether eol tokens should be returned.
public static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> newCoreLabelTokenizerFactory(java.lang.String options)
public static WhitespaceTokenizer.WhitespaceTokenizerFactory<CoreLabel> newCoreLabelTokenizerFactory()
protected T getNext()
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static WhitespaceTokenizer<CoreLabel> newCoreLabelWhitespaceTokenizer(java.io.Reader r)
public static WhitespaceTokenizer<CoreLabel> newCoreLabelWhitespaceTokenizer(java.io.Reader r, boolean tokenizeNLs)
public static WhitespaceTokenizer<Word> newWordWhitespaceTokenizer(java.io.Reader r)
public static WhitespaceTokenizer<Word> newWordWhitespaceTokenizer(java.io.Reader r, boolean eolIsSignificant)
public static TokenizerFactory<Word> factory()
public static TokenizerFactory<Word> factory(boolean eolIsSignificant)
public static void main(java.lang.String[] args) throws java.io.IOException
Usage: java edu.stanford.nlp.process.WhitespaceTokenizer filename
Parameters:
args - Command line arguments
Throws:
java.io.IOException - If files cannot be opened, etc.
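
The behaviour documented on this page (split on whitespace, discard it, optionally report line endings) can be approximated with the JDK alone. The following is a standalone sketch under stated assumptions, not the library's implementation: the class WhitespaceSplitSketch, its methods, and the EOL_MARKER string are all hypothetical, and treating only \n, \r, \r\n and U+0085 as line endings is a simplification chosen for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplitSketch {

    // Hypothetical marker for a line-ending token; the library's own newline token may differ.
    static final String EOL_MARKER = "<EOL>";

    /** Splits the reader's text on whitespace; emits EOL_MARKER per line ending if requested. */
    static List<String> split(Reader r, boolean eolIsSignificant) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean lastWasCr = false;
        int c;
        while ((c = r.read()) != -1) {
            // Assumption: only these characters count as line endings in this sketch.
            boolean isEol = c == '\n' || c == '\r' || c == 0x0085;
            if (Character.isWhitespace(c) || c == 0x0085) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Report "\r\n" as a single line ending.
                if (eolIsSignificant && isEol && !(c == '\n' && lastWasCr)) {
                    tokens.add(EOL_MARKER);
                }
            } else {
                current.append((char) c); // U+00A0 lands here, so it stays inside the token
            }
            lastWasCr = (c == '\r');
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Convenience wrapper for in-memory text. */
    static List<String> splitString(String text, boolean eolIsSignificant) {
        try {
            return split(new StringReader(text), eolIsSignificant);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints one token per line, mirroring the documented command-line behaviour.
        for (String token : splitString("a b\u00A0c\nd", true)) {
            System.out.println(token);
        }
    }
}
```

In this sketch the non-breaking space in "b\u00A0c" does not split the token, while the newline is reported as a separate marker only when eolIsSignificant is true.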