edu.stanford.nlp.trees.international.arabic
Class ArabicTokenizer

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<Word>
      extended by edu.stanford.nlp.trees.international.arabic.ArabicTokenizer
All Implemented Interfaces:
Tokenizer<Word>, java.util.Iterator<Word>

public class ArabicTokenizer
extends AbstractTokenizer<Word>

An ArabicTokenizer is a simple tokenizer that splits off a few punctuation characters and otherwise splits on, and discards, whitespace characters. This implementation returns Word objects. The eolIsSignificant parameter controls whether an end-of-line (EOL) is returned as a token or treated as whitespace. When EOL is a token, the class returns it as a Word with String value "\n".
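
The behavior described above can be approximated with a small standalone sketch. This is not the Stanford implementation: the class name EolAwareSplitterSketch and the punctuation set ".,!?" are assumptions made for illustration, since this page does not list which characters the real class splits off. The sketch returns plain strings rather than Word objects, and it assumes that a lone '\r' counts as ordinary whitespace.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the source of ArabicTokenizer.
class EolAwareSplitterSketch {

    // Hypothetical punctuation set (assumption): the real set is not documented here.
    private static final String SPLIT_OFF = ".,!?";

    // Splits on whitespace, splits off the punctuation above, and optionally
    // emits "\n" as its own token when eolIsSignificant is true.
    static List<String> tokenize(Reader r, boolean eolIsSignificant) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                char ch = (char) c;
                if (ch == '\n') {
                    flush(current, tokens);
                    if (eolIsSignificant) {
                        tokens.add("\n");
                    }
                } else if (Character.isWhitespace(ch)) {
                    // Includes '\r', so "\r\n" yields a single "\n" token.
                    flush(current, tokens);
                } else if (SPLIT_OFF.indexOf(ch) >= 0) {
                    flush(current, tokens);
                    tokens.add(String.valueOf(ch));
                } else {
                    current.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, tokens);
        return tokens;
    }

    // Moves any pending characters into the token list.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "a b.\nc";
        // EOL treated as whitespace.
        System.out.println(tokenize(new StringReader(text), false)); // prints [a, b, ., c]
        // EOL returned as a token, so one extra token appears.
        System.out.println(tokenize(new StringReader(text), true).size()); // prints 5
    }
}
```

With eolIsSignificant set to false the line break simply separates "." from "c"; with it set to true the same input yields an additional "\n" token between them.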

Author:
Christopher Manning

Nested Class Summary
static class ArabicTokenizer.ArabicTokenizerFactory
          A factory which vends ArabicTokenizers.
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
ArabicTokenizer(java.io.Reader r)
          Constructs a new ArabicTokenizer that reads from the given Reader.
ArabicTokenizer(java.io.Reader r, boolean eolIsSignificant)
          Constructs a new ArabicTokenizer that reads from the given Reader, with control over whether EOL tokens are returned.
 
Method Summary
static TokenizerFactory<Word> factory()
           
static TokenizerFactory<Word> factory(boolean eolIsSignificant)
           
protected  Word getNext()
          Internally fetches the next token.
static void main(java.lang.String[] args)
          Reads a file from the argument and prints its tokens one per line.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicTokenizer

public ArabicTokenizer(java.io.Reader r)
Constructs a new ArabicTokenizer that reads from the given Reader.

Parameters:
r - The Reader that is its source.

ArabicTokenizer

public ArabicTokenizer(java.io.Reader r,
                       boolean eolIsSignificant)
Constructs a new ArabicTokenizer that reads from the given Reader, with control over whether EOL tokens are returned.

Parameters:
r - The Reader that is its source.
eolIsSignificant - Whether EOL tokens should be returned.

Method Detail

getNext

protected Word getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<Word>
Returns:
the next token in the token stream, or null if none exists.
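
The contract above (return the next token, or null when none remains) is enough to build hasNext, next and peek on top of a one-token lookahead buffer. The following is a minimal sketch of that pattern under stated assumptions: LookaheadTokenizerSketch and ArrayBackedTokenizerSketch are hypothetical classes written for illustration and are not the source of AbstractTokenizer, although the nextToken field is named after the inherited field listed on this page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative sketch only; a hypothetical base class, not AbstractTokenizer itself.
abstract class LookaheadTokenizerSketch<T> implements Iterator<T> {

    // One-token lookahead buffer; null means "nothing buffered yet".
    protected T nextToken;

    // Subclasses return the next token, or null when the stream is exhausted.
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    // Returns the upcoming token without consuming it.
    public T peek() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }
}

// Hypothetical subclass that serves tokens from a fixed array.
class ArrayBackedTokenizerSketch extends LookaheadTokenizerSketch<String> {

    private final Iterator<String> source;

    ArrayBackedTokenizerSketch(String... tokens) {
        this.source = Arrays.asList(tokens).iterator();
    }

    @Override
    protected String getNext() {
        return source.hasNext() ? source.next() : null;
    }

    public static void main(String[] args) {
        ArrayBackedTokenizerSketch t = new ArrayBackedTokenizerSketch("x", "y");
        System.out.println(t.peek());    // prints x
        System.out.println(t.next());    // prints x
        System.out.println(t.next());    // prints y
        System.out.println(t.hasNext()); // prints false
    }
}
```

In this design a subclass only has to implement getNext; because null signals the end of the stream, null can never be a legitimate token value.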

factory

public static TokenizerFactory<Word> factory()

factory

public static TokenizerFactory<Word> factory(boolean eolIsSignificant)

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Reads the file named by the argument and prints its tokens one per line. This is mainly a testing aid, but it can also be useful standalone for turning a corpus into a file with one token per line.

Usage: java edu.stanford.nlp.trees.international.arabic.ArabicTokenizer filename

Parameters:
args - Command line arguments
Throws:
java.io.IOException - If the file cannot be opened or read.


Stanford NLP Group