edu.stanford.nlp.international.arabic.process
Class ArabicTokenizer<T extends HasWord>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<T>
      extended by edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>
All Implemented Interfaces:
Tokenizer<T>, Iterator<T>

public class ArabicTokenizer<T extends HasWord>
extends AbstractTokenizer<T>

Tokenizer for UTF-8 Arabic. Buckwalter encoding is *not* supported.

TODO(spenceg): Merge in rules from IBM tokenizer (v5).
TODO(spenceg): Add XML escaping.
TODO(spenceg): When running from the command line, the tokenizer does not produce the correct number of newline-delimited lines for the ATB data sets.

Author:
Spence Green

Nested Class Summary
static class ArabicTokenizer.ArabicTokenizerFactory<T extends HasWord>
           
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
ArabicTokenizer(Reader r, LexedTokenFactory<T> tf, Properties lexerProperties)
           
 
Method Summary
static TokenizerFactory<CoreLabel> atbFactory()
           
static TokenizerFactory<CoreLabel> factory()
           
protected  T getNext()
          Internally fetches the next token.
static void main(String[] args)
           
static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r, Properties lexerProperties)
           
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicTokenizer

public ArabicTokenizer(Reader r,
                       LexedTokenFactory<T> tf,
                       Properties lexerProperties)
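
Because the constructor takes a Reader, byte-to-character decoding is the caller's responsibility, and the class is documented as a tokenizer for UTF-8 Arabic. The following is a minimal sketch, using only the Java standard library, of building a Reader that decodes UTF-8 explicitly instead of relying on the platform default charset. The names Utf8ReaderSketch, utf8Reader, and readAll are hypothetical helpers introduced for illustration; they are not part of the Stanford API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the Stanford API.
final class Utf8ReaderSketch {

    // Wraps raw bytes in a Reader that always decodes them as UTF-8,
    // independent of the platform default charset.
    static Reader utf8Reader(byte[] bytes) {
        return new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
    }

    // Drains a Reader into a String so the decoded text can be inspected.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Arabic sample text written with Unicode escapes so the source file
        // compiles the same way under any source encoding.
        String text = "\u0645\u0631\u062D\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        Reader reader = utf8Reader(bytes);
        // Prints whether the decoded characters match the original text.
        System.out.println(readAll(reader).equals(text));
    }
}
```

With the Stanford library on the classpath, a Reader built this way is the kind of object that would be passed as the r argument of this constructor or of newArabicTokenizer; which keys lexerProperties accepts is not documented on this page.
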
Method Detail

newArabicTokenizer

public static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r,
                                                            Properties lexerProperties)

getNext

protected T getNext()
Description copied from class: AbstractTokenizer
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
Returns:
the next token in the token stream, or null if none exists.
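
The contract above (getNext returns the next token, or null when none exists) is the usual basis for a one-token lookahead iterator. The sketch below shows how inherited methods such as hasNext, next, peek, and tokenize could be layered on top of a null-returning getNext and a nextToken buffer. It is an assumption about the general pattern, not the actual AbstractTokenizer source; GetNextSketch, LookaheadTokenizer, and WhitespaceTokenizer are hypothetical names, and the whitespace splitting is a toy stand-in for the real Arabic lexer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration of the lookahead pattern; not the Stanford source.
final class GetNextSketch {

    abstract static class LookaheadTokenizer<T> implements Iterator<T> {
        // One-token buffer; null means "nothing buffered yet".
        protected T nextToken;

        // Subclasses return the next token, or null if none exists.
        protected abstract T getNext();

        @Override
        public boolean hasNext() {
            if (nextToken == null) {
                nextToken = getNext();
            }
            return nextToken != null;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T result = nextToken;
            nextToken = null; // consume the buffered token
            return result;
        }

        // Returns the next token without consuming it.
        public T peek() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return nextToken;
        }

        // Collects all remaining tokens into a list.
        public List<T> tokenize() {
            List<T> tokens = new ArrayList<>();
            while (hasNext()) {
                tokens.add(next());
            }
            return tokens;
        }
    }

    // Toy subclass that splits on whitespace.
    static final class WhitespaceTokenizer extends LookaheadTokenizer<String> {
        private final String[] parts;
        private int position = 0;

        WhitespaceTokenizer(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override
        protected String getNext() {
            return position < parts.length ? parts[position++] : null;
        }
    }

    public static void main(String[] args) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer("one two  three");
        System.out.println(tokenizer.peek());     // looks at the first token without consuming it
        System.out.println(tokenizer.tokenize()); // drains every remaining token, including the peeked one
    }
}
```

Using null as the end-of-stream signal keeps subclasses small, since they only implement getNext, at the cost that null can never be a legitimate token value.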

factory

public static TokenizerFactory<CoreLabel> factory()

atbFactory

public static TokenizerFactory<CoreLabel> atbFactory()

main

public static void main(String[] args)
Parameters:
args - command-line arguments


Stanford NLP Group