edu.stanford.nlp.international.arabic.process
Class ArabicTokenizer<T extends HasWord>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<T>
      extended by edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>
All Implemented Interfaces:
Tokenizer<T>, Iterator<T>

public class ArabicTokenizer<T extends HasWord>
extends AbstractTokenizer<T>

Tokenizer for UTF-8 Arabic. Buckwalter encoding is *not* supported.

A single ArabicTokenizer instance is not thread safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single ArabicTokenizerFactory instance is also not thread safe, because it keeps its options in a local variable.
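
One way to follow the threading note above is to build a fresh tokenizer inside each task rather than sharing one across threads. The sketch below only illustrates that pattern: StandInTokenizer, tokenizeLine and tokenizeAll are hypothetical names, and the stand-in splits on whitespace with java.util.Scanner instead of calling the real ArabicTokenizer, so that the example runs without the Stanford jar.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerTaskTokenizerSketch {

    /** Hypothetical stand-in for a stateful tokenizer; NOT the Stanford class. */
    static final class StandInTokenizer {
        // Mutable cursor state: the reason an instance must not be shared.
        private final Scanner scanner;

        StandInTokenizer(StringReader reader) {
            this.scanner = new Scanner(reader);
        }

        List<String> tokenize() {
            List<String> tokens = new ArrayList<>();
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
            return tokens;
        }
    }

    /** Creates a new tokenizer for every call, so no state crosses threads. */
    static List<String> tokenizeLine(String line) {
        return new StandInTokenizer(new StringReader(line)).tokenize();
    }

    /** Tokenizes each line in its own task; results keep the input order. */
    static List<List<String>> tokenizeAll(List<String> lines, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String line : lines) {
                pending.add(pool.submit(() -> tokenizeLine(line)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ذهب الولد", "إلى المدرسة اليوم");
        System.out.println(tokenizeAll(lines, 2));
    }
}
```

With the real classes, the same shape would presumably apply: each task obtains its own tokenizer (and, given the note about the factory, its own factory) instead of reusing a shared instance.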

TODO(spenceg): Merge in rules from IBM tokenizer (v5).
TODO(spenceg): Add XML escaping.
TODO(spenceg): When running from the command line, the tokenizer does not produce the correct number of newline-delimited lines for the ATB data sets.

Author:
Spence Green

Nested Class Summary
static class ArabicTokenizer.ArabicTokenizerFactory<T extends HasWord>
           
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
ArabicTokenizer(Reader r, LexedTokenFactory<T> tf, Properties lexerProperties)
           
 
Method Summary
static TokenizerFactory<CoreLabel> atbFactory()
           
static TokenizerFactory<CoreLabel> factory()
           
protected  T getNext()
          Internally fetches the next token.
static void main(String[] args)
          A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r, Properties lexerProperties)
           
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicTokenizer

public ArabicTokenizer(Reader r,
                       LexedTokenFactory<T> tf,
                       Properties lexerProperties)
Method Detail

newArabicTokenizer

public static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r,
                                                            Properties lexerProperties)

getNext

protected T getNext()
Description copied from class: AbstractTokenizer
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
Returns:
the next token in the token stream, or null if none exists.
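
The contract above (getNext returns the next token, or null when none exists) is enough to build the inherited hasNext, next, peek and tokenize methods on top of a one-token buffer. The following is a minimal sketch of that layering, not the actual AbstractTokenizer source: LookaheadTokenizerSketch and ArrayBackedTokenizer are hypothetical names, and the exception behavior at end of stream is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical base class that buffers one token obtained from getNext(). */
abstract class LookaheadTokenizerSketch<T> implements Iterator<T> {
    // Buffered lookahead token; null means "nothing buffered yet".
    private T nextToken;

    /** Subclasses return the next token, or null when the stream is exhausted. */
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = nextToken;
        nextToken = null; // consume the buffered token
        return result;
    }

    /** Returns the next token without consuming it. */
    public T peek() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }

    /** Drains the remaining tokens into a list. */
    public List<T> tokenize() {
        List<T> out = new ArrayList<>();
        while (hasNext()) {
            out.add(next());
        }
        return out;
    }
}

/** Hypothetical subclass that serves tokens from a fixed array. */
final class ArrayBackedTokenizer extends LookaheadTokenizerSketch<String> {
    private final String[] tokens;
    private int pos;

    ArrayBackedTokenizer(String... tokens) {
        this.tokens = tokens.clone();
    }

    @Override
    protected String getNext() {
        return pos < tokens.length ? tokens[pos++] : null;
    }
}
```

In this sketch, null doubles as the end-of-stream signal and the "buffer empty" marker, which is why a subclass following this contract should never return null for a real token.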

factory

public static TokenizerFactory<CoreLabel> factory()

atbFactory

public static TokenizerFactory<CoreLabel> atbFactory()

main

public static void main(String[] args)
A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding). Performs punctuation splitting and light tokenization by default. Orthographic normalization options are available and can be enabled with command-line options. The following normalization options are provided:

Parameters:
args -


Stanford NLP Group