edu.stanford.nlp.international.arabic.process
Class ArabicTokenizer<T extends HasWord>
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>
- All Implemented Interfaces:
- Tokenizer<T>, Iterator<T>
public class ArabicTokenizer<T extends HasWord>
- extends AbstractTokenizer<T>
Tokenizer for UTF-8 encoded Arabic text. Buckwalter encoding is *not* supported.
TODO(spenceg): Merge in rules from the IBM tokenizer (v5).
TODO(spenceg): Add XML escaping.
TODO(spenceg): When run from the command line, the tokenizer does not
produce the correct number of newline-delimited lines for the ATB data
sets.
- Author:
- Spence Green
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ArabicTokenizer
public ArabicTokenizer(Reader r,
LexedTokenFactory<T> tf,
Properties lexerProperties)
newArabicTokenizer
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r,
Properties lexerProperties)
getNext
protected T getNext()
- Description copied from class:
AbstractTokenizer
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<T extends HasWord>
- Returns:
- the next token in the token stream, or null if none exists.
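The contract above (getNext returns the next token, or null when the stream is exhausted) is what lets a tokenizer be used as an Iterator<T>. The following is a minimal, self-contained sketch of that lookahead pattern. It is not the Stanford NLP source: the classes LookaheadTokenizer and WhitespaceTokenizer, and the static helper tokenize, are hypothetical names introduced only for illustration, and the whitespace splitting stands in for the real Arabic lexer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a tokenizer base class; an assumption made for
// illustration, not the actual AbstractTokenizer implementation.
abstract class LookaheadTokenizer<T> implements Iterator<T> {
    // One-token buffer; null means nothing is buffered yet.
    private T nextToken;

    // Subclasses return the next token, or null if none exists.
    protected abstract T getNext();

    @Override
    public boolean hasNext() {
        // Fetch a token only when the buffer is empty.
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Hand out the buffered token and clear the buffer.
        T result = nextToken;
        nextToken = null;
        return result;
    }
}

// Hypothetical subclass that splits on whitespace instead of running a lexer.
final class WhitespaceTokenizer extends LookaheadTokenizer<String> {
    private final String[] parts;
    private int index = 0;

    WhitespaceTokenizer(String text) {
        String trimmed = text.trim();
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    protected String getNext() {
        // null signals the end of the token stream.
        return index < parts.length ? parts[index++] : null;
    }

    // Convenience helper: drain the iterator into a list.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> it = new WhitespaceTokenizer(text);
        while (it.hasNext()) {
            tokens.add(it.next());
        }
        return tokens;
    }
}
```

In this sketch, callers only ever see hasNext() and next(); getNext() stays protected, mirroring the visibility shown in the signature above. Because null doubles as the end-of-stream marker, a subclass written this way must never return null for a real token.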
factory
public static TokenizerFactory<CoreLabel> factory()
atbFactory
public static TokenizerFactory<CoreLabel> atbFactory()
main
public static void main(String[] args)
- Parameters:
args - command-line arguments
Stanford NLP Group