edu.stanford.nlp.international.arabic.process
Class ArabicTokenizer<T extends HasWord>
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>
- All Implemented Interfaces:
- Tokenizer<T>, Iterator<T>
public class ArabicTokenizer<T extends HasWord>
- extends AbstractTokenizer<T>
Tokenizer for UTF-8 Arabic. Buckwalter encoding is *not* supported.
A single instance of an ArabicTokenizer is not thread safe, because it
uses a non-thread-safe JFlex object to do the processing. Multiple
instances can be created safely, though. A single instance of an
ArabicTokenizerFactory is also not thread safe, because it keeps its
options in a local variable.
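Since separate instances are safe but a shared one is not, one common pattern is to give each worker thread its own instance. The sketch below shows that pattern with `ThreadLocal`. `StatefulScanner` is a hypothetical stand-in for the tokenizer (it is not a library class), so the example runs without any dependencies; in real code the supplier would construct a tokenizer instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PerThreadInstanceSketch {

  /** Hypothetical stand-in for a non-thread-safe tokenizer; not a library class. */
  static final class StatefulScanner {
    private int position = 0; // mutable state that must never be shared across threads

    int advance() {
      return ++position;
    }
  }

  // Each thread lazily creates and then reuses its own scanner.
  static final ThreadLocal<StatefulScanner> SCANNER =
      ThreadLocal.withInitial(StatefulScanner::new);

  /** Starts the given number of threads and counts the distinct scanners they used. */
  static int distinctScannersUsed(int threads) {
    Set<StatefulScanner> seen = ConcurrentHashMap.newKeySet();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        StatefulScanner scanner = SCANNER.get(); // this thread's private instance
        scanner.advance();
        seen.add(scanner);
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for workers", e);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    System.out.println(distinctScannersUsed(4));
  }
}
```

Every thread sees a different scanner, while repeated calls to `SCANNER.get()` on one thread return the same object.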
TODO(spenceg): Merge in rules from ibm tokenizer (v5).
TODO(spenceg): Add XML escaping
TODO(spenceg): When running from the command line, the tokenizer does not
produce the correct number of newline-delimited lines for the ATB data
sets.
- Author:
- Spence Green
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ArabicTokenizer
public ArabicTokenizer(Reader r,
LexedTokenFactory<T> tf,
Properties lexerProperties)
newArabicTokenizer
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(Reader r,
Properties lexerProperties)
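The `lexerProperties` argument is a standard `java.util.Properties` object. Below is a minimal sketch of building one from a comma-separated option string. It assumes, since this page does not say so, that the option names documented under `main` serve as property keys and that the value `"true"` enables an option. `optionsToProperties` is a hypothetical helper, not a library method.

```java
import java.util.Properties;

final class LexerPropertiesSketch {

  /**
   * Hypothetical helper: turns "optA, optB" into Properties with each
   * option name mapped to "true". The key/value convention is an assumption.
   */
  static Properties optionsToProperties(String options) {
    Properties props = new Properties();
    for (String raw : options.split(",")) {
      String option = raw.trim();
      if (option.isEmpty()) {
        continue; // ignore empty entries such as those left by stray commas
      }
      props.setProperty(option, "true");
    }
    return props;
  }

  public static void main(String[] args) {
    Properties props = optionsToProperties("normArDigits, removeTatweel");
    System.out.println(props.getProperty("normArDigits"));
  }
}
```

The resulting object could then be passed wherever a `Properties lexerProperties` parameter is expected, provided the assumed key convention holds.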
getNext
protected T getNext()
- Description copied from class:
AbstractTokenizer
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<T extends HasWord>
- Returns:
- the next token in the token stream, or null if none exists.
factory
public static TokenizerFactory<CoreLabel> factory()
atbFactory
public static TokenizerFactory<CoreLabel> atbFactory()
main
public static void main(String[] args)
- A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
Performs punctuation splitting and light tokenization by default.
Orthographic normalization options are available and can be enabled with
command-line options.
The following normalization options are provided:
useUTF8Ellipsis
: Replaces sequences of three or more full stops with …
normArDigits
: Convert Arabic digits to ASCII equivalents
normArPunc
: Convert Arabic punctuation to ASCII equivalents
normAlif
: Change all alif forms to bare alif
normYa
: Map ya to alif maqsura
removeDiacritics
: Strip all diacritics
removeTatweel
: Strip tatweel elongation character
removeQuranChars
: Remove diacritics that appear in the Quran
removeProMarker
: Remove the ATB null pronoun marker
removeSegMarker
: Remove the ATB clitic segmentation marker
removeMorphMarker
: Remove the ATB morpheme boundary markers
atbEscaping
: Replace left/right parentheses with ATB escape characters
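To make a few of the options above concrete, here is a plain-Java illustration of the kind of mapping that `normArDigits`, `removeTatweel`, and `removeDiacritics` describe. It is not the tokenizer's implementation. The method names are hypothetical, and the character ranges (Arabic-Indic digits U+0660 to U+0669, tatweel U+0640, marks U+064B to U+0652) are assumptions about what each option covers.

```java
final class ArabicNormalizationSketch {

  /** Maps Arabic-Indic digits (U+0660..U+0669) to ASCII '0'..'9'. */
  static String asciiDigits(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= '\u0660' && c <= '\u0669') {
        sb.append((char) ('0' + (c - '\u0660')));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Removes the tatweel (kashida) elongation character, U+0640. */
  static String stripTatweel(String s) {
    return s.replace("\u0640", "");
  }

  /** Removes the marks in U+064B..U+0652 (fathatan through sukun). */
  static String stripDiacritics(String s) {
    return s.replaceAll("[\u064B-\u0652]", "");
  }

  public static void main(String[] args) {
    // Three Arabic-Indic digits (one, two, three) become ASCII digits.
    System.out.println(asciiDigits("\u0661\u0662\u0663"));
  }
}
```

Each function is independent, so they can be applied in any combination, much as the options can be enabled individually.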
- Parameters:
args
-
Stanford NLP Group