edu.stanford.nlp.international.arabic.process
Class ArabicTokenizer<T extends HasWord>
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>
- All Implemented Interfaces:
- Tokenizer<T>, java.util.Iterator<T>
public class ArabicTokenizer<T extends HasWord>
- extends AbstractTokenizer<T>
Tokenizer for UTF-8 Arabic. Buckwalter encoding is *not* supported.
A single ArabicTokenizer instance is not thread-safe, because it
uses a non-thread-safe JFlex lexer to do the processing. Multiple
instances can be created safely, though. A single
ArabicTokenizerFactory instance is also not thread-safe, because it keeps its
options in a local variable.
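The thread-safety note above suggests giving each task its own instance rather than sharing one. The following is a minimal sketch of that pattern using only the JDK. StatefulTokenizer is a hypothetical stand-in (a whitespace splitter with a mutable cursor), not ArabicTokenizer itself, and tokenizeAll is an illustrative helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskTokenizerSketch {

  // Hypothetical stand-in for a stateful, non-thread-safe tokenizer.
  static final class StatefulTokenizer {
    private final String[] parts;
    private int pos = 0; // mutable cursor: unsafe to share across threads

    StatefulTokenizer(String text) {
      String trimmed = text.trim();
      this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the next token, or null if none exists.
    String next() {
      return pos < parts.length ? parts[pos++] : null;
    }
  }

  // Tokenizes each document on a worker thread; results keep input order.
  static List<List<String>> tokenizeAll(List<String> docs, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String doc : docs) {
        Callable<List<String>> task = () -> {
          // A fresh instance per task, so no tokenizer state is shared.
          StatefulTokenizer tok = new StatefulTokenizer(doc);
          List<String> out = new ArrayList<>();
          for (String t = tok.next(); t != null; t = tok.next()) {
            out.add(t);
          }
          return out;
        };
        futures.add(pool.submit(task));
      }
      List<List<String>> results = new ArrayList<>();
      for (Future<List<String>> f : futures) {
        try {
          results.add(f.get());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(tokenizeAll(Arrays.asList("a b c", "d e"), 2));
  }
}
```

With the real classes, the same idea would presumably mean one tokenizer (and one factory) per thread or per task; that mapping is an assumption drawn from the note above, not a documented recipe.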
TODO(spenceg): Merge in rules from ibm tokenizer (v5).
TODO(spenceg): Add XML escaping
TODO(spenceg): When running from the command line, the tokenizer does not
produce the correct number of newline-delimited lines for the ATB data
sets.
- Author:
- Spence Green
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ArabicTokenizer
public ArabicTokenizer(java.io.Reader r,
LexedTokenFactory<T> tf,
java.util.Properties lexerProperties)
newArabicTokenizer
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(java.io.Reader r,
java.util.Properties lexerProperties)
getNext
protected T getNext()
- Description copied from class:
AbstractTokenizer
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<T extends HasWord>
- Returns:
- the next token in the token stream, or null if none exists.
factory
public static TokenizerFactory<CoreLabel> factory()
atbFactory
public static TokenizerFactory<CoreLabel> atbFactory()
main
public static void main(java.lang.String[] args)
- A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
Performs punctuation splitting and light tokenization by default.
Orthographic normalization options are available and can be enabled with
command-line options.
Currently, this tokenizer does not do line splitting. It normalizes non-printing
line separators across platforms and prints the system default line separator
to the output.
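As an illustration of the line-separator behavior described above, here is a minimal JDK-only sketch. It assumes that "normalizing" means mapping every platform-specific line break to the system default separator; the class and method names are hypothetical, and this is not the lexer's actual rule set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorSketch {

  // \R (Java 8+) matches any Unicode line-break sequence and treats
  // "\r\n" as a single break rather than two.
  private static final Pattern ANY_LINE_BREAK = Pattern.compile("\\R");

  // Replaces every line break with the system default line separator.
  static String normalize(String text) {
    String sep = Matcher.quoteReplacement(System.lineSeparator());
    return ANY_LINE_BREAK.matcher(text).replaceAll(sep);
  }

  public static void main(String[] args) {
    // Mixed Windows, old Mac, and Unix line endings in one string.
    System.out.print(normalize("one\r\ntwo\rthree\nfour\n"));
  }
}
```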
The following normalization options are provided:
useUTF8Ellipsis
: Replace sequences of three or more full stops with …
normArDigits
: Convert Arabic digits to ASCII equivalents
normArPunc
: Convert Arabic punctuation to ASCII equivalents
normAlif
: Change all alif forms to bare alif
normYa
: Map ya to alif maqsura
removeDiacritics
: Strip all diacritics
removeTatweel
: Strip tatweel elongation character
removeQuranChars
: Remove diacritics that appear in the Quran
removeProMarker
: Remove the ATB null pronoun marker
removeSegMarker
: Remove the ATB clitic segmentation marker
removeMorphMarker
: Remove the ATB morpheme boundary markers
atbEscaping
: Replace left/right parentheses with ATB escape characters
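To make a few of the options above concrete, the sketch below shows what such normalizations could look like as plain string transforms. It is an illustration under stated assumptions, not the tokenizer's implementation: the method names merely mirror the option names, normArDigits is assumed to cover the Arabic-Indic digits U+0660 to U+0669, and removeDiacritics is assumed to cover the marks U+064B to U+0652. The exact character sets the lexer handles may differ.

```java
class ArabicNormalizationSketch {

  // Assumed range: Arabic-Indic digits U+0660..U+0669 map to ASCII '0'..'9'.
  static String normArDigits(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= '\u0660' && c <= '\u0669') {
        sb.append((char) ('0' + (c - '\u0660')));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // Tatweel (kashida) is the elongation character U+0640.
  static String removeTatweel(String s) {
    return s.replace("\u0640", "");
  }

  // Assumed set: fathatan through sukun (U+064B..U+0652).
  static String removeDiacritics(String s) {
    return s.replaceAll("[\u064B-\u0652]", "");
  }

  // Three or more consecutive full stops become a single U+2026 ellipsis.
  static String useUTF8Ellipsis(String s) {
    return s.replaceAll("\\.{3,}", "\u2026");
  }

  public static void main(String[] args) {
    String raw = "\u0643\u064E\u062A\u0640\u0640\u0628 \u0661\u0662\u0663....";
    String out = useUTF8Ellipsis(normArDigits(removeTatweel(removeDiacritics(raw))));
    System.out.println(out);
  }
}
```

Each transform is independent of the others here, so they can be applied in any order.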
- Parameters:
args
-
Stanford NLP Group