edu.stanford.nlp.international.french.process
Class FrenchTokenizer<T extends HasWord>
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer<T>
edu.stanford.nlp.international.french.process.FrenchTokenizer<T>
- All Implemented Interfaces:
- Tokenizer<T>, java.util.Iterator<T>
public class FrenchTokenizer<T extends HasWord>
- extends AbstractTokenizer<T>
Tokenizer for raw French text. This tokenization scheme is a derivative
of PTB tokenization, but with extra rules for French elision and compounding.
The tokenizer implicitly inserts segmentation markers by not normalizing
the apostrophe and hyphen. Detokenization can thus be performed by right-concatenating
apostrophes and left-concatenating hyphens.
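The detokenization rule above can be sketched in plain Java. This is one reading of the rule, not code from the library: a token ending in an apostrophe is joined to the token on its right, and a token starting with a hyphen is joined to the token on its left. The class name `FrenchDetokenizerSketch`, the helper `detokenize`, and the hand-written token list are all hypothetical; the tokens are illustrative and were not produced by running FrenchTokenizer.

```java
import java.util.Arrays;
import java.util.List;

class FrenchDetokenizerSketch {

    // Hypothetical helper, not part of the Stanford API.
    // Assumes elided tokens end with an apostrophe ("l'") and that
    // split compound parts start with a hyphen ("-il").
    static String detokenize(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        boolean glueNext = false; // previous token ended with an apostrophe
        for (String tok : tokens) {
            // A token that starts with a hyphen attaches to the token on its left.
            boolean glueLeft = tok.length() > 1 && tok.startsWith("-");
            if (sb.length() > 0 && !glueNext && !glueLeft) {
                sb.append(' ');
            }
            sb.append(tok);
            // A token that ends with an apostrophe attaches to the token on its right.
            glueNext = tok.length() > 1 && tok.endsWith("'");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("l'", "homme", "a", "-t", "-il", "raison");
        System.out.println(detokenize(tokens));
    }
}
```

All other tokens are joined with a single space, so punctuation spacing is not restored by this sketch.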
A single instance of a FrenchTokenizer is not thread-safe, because it
uses a non-thread-safe JFlex lexer object to do the processing. Multiple
instances can be created safely, though. A single instance of a
FrenchTokenizerFactory is also not thread-safe, because it keeps its
options in a local variable.
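One way to respect this constraint is to create a fresh tokenizer inside each concurrent task rather than sharing one instance. The sketch below shows only the pattern and does not depend on the Stanford library: `StatefulTokenizer` is a hypothetical whitespace-splitting stand-in with a mutable cursor, and `tokenizeAll` is an illustrative helper. In real code the stand-in would presumably be replaced by a call to `FrenchTokenizer.newFrenchTokenizer(reader, props)` with a separate `Reader` per task, and any factory would likewise be created per thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskTokenizerSketch {

    // Hypothetical stand-in for a stateful, non-thread-safe tokenizer.
    static final class StatefulTokenizer {
        private final String[] parts;
        private int pos = 0; // mutable cursor: unsafe to share across threads

        StatefulTokenizer(String text) {
            String trimmed = text.trim();
            parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        boolean hasNext() { return pos < parts.length; }

        String next() { return parts[pos++]; }
    }

    // Tokenizes each line in its own task; every task builds its own tokenizer.
    static List<List<String>> tokenizeAll(List<String> lines, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String line : lines) {
                Callable<List<String>> task = () -> {
                    StatefulTokenizer tok = new StatefulTokenizer(line); // one instance per task
                    List<String> out = new ArrayList<>();
                    while (tok.hasNext()) {
                        out.add(tok.next());
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(Arrays.asList("le chat dort", "il pleut"), 2));
    }
}
```

Because no tokenizer instance ever crosses a thread boundary, no synchronization is needed around the tokenizer itself.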
- Author:
- Spence Green
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
FrenchTokenizer
public FrenchTokenizer(java.io.Reader r,
LexedTokenFactory<T> tf,
java.util.Properties lexerProperties)
newFrenchTokenizer
public static FrenchTokenizer<CoreLabel> newFrenchTokenizer(java.io.Reader r,
java.util.Properties lexerProperties)
getNext
protected T getNext()
- Description copied from class:
AbstractTokenizer
- Internally fetches the next token.
- Specified by:
getNext
in class AbstractTokenizer<T extends HasWord>
- Returns:
- the next token in the token stream, or null if none exists.
factory
public static TokenizerFactory<CoreLabel> factory()
ftbFactory
public static TokenizerFactory<CoreLabel> ftbFactory()
main
public static void main(java.lang.String[] args)
- A fast, rule-based tokenizer for Modern Standard French.
Performs punctuation splitting and light tokenization by default.
Currently, this tokenizer does not do line splitting. It assumes that the input
file is delimited by the system line separator. The output will be equivalently
delimited.
- Parameters:
- args
Stanford NLP Group