edu.stanford.nlp.international.french.process
Class FrenchTokenizer<T extends HasWord>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer<T>
      extended by edu.stanford.nlp.international.french.process.FrenchTokenizer<T>
All Implemented Interfaces:
Tokenizer<T>, java.util.Iterator<T>

public class FrenchTokenizer<T extends HasWord>
extends AbstractTokenizer<T>

Tokenizer for raw French text. The tokenization scheme is derived from Penn Treebank (PTB) tokenization, with extra rules for French elision and compounding.

The tokenizer implicitly inserts segmentation markers by not normalizing the apostrophe and hyphen. Detokenization can thus be performed by right-concatenating apostrophes and left-concatenating hyphens.
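The detokenization rule above can be illustrated with a minimal, library-independent sketch. It assumes that "right-concatenating apostrophes" means a token ending in an apostrophe (for example l') is glued to the token on its right, and that "left-concatenating hyphens" means a token starting with a hyphen (for example -il) is glued to the token on its left. The class name FrenchDetokenizerSketch and the method detokenize are hypothetical and are not part of this package. Punctuation spacing is deliberately ignored.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of edu.stanford.nlp: a sketch of the
// detokenization rule under the assumptions stated in the lead-in.
final class FrenchDetokenizerSketch {

    // True for the ASCII apostrophe and the typographic apostrophe (U+2019),
    // since the tokenizer is described as not normalizing the apostrophe.
    private static boolean isApostrophe(char c) {
        return c == '\'' || c == '\u2019';
    }

    static String detokenize(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        boolean glueToNext = false; // previous token ended with an apostrophe
        for (String tok : tokens) {
            if (tok.isEmpty()) {
                continue;
            }
            // A hyphen-initial token such as "-il" attaches to the token on its left.
            boolean glueToPrevious = tok.length() > 1 && tok.charAt(0) == '-';
            if (sb.length() > 0 && !glueToNext && !glueToPrevious) {
                sb.append(' ');
            }
            sb.append(tok);
            // An apostrophe-final token such as "l'" attaches to the token on its right.
            glueToNext = tok.length() > 1 && isApostrophe(tok.charAt(tok.length() - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(detokenize(Arrays.asList("l'", "homme", "dit", "-il")));
    }
}
```

The exact token boundaries produced for longer compounds are not specified here, so the sketch only relies on the leading hyphen and trailing apostrophe as markers.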

A single instance of a FrenchTokenizer is not thread-safe, as it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a FrenchTokenizerFactory is also not thread-safe, as it keeps its options in a local variable.

Author:
Spence Green

Nested Class Summary
static class FrenchTokenizer.FrenchTokenizerFactory<T extends HasWord>
           
 
Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
FrenchTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties)
           
 
Method Summary
static TokenizerFactory<CoreLabel> factory()
           
static TokenizerFactory<CoreLabel> ftbFactory()
           
protected  T getNext()
          Internally fetches the next token.
static void main(java.lang.String[] args)
          A fast, rule-based tokenizer for Modern Standard French.
static FrenchTokenizer<CoreLabel> newFrenchTokenizer(java.io.Reader r, java.util.Properties lexerProperties)
           
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FrenchTokenizer

public FrenchTokenizer(java.io.Reader r,
                       LexedTokenFactory<T> tf,
                       java.util.Properties lexerProperties)
Method Detail

newFrenchTokenizer

public static FrenchTokenizer<CoreLabel> newFrenchTokenizer(java.io.Reader r,
                                                            java.util.Properties lexerProperties)

getNext

protected T getNext()
Description copied from class: AbstractTokenizer
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer<T extends HasWord>
Returns:
the next token in the token stream, or null if none exists.

factory

public static TokenizerFactory<CoreLabel> factory()

ftbFactory

public static TokenizerFactory<CoreLabel> ftbFactory()

main

public static void main(java.lang.String[] args)
A fast, rule-based tokenizer for Modern Standard French. Performs punctuation splitting and light tokenization by default.

Currently, this tokenizer does not do line splitting. It assumes that the input file is delimited by the system line separator. The output will be equivalently delimited.
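The line-oriented contract described above (one input line in, one output line out, delimited by the system line separator) can be sketched as follows. This is not the implementation of main: the class LineOrientedTokenizeSketch, its methods, and the whitespace tokenizer are hypothetical stand-ins, and joining tokens with single spaces is an assumption about the output format. A real caller would pass a function backed by a FrenchTokenizer in place of whitespaceTokenize.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of line-preserving tokenization; not part of edu.stanford.nlp.
final class LineOrientedTokenizeSketch {

    // Tokenizes each input line independently and writes exactly one output
    // line per input line, terminated by the system line separator.
    static void tokenizeLines(Reader in, Writer out,
                              Function<String, List<String>> tokenizeLine) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String nl = System.lineSeparator();
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(String.join(" ", tokenizeLine.apply(line)));
            out.write(nl);
        }
        out.flush();
    }

    // Stand-in tokenizer (assumption): splits on runs of whitespace only.
    static List<String> whitespaceTokenize(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Convenience wrapper for in-memory text.
    static String tokenizeText(String input) {
        StringWriter out = new StringWriter();
        try {
            tokenizeLines(new StringReader(input), out,
                          LineOrientedTokenizeSketch::whitespaceTokenize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(tokenizeText("Bonjour  le monde\nSalut\n"));
    }
}
```

Because no line splitting is performed, text that needs sentence-level lines would have to be split before it reaches this loop.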

Parameters:
args - command-line arguments


Stanford NLP Group