The statistical tokenizer implementation.
It produces a UD-compliant tokenization given a pre-trained UD model file
and a file containing multi-word tokens that must be handled with rules after tokenization.
It reads raw text from CoreAnnotations.TextAnnotation
and returns sentences and tokens as a list of lists:
each inner List contains the token annotations (CoreLabel objects) for a single sentence.
public StatTokSent(java.lang.String modelFile, java.lang.String multiWordRulesFile)
Constructor for the StatTokSent object.
modelFile: a string containing the path to the model file;
multiWordRulesFile: a string containing the path to the file listing multi-word tokens.
public StatTokSent(java.lang.String modelFile)
public java.util.List<java.util.List<CoreLabel>> tokenize(java.lang.String text)
The core tokenization function, called by the statistical tokenizer annotator. Given a text as input,
it returns a List<List<CoreLabel>> corresponding to the sentences and the tokens within each sentence.
First, features are extracted for each character of the text, and the model is used to classify each character as:
S: beginning of a sentence;
T: beginning of a token;
C: beginning of a part in a multi-word token;
I: inside a token;
O: outside any token.
Given this classification, consecutive characters are joined into tokens.
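The joining step can be sketched as follows. This is a simplified, self-contained illustration (the class name `LabelJoiner` and plain `String` tokens are assumptions for the example; the actual method works on model output and produces CoreLabel objects):

```java
import java.util.ArrayList;
import java.util.List;

public class LabelJoiner {
    /**
     * Joins characters into sentences of tokens given per-character labels.
     * labels.charAt(i) is the S/T/C/I/O label assigned to text.charAt(i).
     */
    public static List<List<String>> join(String text, String labels) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char label = labels.charAt(i);
            if (label == 'S' || label == 'T' || label == 'C') {
                // A new token (or token part) starts: flush the current one.
                if (token.length() > 0) {
                    sentence.add(token.toString());
                    token.setLength(0);
                }
                // 'S' also closes the current sentence.
                if (label == 'S' && !sentence.isEmpty()) {
                    sentences.add(sentence);
                    sentence = new ArrayList<>();
                }
                token.append(text.charAt(i));
            } else if (label == 'I') {
                token.append(text.charAt(i));  // continue the current token
            }
            // 'O': character is outside any token (e.g. whitespace), skipped
        }
        if (token.length() > 0) sentence.add(token.toString());
        if (!sentence.isEmpty()) sentences.add(sentence);
        return sentences;
    }
}
```

For example, the text "Hi yes. No." with labels "SIOTIITOSIT" yields two sentences, ["Hi", "yes", "."] and ["No", "."].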
For each sentence, tokens are then further analyzed via rules to identify possibly misclassified multi-word tokens, and indexed.
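The rule-based multi-word step could look roughly like the following sketch. The rule table and its entries (Italian articulated prepositions such as "nel" = "in" + "il") are illustrative assumptions; the real rules are read from the multi-word tokens file passed to the constructor:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiWordRules {
    // Hypothetical rule table: surface form -> component parts.
    // In practice these rules come from the multiWordRulesFile.
    private static final Map<String, List<String>> RULES = new HashMap<>();
    static {
        RULES.put("nel", List.of("in", "il"));
        RULES.put("della", List.of("di", "la"));
    }

    /** Re-splits tokens the classifier may have missed, preserving order. */
    public static List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            List<String> parts = RULES.get(t.toLowerCase());
            if (parts != null) {
                out.addAll(parts);  // replace the multi-word token with its parts
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

For instance, ["vado", "nel", "bosco"] would be re-split into ["vado", "in", "il", "bosco"]; indexing then runs over the expanded token list.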
public static void main(java.lang.String[] args)