tokenize
public java.util.List<java.util.List<CoreLabel>> tokenize(java.lang.String text)
The core tokenization function called via the statistical tokenizer annotator. Given a text and window size as input,
returns a List<List<CoreLabel>> corresponding to sentences and tokens within each sentence.
First, features are extracted for each character of the text, and the model is used to classify each character as:
S: beginning of a sentence;
T: beginning of a token;
C: beginning of a part in a multi-word token;
I: inside of a token;
O: outside any token.
Given the classification, characters are joined into tokens.
For each sentence, tokens are further analyzed via rules to identify possibly misclassified multi-word tokens, and are then indexed.
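As an illustration of the joining step only, the following is a minimal sketch of how per-character labels could be grouped into sentences and tokens. It is not the library's implementation: the class LabelJoinSketch and its join method are hypothetical names, the labels are supplied by hand instead of coming from the model, plain strings stand in for CoreLabel, and a C label is simply treated as the start of a new token part. The rule-based multi-word token check and the indexing step are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the documented API.
public class LabelJoinSketch {

    // Groups the characters of text into sentences and tokens, given one
    // label per character (S, T, C, I or O as described above).
    public static List<List<String>> join(String text, String labels) {
        if (text.length() != labels.length()) {
            throw new IllegalArgumentException("one label per character is required");
        }
        List<List<String>> sentences = new ArrayList<>();
        List<String> sentence = null;
        StringBuilder token = new StringBuilder();

        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (labels.charAt(i)) {
                case 'S': // close the pending token, then open a new sentence
                    flush(token, sentence);
                    sentence = new ArrayList<>();
                    sentences.add(sentence);
                    token.append(ch);
                    break;
                case 'T': // start of a token
                case 'C': // assumption: a multi-word part starts a new piece
                    flush(token, sentence);
                    if (sentence == null) {
                        sentence = new ArrayList<>();
                        sentences.add(sentence);
                    }
                    token.append(ch);
                    break;
                case 'I': // continue the current token
                    if (sentence == null) {
                        sentence = new ArrayList<>();
                        sentences.add(sentence);
                    }
                    token.append(ch);
                    break;
                case 'O': // outside any token: close the pending token
                    flush(token, sentence);
                    break;
                default:
                    throw new IllegalArgumentException("unknown label at index " + i);
            }
        }
        flush(token, sentence);
        return sentences;
    }

    // Adds the pending token to the sentence, if there is one, and resets the buffer.
    // The sentence is always non-null when the buffer is non-empty.
    private static void flush(StringBuilder token, List<String> sentence) {
        if (token.length() > 0) {
            sentence.add(token.toString());
            token.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Hand-written labels for "Hi all. Bye"; a real model would predict them.
        System.out.println(join("Hi all. Bye", "SIOTIITOSII"));
        // prints [[Hi, all, .], [Bye]]
    }
}
```

Keeping a single token buffer that is flushed whenever a label signals a boundary means the text is traversed once, and characters labeled O never reach a token.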