The statistical tokenizer implementation.
It produces a UD-compliant tokenization given a pre-trained UD model file
and a file containing multi-word tokens that must be handled with rules after tokenization.
It reads raw text from CoreAnnotations.TextAnnotation
and returns sentences and tokens as a list of lists:
each inner List contains the token annotations (CoreLabel objects) for a single sentence.
public StatTokSent(java.lang.String modelFile, java.lang.String multiWordRulesFile)
Constructor for the StatTokSent object.
modelFile: a string containing the path to the model file;
multiWordRulesFile: a string containing the path to the file listing multi-word tokens.
public StatTokSent(java.lang.String modelFile)
public java.util.List<java.util.List<CoreLabel>> tokenize(java.lang.String text)
The core tokenization function, called by the statistical tokenizer annotator. Given a text as input,
it returns a List<List<CoreLabel>> corresponding to the sentences and the tokens within each sentence.
First, features are extracted for each character of the text, and the model is used to classify each character as:
S: beginning of a sentence;
T: beginning of a token;
C: beginning of a part in a multi-word token;
I: inside a token;
O: outside any token.
Given this classification, consecutive characters are joined into tokens.
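The joining step can be sketched as follows. This is a simplified, self-contained illustration (the class name `LabelJoiner` and plain `String` tokens are assumptions for the example; the actual method works on model output and produces CoreLabel objects):

```java
import java.util.ArrayList;
import java.util.List;

public class LabelJoiner {
    /**
     * Joins characters into sentences of tokens given per-character labels.
     * labels.charAt(i) is the S/T/C/I/O label assigned to text.charAt(i).
     */
    public static List<List<String>> join(String text, String labels) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char label = labels.charAt(i);
            if (label == 'S' || label == 'T' || label == 'C') {
                // A new token (or token part) starts: flush the current one.
                if (token.length() > 0) {
                    sentence.add(token.toString());
                    token.setLength(0);
                }
                // 'S' also closes the current sentence.
                if (label == 'S' && !sentence.isEmpty()) {
                    sentences.add(sentence);
                    sentence = new ArrayList<>();
                }
                token.append(text.charAt(i));
            } else if (label == 'I') {
                token.append(text.charAt(i));  // continue the current token
            }
            // 'O': character is outside any token (e.g. whitespace), skipped
        }
        if (token.length() > 0) sentence.add(token.toString());
        if (!sentence.isEmpty()) sentences.add(sentence);
        return sentences;
    }
}
```

For example, the text "Hi yes. No." with labels "SIOTIITOSIT" yields two sentences, ["Hi", "yes", "."] and ["No", "."].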
For each sentence, tokens are then further analyzed via rules to identify possibly misclassified multi-word tokens, and indexed.
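The rule-based multi-word step could look roughly like the following sketch. The rule table and its entries (Italian articulated prepositions such as "nel" = "in" + "il") are illustrative assumptions; the real rules are read from the multi-word tokens file passed to the constructor:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiWordRules {
    // Hypothetical rule table: surface form -> component parts.
    // In practice these rules come from the multiWordRulesFile.
    private static final Map<String, List<String>> RULES = new HashMap<>();
    static {
        RULES.put("nel", List.of("in", "il"));
        RULES.put("della", List.of("di", "la"));
    }

    /** Re-splits tokens the classifier may have missed, preserving order. */
    public static List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            List<String> parts = RULES.get(t.toLowerCase());
            if (parts != null) {
                out.addAll(parts);  // replace the multi-word token with its parts
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

For instance, ["vado", "nel", "bosco"] would be re-split into ["vado", "in", "il", "bosco"]; indexing then runs over the expanded token list.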
public static void main(java.lang.String[] args)