|CoreTokenFactory<IN extends CoreMap>||
A factory for making tokens such as CoreMap or CoreLabel.
Top-level interface for transforming Documents.
Constructs a token (of arbitrary type) from a String and its position in the underlying text.
An interface for things that operate on a List.
This interface is a conjunction of Function and Serializable. That is a bad idea from the perspective of the type system, but it seems more palatable than the other bad ideas available until Java's type system is flexible enough to support type conjunctions.
Tokenizers break up text into individual Objects.
A TokenizerFactory is used to convert a java.io.Reader into a Tokenizer (an extension of Iterator) over objects of type T represented by the text in the java.io.Reader.
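The Reader-to-Iterator contract described above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: `SketchTokenizerFactory` and `SketchWhitespaceFactory` are hypothetical names, the tokens are plain Strings, and splitting on whitespace is an assumption made to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a tokenizer factory: turns a Reader into an Iterator of tokens. */
interface SketchTokenizerFactory<T> {
    Iterator<T> getTokenizer(Reader reader);
}

/** Illustrative factory whose iterators yield whitespace-delimited Strings from the Reader. */
final class SketchWhitespaceFactory implements SketchTokenizerFactory<String> {

    @Override
    public Iterator<String> getTokenizer(Reader reader) {
        return new Iterator<String>() {
            // The first token is read eagerly; later tokens are read one at a time on demand.
            private String next = advance();

            /** Reads characters until one complete token is available, or returns null at end of input. */
            private String advance() {
                try {
                    StringBuilder sb = new StringBuilder();
                    int c;
                    while ((c = reader.read()) != -1) {
                        if (Character.isWhitespace(c)) {
                            if (sb.length() > 0) {
                                return sb.toString();
                            }
                        } else {
                            sb.append((char) c);
                        }
                    }
                    return sb.length() > 0 ? sb.toString() : null;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public String next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String result = next;
                next = advance();
                return result;
            }
        };
    }

    /** Convenience helper: drains the iterator into a List. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> it = new SketchWhitespaceFactory().getTokenizer(new StringReader(text));
        while (it.hasNext()) {
            tokens.add(it.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Tokenizers break up text"));
    }
}
```

Keeping the factory separate from the iterator means one configured factory can serve many Readers, each with its own independent token stream.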
An interface for segmenting strings into words (for languages written without explicit word boundaries).
An abstract tokenizer.
Takes a HasWord or String and returns an Americanized version of it.
Convert a Chinese Document into a List of sentence Strings.
Maps a String to its distributional similarity class.
Produces a list of sentences from either a plain text or XML document.
Morphology computes the base form of English words, by removing just inflections (not derivational morphology).
|PTBEscapingProcessor<IN extends HasWord,L,F>||
Produces a new Document of Words in which special characters of the PTB have been properly escaped.
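As a rough illustration of what escaping means here, the sketch below replaces bracket tokens with their Penn Treebank symbols (`-LRB-`, `-RRB-`, and so on). It covers only the bracket subset of the convention; the class name `PtbEscapeSketch` and the use of plain Strings in place of Word objects are assumptions made for the example, not the processor's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: escapes bracket tokens using Penn Treebank bracket symbols. */
final class PtbEscapeSketch {

    // Bracket subset of the PTB escaping convention; other escapes are deliberately left out.
    private static final Map<String, String> ESCAPES = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    /** Returns a new list in which each bracket token is replaced; the input list is not modified. */
    static List<String> escape(List<String> words) {
        List<String> out = new ArrayList<>(words.size());
        for (String w : words) {
            out.add(ESCAPES.getOrDefault(w, w));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(escape(List.of("f", "(", "x", ")")));
    }
}
```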
|PTBTokenizer<T extends HasWord>||
A fast, rule-based tokenizer implementation, which produces Penn Treebank style tokenization of English text.
|PTBTokenizer.PTBTokenizerFactory<T extends HasWord>||
A factory that vends PTBTokenizer instances, each wrapping a provided Reader.
This class adapts between a
Reads XML from an input file or stream and writes XML to an output file or stream, while transforming text appearing inside specified XML tags by applying a specified
This version of the SAXInterface doesn't escape the text produced by the function.
|WhitespaceTokenizer<T extends HasWord>||
A WhitespaceTokenizer is a tokenizer that splits on and discards only whitespace characters.
|WhitespaceTokenizer.WhitespaceTokenizerFactory<T extends HasWord>||
A factory which vends WhitespaceTokenizers.
A tokenizer that works by calling a WordSegmenter.
Provides static methods which map any String to another String indicative of its "word shape" -- e.g., whether capitalized, numeric, etc.
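One simple notion of word shape can be sketched as a character-class mapping. The scheme below (uppercase to `X`, lowercase to `x`, digit to `d`, everything else kept) is an assumption chosen for illustration, and `WordShapeSketch` is a hypothetical name; the library's own shape functions may differ.

```java
/** Hypothetical sketch of a word-shape function based on character classes. */
final class WordShapeSketch {

    /** Maps uppercase letters to 'X', lowercase letters to 'x', digits to 'd'; other characters are kept. */
    static String shape(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('X');
            } else if (Character.isLowerCase(c)) {
                sb.append('x');
            } else if (Character.isDigit(c)) {
                sb.append('d');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("Java"));   // prints Xxxx
        System.out.println(shape("A-380"));  // prints X-ddd
    }
}
```

Shapes like these let a downstream model generalize across words it has never seen, since many distinct strings collapse to the same shape.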
Constructs a Word from a String.
Transforms a List of words into a List of Lists of words (that is, a List of sentences), by grouping the words.
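The grouping step can be sketched as follows, assuming a sentence ends at any token equal to `.`, `?`, or `!`. That boundary set, the class name `SentenceGroupingSketch`, and the use of Strings as words are simplifications for the example and are not taken from the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: groups a flat list of words into sentences at boundary tokens. */
final class SentenceGroupingSketch {

    // Assumed sentence-final tokens; a real splitter would handle many more cases.
    private static final Set<String> BOUNDARIES = Set.of(".", "?", "!");

    /** Returns a list of sentences; each boundary token closes the sentence that contains it. */
    static List<List<String>> group(List<String> words) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String w : words) {
            current.add(w);
            if (BOUNDARIES.contains(w)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing words that were not closed by a boundary token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("Hi", ".", "How", "are", "you", "?", "Fine")));
    }
}
```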
|WordToTaggedWordProcessor<IN extends HasWord,L,F>||
Transforms a Document of Words into a Document consisting wholly or partly of TaggedWords, by breaking each word on a tag divider character.
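Breaking a word on a tag divider can be sketched like this. The example assumes the split happens at the last occurrence of the divider (so `1/2/CD` yields the word `1/2` and the tag `CD`) and that a token with no usable divider is left untagged; both choices, and the name `TagDividerSketch`, are assumptions for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: splits "word/TAG" tokens on a divider character. */
final class TagDividerSketch {

    /**
     * Returns a two-element array {word, tag}. The tag is null when the divider
     * is absent or sits at the very start or end of the token.
     */
    static String[] split(String token, char divider) {
        int where = token.lastIndexOf(divider);
        if (where <= 0 || where == token.length() - 1) {
            return new String[] {token, null};
        }
        return new String[] {token.substring(0, where), token.substring(where + 1)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("dog/NN", '/')));
        System.out.println(Arrays.toString(split("dog", '/')));
    }
}
```

Returning a null tag for undivided tokens mirrors the "wholly or partly" wording: only tokens that carry a divider become tagged.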