Class Summary |
AbstractListProcessor<IN,OUT,L,F> |
Class AbstractListProcessor |
AbstractTokenizer<T> |
An abstract tokenizer. |
Americanize |
Takes a HasWord or String and returns an Americanized version of it. |
CoreLabelTokenFactory |
Constructs CoreLabels from Strings, optionally with
beginning and ending (character after the end) offset positions in
an original text. |
DocumentPreprocessor |
Fully customizable preprocessor for XML, HTML, and PLAIN text documents. |
Morphology |
Morphology computes the base form of English words, by removing just
inflections (not derivational morphology). |
PTBEscapingProcessor<IN extends HasWord,L,F> |
Produces a new Document of Words in which special characters of the PTB
have been properly escaped. |
PTBTokenizer<T extends HasWord> |
Tokenizer implementation that conforms to the Penn Treebank tokenization
conventions. |
PTBTokenizer.PTBTokenizerFactory<T extends HasWord> |
|
StripTagsProcessor<L,F> |
A Processor whose process method deletes all
SGML/XML/HTML tags (tokens starting with < and ending
with >). |
TokenizerAdapter |
This class adapts between a java.io.StreamTokenizer
and an edu.stanford.nlp.process.Tokenizer. |
TransformXML<T> |
Reads XML from an input file or stream and writes XML to an output
file or stream, while transforming text appearing inside specified
XML tags by applying a specified Function. |
TransformXML.SAXInterface<T> |
|
WhitespaceTokenizer |
A WhitespaceTokenizer is a tokenizer that splits on and discards only
whitespace characters. |
WhitespaceTokenizer.WhitespaceTokenizerFactory |
A factory which vends WhitespaceTokenizers. |
WordShapeClassifier |
Provides static methods which
map any String to another String indicative of its "word shape" -- e.g.,
whether capitalized, numeric, etc. |
WordTokenFactory |
Constructs a Word from a String. |
WordToSentenceProcessor<IN> |
Transforms a Document of Words into a Document of Sentences by grouping the
Words. |
WordToTaggedWordProcessor<IN extends HasWord,L,F> |
Transforms a Document of Words into a Document consisting wholly or partly of
TaggedWords by breaking words on a tag divider character. |
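
The following is a minimal plain-Java sketch of three behaviours summarized above: splitting on and discarding whitespace (WhitespaceTokenizer), breaking a word on a tag divider character (WordToTaggedWordProcessor), and mapping a String to a "word shape" (WordShapeClassifier). It does not use or reproduce the classes in this package. The class name ProcessSketch, its method names, the choice to split at the last divider, and the X/x/d shape alphabet are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names belong to edu.stanford.nlp.process.
class ProcessSketch {

    // Split on runs of whitespace and discard the whitespace itself.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {   // an all-whitespace input yields one empty piece
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Break a token such as "dog/NN" into {word, tag}.
    // Assumption: split at the LAST divider; a token with no divider,
    // or with nothing before the divider, stays untagged (tag is null).
    static String[] splitTag(String token, char divider) {
        int i = token.lastIndexOf(divider);
        if (i <= 0) {
            return new String[] { token, null };
        }
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    // Crude word shape (assumed alphabet): uppercase -> X, lowercase -> x,
    // digit -> d, any other character is copied unchanged.
    static String shape(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('X');
            } else if (Character.isLowerCase(c)) {
                sb.append('x');
            } else if (Character.isDigit(c)) {
                sb.append('d');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print word, tag, and shape for each whitespace-separated token.
        for (String token : whitespaceTokens("The/DT  dog/NN barked/VBD in 1999")) {
            String[] wt = splitTag(token, '/');
            System.out.println(wt[0] + "\t" + wt[1] + "\t" + shape(wt[0]));
        }
    }
}
```

Splitting at the last divider is one plausible choice because it keeps words that themselves contain the divider (for example "1/2/CD") intact; the actual processor may behave differently, so consult the class documentation before relying on this.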