edu.stanford.nlp.process (Stanford JavaNLP API)

Interface Summary
Interface	Description
CoreTokenFactory<IN extends CoreMap>	To make tokens like CoreMap or CoreLabel.
DocumentProcessor<IN,OUT,L,F>	Top-level interface for transforming Documents.
LexedTokenFactory<T>	Constructs a token (of arbitrary type) from a String and its position in the underlying text.
ListProcessor<IN,OUT>	An interface for things that operate on a List.
SerializableFunction<T1,T2>	This interface is a conjunction of Function and Serializable, which is a bad idea from the perspective of the type system, but one that seems more palatable than other bad ideas until Java's type system is flexible enough to support type conjunctions.
Tokenizer<T>	Tokenizers break up text into individual Objects.
TokenizerFactory<T>	A TokenizerFactory is a factory that can build a Tokenizer (an extension of Iterator) from a java.io.Reader.
TSVSentenceProcessor	An interface for running an action (a callback function) on each line of a TSV file representing a collection of sentences in a corpus.
WordSegmenter	An interface for segmenting strings into words (in languages written without word delimiters).
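
The Tokenizer and TokenizerFactory entries above describe a tokenizer as an Iterator that a factory builds from a java.io.Reader. The following is a minimal sketch of that pattern only. `SimpleTokenizer`, `SimpleTokenizerFactory`, and `WhitespaceSplitter` are hypothetical stand-ins written for illustration, not the types in this package, and the real interfaces may differ in their methods.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch; these are NOT the edu.stanford.nlp.process types. */
final class TokenizerSketch {

    /** Stand-in for a tokenizer: an Iterator over tokens of type T. */
    interface SimpleTokenizer<T> extends Iterator<T> {
        /** Drains the remaining tokens into a list. */
        default List<T> tokenize() {
            List<T> out = new ArrayList<>();
            while (hasNext()) {
                out.add(next());
            }
            return out;
        }
    }

    /** Stand-in for a tokenizer factory: builds a tokenizer from a Reader. */
    interface SimpleTokenizerFactory<T> {
        SimpleTokenizer<T> getTokenizer(Reader reader);
    }

    /** Splits on whitespace and discards it, reading lazily from the Reader. */
    static final class WhitespaceSplitter implements SimpleTokenizer<String> {
        private final Reader reader;
        private String nextToken; // null once the input is exhausted

        WhitespaceSplitter(Reader reader) {
            this.reader = reader;
            this.nextToken = readToken();
        }

        /** Reads one whitespace-delimited token, or returns null at end of input. */
        private String readToken() {
            try {
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = reader.read()) != -1) {
                    if (Character.isWhitespace(c)) {
                        if (sb.length() > 0) {
                            break; // whitespace ends the current token
                        }
                    } else {
                        sb.append((char) c);
                    }
                }
                return sb.length() > 0 ? sb.toString() : null;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public boolean hasNext() {
            return nextToken != null;
        }

        @Override
        public String next() {
            if (nextToken == null) {
                throw new NoSuchElementException();
            }
            String token = nextToken;
            nextToken = readToken();
            return token;
        }
    }

    public static void main(String[] args) {
        SimpleTokenizerFactory<String> factory = WhitespaceSplitter::new;
        List<String> tokens =
            factory.getTokenizer(new StringReader("  The quick\tbrown fox\n")).tokenize();
        System.out.println(tokens); // prints [The, quick, brown, fox]
    }
}
```

Keeping the factory separate from the tokenizer means one configured factory can produce a fresh tokenizer for each Reader, which appears to be the motivation for the factory interface listed above.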

Class Summary
Class	Description
AbstractListProcessor<IN,OUT,L,F>	Class AbstractListProcessor
AbstractTokenizer<T>	An abstract tokenizer.
Americanize	Takes a HasWord or String and returns an Americanized version of it.
AmericanizeFunction
ChineseDocumentToSentenceProcessor	Convert a Chinese Document into a List of sentence Strings.
CodepointCoreLabelProcessor	Processor to add codepoint counts to tokens. In general this will be the same as the character offsets, but certain fancy characters such as 𝒚̂𝒊 will change that.
CoreLabelProcessor	Abstract class for processing a `List<CoreLabel>`.
CoreLabelTokenFactory	Constructs `CoreLabel`s from Strings optionally with beginning and ending (character after the end) offset positions in an original text.
DistSimClassifier	Maps a String to its distributional similarity class.
DocumentPreprocessor	Produces a list of sentences from either a plain text or XML document.
LexerTokenizer	An implementation of `Tokenizer` designed to work with `Lexer` implementing classes.
LexerUtils	This class contains various static utility methods invoked by our JFlex NL lexers.
LowercaseAndAmericanizeFunction
LowercaseFunction
Morphology	Morphology computes the base form of English words, by removing just inflections (not derivational morphology).
ProcessMorphologyRequest
PTBEscapingProcessor<IN extends HasWord,L,F>	Produces a new Document of Words in which special characters of the PTB have been properly escaped.
PTBTokenizer<T extends HasWord>	A fast, rule-based tokenizer implementation, which produces Penn Treebank style tokenization of English text.
PTBTokenizer.PTBTokenizerFactory<T extends HasWord>	This class provides a factory which will vend instances of PTBTokenizer which wrap a provided Reader.
Stemmer	Stemmer, implementing the Porter Stemming Algorithm. The Stemmer class transforms a word into its root form.
StopList	Simple stoplist class.
StoplistFilter<L,F>	Filter which removes stop-listed words.
StripTagsProcessor<L,F>	A `Processor` whose `process` method deletes all SGML/XML/HTML tags (tokens starting with `<` and ending with `>`).
TokenizerAdapter	This class adapts between a `java.io.StreamTokenizer` and an `edu.stanford.nlp.process.Tokenizer`.
TransformXML<T>	Reads XML from an input file or stream and writes XML to an output file or stream, while transforming text appearing inside specified XML tags by applying a specified `Function`.
TransformXML.NoEscapingSAXInterface<T>	This version of the SAXInterface doesn't escape the text produced by the function.
TransformXML.SAXInterface<T>
TSVSentenceIterator	Reads sentences from a TSV, provided a list of fields to populate.
TSVUtils	A set of utilities for parsing TSV files into CoreMaps.
WhitespaceTokenizer<T extends HasWord>	A WhitespaceTokenizer is a tokenizer that splits on and discards only whitespace characters.
WhitespaceTokenizer.WhitespaceTokenizerFactory<T extends HasWord>	A factory which vends WhitespaceTokenizers.
WordSegmentingTokenizer	A tokenizer that works by calling a WordSegmenter.
WordShapeClassifier	Provides static methods which map any String to another String indicative of its "word shape" -- e.g., whether capitalized, numeric, etc.
WordTokenFactory	Constructs a Word from a String.
WordToSentenceProcessor<IN>	Transforms a List of words into a List of Lists of words (that is, a List of sentences), by grouping the words.
WordToTaggedWordProcessor<IN extends HasWord,L,F>	Transforms a Document of Words into a document all or partly of TaggedWords by breaking words on a tag divider character.
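
The CodepointCoreLabelProcessor entry above notes that codepoint counts usually match character offsets but diverge for some characters. The sketch below illustrates why, using only the standard library: a character outside the Basic Multilingual Plane occupies two Java `char` values but counts as one codepoint. `CodepointOffsetSketch` and `toCodepointOffset` are hypothetical names chosen for this example, and the code does not show how the class in this package computes its counts.

```java
/** Hypothetical sketch of char offsets versus codepoint offsets in a Java String. */
final class CodepointOffsetSketch {

    /** Converts a UTF-16 char offset into a codepoint offset within text. */
    static int toCodepointOffset(String text, int charOffset) {
        return text.codePointCount(0, charOffset);
    }

    public static void main(String[] args) {
        // U+1D49A is a supplementary-plane character: one codepoint, two chars.
        String fancy = new String(Character.toChars(0x1D49A));
        String text = "a " + fancy + " b";

        int charOffsetOfB = text.indexOf('b');
        System.out.println(text.length());                           // prints 6
        System.out.println(text.codePointCount(0, text.length()));   // prints 5
        System.out.println(charOffsetOfB);                           // prints 5
        System.out.println(toCodepointOffset(text, charOffsetOfB));  // prints 4
    }
}
```

Every token that follows such a character has a codepoint offset smaller than its character offset, so the two sets of offsets cannot be used interchangeably.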

Enum Summary
Enum	Description
DocumentPreprocessor.DocType
LexerUtils.DashesEnum
LexerUtils.EllipsesEnum
LexerUtils.QuotesEnum
TSVSentenceIterator.SentenceField	A list of possible fields in the sentence table.
TSVSentenceProcessor.SentenceField	A list of possible fields in the sentence table.
WordToSentenceProcessor.NewlineIsSentenceBreak

Package edu.stanford.nlp.process Description

Contains classes for processing documents. The key interface is Processor, whose sole method, Document process(Document), takes a document and returns a processed document, which may have been parsed, stoplisted, stemmed, etc.
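
The single-method design described above makes processing steps easy to chain, since each step consumes a document and returns a new one. The following is a rough sketch of that idea under simplifying assumptions: a document is modeled as a plain `List<String>`, and `DocProcessor`, `then`, `lowercase`, and `stoplist` are hypothetical names, not members of this package.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch of a document-in, document-out processor chain. */
final class ProcessorSketch {

    /** Stand-in for the Processor idea: one method from document to document. */
    interface DocProcessor {
        List<String> process(List<String> doc);

        /** Returns a processor that runs this step and then the next one. */
        default DocProcessor then(DocProcessor next) {
            return doc -> next.process(process(doc));
        }
    }

    /** A step that lowercases every word. */
    static DocProcessor lowercase() {
        return doc -> doc.stream()
            .map(w -> w.toLowerCase(Locale.ROOT))
            .collect(Collectors.toList());
    }

    /** A step that drops every word found in the stop set. */
    static DocProcessor stoplist(Set<String> stopWords) {
        return doc -> doc.stream()
            .filter(w -> !stopWords.contains(w))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "on"));
        DocProcessor pipeline = lowercase().then(stoplist(stopWords));

        List<String> doc = Arrays.asList("The", "Cat", "sat", "on", "THE", "mat");
        System.out.println(pipeline.process(doc)); // prints [cat, sat, mat]
    }
}
```

Order matters in such a chain: lowercasing runs first here so that the stop set only needs lowercase entries.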

Sepandar David Kamvar

Last modified: Thu Oct 31 11:14:34 PST 2002