Package edu.stanford.nlp.process

Contains classes for processing documents.

Interface Summary
Feature This provides an interface for a feature that can be used to define a partition over the space of possible unseen words.
FeatureValue This defines an interface for the set of possible values that a Feature can assume.
Function  
ListProcessor User: Teg Grenager Date: Mar 10, 2004 Time: 5:31:05 PM
Processor Top-level interface for transforming Documents.
Tokenizer Tokenizers break up text into individual Objects.
 

Class Summary
AbstractTokenizer Abstract tokenizer.
Americanize Takes a HasWord or String and returns an Americanized version of it (British spellings replaced by American ones).
CapitalFeature Provides a partition over the set of possible unseen words that corresponds to the capitalization of characters in the word.
DummyTokenizer A Tokenizer that splits only on white space (spaces, tabs, and carriage returns).
FunctionProcessor Processor that takes a Function and applies it to every element in the input Document.
LexerTokenizer An implementation of Tokenizer designed to work with Lexer implementing classes.
LowercaseProcessor Processor whose process method converts a collection of mixed-case Words to a collection of lowercase Words.
NumAndCapFeature Provides a partition over the set of possible unseen words that corresponds to the capitalization and inclusion of numbers in the word.
NumberFeature Provides a partition over the set of possible unseen words that corresponds to the formatting of numbers in the word.
NumberProcessor Processor whose process method converts numbers to the word "*NUMBER*".
PTBEscapingProcessor Produces a new Document of Words in which special characters of the PTB have been properly escaped.
PTBTokenizer Tokenizer implementation that conforms to the Penn Treebank tokenization conventions.
SentenceRetokenizingProcessor Transforms a Document of Words into a Document of Sentences by grouping the Words.
SentenceToWordProcessor Transforms a Document of Sentences to a Document of Words by flattening out the Sentences.
SimpleTokenizer Simple Tokenizer implementation that wraps a StreamTokenizer.
Stemmer Stemmer implementing the Porter Stemming Algorithm. The Stemmer class transforms a word into its root form.
StopList Simple stoplist class.
StoplistFilter Filter which removes stop-listed words.
StripTagsProcessor A Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >).
TransformedFilter Filter that first transforms input before filtering it.
TreeToSentenceFunction Function that turns a Tree into its Sentence yield.
WordExtractor Pulls the word String from a Word.
WordToSentenceProcessor Transforms a Document of Words into a Document of Sentences by grouping the Words.
WordToTaggedWordProcessor Transforms a Document of Words into a Document consisting wholly or partly of TaggedWords by breaking words on a tag divider character.
 

Package edu.stanford.nlp.process Description

Contains classes for processing documents. The key interface is Processor, whose sole method, Document process(Document), takes a document and returns a new, processed document (for example one that has been parsed, stoplisted, or stemmed).
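
The document-in, document-out contract can be illustrated with a minimal, self-contained sketch. It does not use the real classes of this package: SimpleProcessor, LOWERCASE, NUMBERS, and chain are hypothetical names, and a List<String> stands in for Document. The digits-only rule for what counts as a number is likewise an assumption, not the documented behavior of NumberProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ProcessorSketch {

    /** Hypothetical stand-in for Processor; List<String> plays the role of Document. */
    public interface SimpleProcessor {
        List<String> process(List<String> in);
    }

    /** Lowercases every token, in the spirit of LowercaseProcessor. */
    public static final SimpleProcessor LOWERCASE = in -> {
        List<String> out = new ArrayList<>();
        for (String w : in) {
            out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    /** Replaces digits-only tokens with "*NUMBER*" (assumed rule), in the spirit of NumberProcessor. */
    public static final SimpleProcessor NUMBERS = in -> {
        List<String> out = new ArrayList<>();
        for (String w : in) {
            out.add(w.matches("[0-9]+") ? "*NUMBER*" : w);
        }
        return out;
    };

    /** Composes processors left to right; each one receives the previous one's output. */
    public static SimpleProcessor chain(SimpleProcessor... steps) {
        return in -> {
            List<String> doc = in;
            for (SimpleProcessor step : steps) {
                doc = step.process(doc);
            }
            return doc;
        };
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("The", "Year", "2002");
        // prints [the, year, *NUMBER*]
        System.out.println(chain(LOWERCASE, NUMBERS).process(doc));
    }
}
```

Because every step returns a new list rather than modifying its input, steps can be reordered or combined freely, which is the property the single process method is meant to provide.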


Sepandar David Kamvar
Last modified: Thu Oct 31 11:14:34 PST 2002



Stanford NLP Group