Package edu.stanford.nlp.process

Contains classes for processing documents.


Interface Summary
Function<T1,T2> An interface for classes that act as a function transforming one object to another.
LexedTokenFactory Constructs a token (of arbitrary type) from a String and its position in the underlying text.
ListProcessor An interface for things that operate on a List.
Processor Top-level interface for transforming Documents.
Tokenizer<T> Tokenizers break up text into individual Objects.
 

Class Summary
AbstractListProcessor Class AbstractListProcessor
AbstractTokenizer<T> An abstract tokenizer.
Americanize Takes a HasWord or String and returns an Americanized version of it.
DocumentPreprocessor Fully customizable preprocessor for XML, HTML, and plain text documents.
PTBEscapingProcessor Produces a new Document of Words in which special characters of the PTB have been properly escaped.
PTBTokenizer Tokenizer implementation that conforms to the Penn Treebank tokenization conventions.
PTBTokenizer.PTBTokenizerFactory  
StripTagsProcessor A Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >).
WhitespaceTokenizer A WhitespaceTokenizer is a tokenizer that splits on and discards only whitespace characters.
WordTokenFactory Constructs a Word from a String.
WordToSentenceProcessor Transforms a Document of Words into a Document of Sentences by grouping the Words.
WordToTaggedWordProcessor Transforms a Document of Words into a Document consisting wholly or partly of TaggedWords by splitting words on a tag divider character.
 

Package edu.stanford.nlp.process Description

Contains classes for processing documents. The key interface here is Processor, whose sole method, Document process(Document), takes a document and returns a new, processed document. The returned document may be parsed, stoplisted, stemmed, etc.
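
The single-method shape of Processor makes it straightforward to chain processing steps, since the output of one step is the input of the next. The following is a minimal, self-contained sketch of that pattern and not the package's actual code: the Document and Processor types are simplified stand-ins declared locally, and LowercaseSketch, StripTagsSketch, and runAll are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ProcessorSketch {

    // Hypothetical stand-in for the package's Document type:
    // here it is nothing more than an ordered list of word strings.
    static final class Document {
        final List<String> words;

        Document(List<String> words) {
            this.words = new ArrayList<>(words); // defensive copy
        }
    }

    // Mirrors the single-method shape described above:
    // Document process(Document).
    interface Processor {
        Document process(Document in);
    }

    // Illustrative step: lowercases every word and returns a new Document.
    static final class LowercaseSketch implements Processor {
        public Document process(Document in) {
            List<String> out = new ArrayList<>();
            for (String w : in.words) {
                out.add(w.toLowerCase(Locale.ROOT));
            }
            return new Document(out);
        }
    }

    // Illustrative step: drops tokens that start with '<' and end with '>'.
    static final class StripTagsSketch implements Processor {
        public Document process(Document in) {
            List<String> out = new ArrayList<>();
            for (String w : in.words) {
                if (!(w.startsWith("<") && w.endsWith(">"))) {
                    out.add(w);
                }
            }
            return new Document(out);
        }
    }

    // Hypothetical helper: applies each step in order,
    // feeding the result of one step into the next.
    static Document runAll(Document doc, Processor... steps) {
        for (Processor p : steps) {
            doc = p.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document raw = new Document(Arrays.asList("<b>", "Hello", "</b>", "World"));
        Document done = runAll(raw, new StripTagsSketch(), new LowercaseSketch());
        System.out.println(done.words); // prints [hello, world]
    }
}
```

Each step in this sketch returns a fresh Document rather than modifying its argument, which keeps steps independent of one another and lets them be reordered freely.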


Sepandar David Kamvar
Last modified: Thu Oct 31 11:14:34 PST 2002