Interface Summary | |
Feature | This provides an interface for a feature that can be used to define a partition over the space of possible unseen words. |
FeatureValue | This defines an interface for the set of possible values that a Feature can assume. |
Function | |
ListProcessor | User: Teg Grenager Date: Mar 10, 2004 Time: 5:31:05 PM |
Processor | Top-level interface for transforming Documents. |
Tokenizer | Tokenizers break up text into individual Objects. |
Class Summary | |
AbstractTokenizer | Abstract tokenizer. |
Americanize | Takes a HasWord or String and returns an Americanized version of it. |
CapitalFeature | Provides a partition over the set of possible unseen words that corresponds to the capitalization of characters in the word. |
DummyTokenizer | A Tokenizer that splits only on white space (spaces, tabs, and carriage returns). |
FunctionProcessor | Processor that takes a Function and applies it to every element in the input Document. |
LexerTokenizer | An implementation of Tokenizer designed to work with Lexer implementing classes. |
LowercaseProcessor | Processor whose process method converts a collection of mixed-case Words to a collection of lowercase Words. |
NumAndCapFeature | Provides a partition over the set of possible unseen words that corresponds to the capitalization and inclusion of numbers in the word. |
NumberFeature | Provides a partition over the set of possible unseen words that corresponds to the formatting of numbers in the word. |
NumberProcessor | Processor whose process method converts numbers to the word "*NUMBER*". |
PTBEscapingProcessor | Produces a new Document of Words in which special characters of the PTB have been properly escaped. |
PTBTokenizer | Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. |
SentenceRetokenizingProcessor | Transforms a Document of Words into a Document of Sentences by grouping the Words. |
SentenceToWordProcessor | Transforms a Document of Sentences to a Document of Words by flattening out the Sentences. |
SimpleTokenizer | Simple Tokenizer implementation that wraps a StreamTokenizer. |
Stemmer | Stemmer implementing the Porter Stemming Algorithm. The Stemmer class transforms a word into its root form. |
StopList | Simple stoplist class. |
StoplistFilter | Filter which removes stop-listed words. |
StripTagsProcessor | A Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >). |
TransformedFilter | Filter that first transforms input before filtering it. |
TreeToSentenceFunction | Function that turns a Tree into its Sentence yield. |
WordExtractor | Pulls the word String from a Word. |
WordToSentenceProcessor | Transforms a Document of Words into a Document of Sentences by grouping the Words. |
WordToTaggedWordProcessor | Transforms a Document of Words into a Document consisting wholly or partly of TaggedWords by breaking words on a tag divider character. |
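The Feature classes in the tables above are described as defining a partition over unseen words, meaning every word falls into exactly one feature value. The following is a minimal sketch of that idea for capitalization only. The class name CapitalizationSketch, the Shape enum, and its five values are hypothetical and chosen for illustration; the value set that CapitalFeature actually uses may differ.

```java
public class CapitalizationSketch {

    /** Hypothetical value set; not taken from the package itself. */
    public enum Shape { ALL_CAPS, INIT_CAP, LOWER, MIXED, NO_LETTERS }

    /** Maps every word to exactly one Shape, so the values partition the word space. */
    public static Shape shapeOf(String word) {
        int upper = 0;
        int lower = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isUpperCase(c)) {
                upper++;
            } else if (Character.isLowerCase(c)) {
                lower++;
            }
        }
        if (upper == 0 && lower == 0) {
            return Shape.NO_LETTERS;   // digits, punctuation, or the empty string
        }
        if (lower == 0) {
            return Shape.ALL_CAPS;
        }
        if (upper == 0) {
            return Shape.LOWER;
        }
        if (upper == 1 && Character.isUpperCase(word.charAt(0))) {
            return Shape.INIT_CAP;     // exactly one capital, in first position
        }
        return Shape.MIXED;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"NASA", "Stanford", "parser", "iPhone", "2004"}) {
            System.out.println(w + " -> " + shapeOf(w));
        }
    }
}
```

Because the checks are ordered and the last branch is a catch-all, no word can receive two values or none, which is the property the word "partition" requires.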
Contains classes for processing documents. The central abstraction is the Processor
interface, whose single method, Document process(Document), takes a document and
returns another document, which may have been parsed, stoplisted, stemmed, etc.
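Because a Processor maps a document to a document, processors can be chained, each consuming the output of the one before. The sketch below illustrates that shape without depending on the package itself. It assumes a List<String> as a stand-in for Document, and the nested Processor interface, the LOWERCASE and NUMBERS constants, and the runAll helper are all hypothetical names, loosely modeled on the LowercaseProcessor and NumberProcessor descriptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ProcessorSketch {

    /** Hypothetical stand-in for Processor; a List<String> plays the role of Document. */
    public interface Processor {
        List<String> process(List<String> in);
    }

    /** Lowercases every token, in the spirit of LowercaseProcessor. */
    public static final Processor LOWERCASE = in -> {
        List<String> out = new ArrayList<>();
        for (String w : in) {
            out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    /** Replaces all-digit tokens with "*NUMBER*", in the spirit of NumberProcessor. */
    public static final Processor NUMBERS = in -> {
        List<String> out = new ArrayList<>();
        for (String w : in) {
            out.add(w.matches("\\d+") ? "*NUMBER*" : w);
        }
        return out;
    };

    /** Applies the processors in order; each receives the previous one's output. */
    public static List<String> runAll(List<String> doc, Processor... steps) {
        List<String> current = doc;
        for (Processor p : steps) {
            current = p.process(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("The", "Year", "2004");
        System.out.println(runAll(doc, LOWERCASE, NUMBERS)); // prints [the, year, *NUMBER*]
    }
}
```

Each step returns a new list rather than modifying its input, which matches the description of process as returning another document.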