edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  extended by edu.stanford.nlp.process.DocumentPreprocessor

public class DocumentPreprocessor
extends Object

Fully customizable preprocessor for XML, HTML, and plain-text documents. It accepts input in several forms (a file path or URL, a Reader, or a String) and returns either a flat List of words or a List of tokenized sentences.

Author:
Chris Cox, Jenny Finkel

Constructor Summary
DocumentPreprocessor()
          Constructs a preprocessor using the default tokenizer: PTBTokenizer.
DocumentPreprocessor(TokenizerFactory tokenizerFactory)
           
 
Method Summary
 List<List<? extends HasWord>> getSentencesFromHTML(Reader input)
           
 List<List<? extends HasWord>> getSentencesFromHTML(String fileOrURL)
           
 List<List<? extends HasWord>> getSentencesFromText(Reader input)
           
 List<List<? extends HasWord>> getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from a reader.
 List<List<? extends HasWord>> getSentencesFromText(String fileOrURL)
          Reads the file and outputs a list of sentences.
 List<List<? extends HasWord>> getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter)
           
 List<List<? extends HasWord>> getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from text.
 List<List<? extends HasWord>> getSentencesFromXML(Reader input, String splitOnTag, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, String splitOnTag)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<Word> getWordsFromHTML(Reader input)
           
 List<Word> getWordsFromHTML(String fileOrURL)
           
 List<Word> getWordsFromString(String input)
          Gets a list of words from a string.
 List<Word> getWordsFromText(Reader input)
           
 List<Word> getWordsFromText(String fileOrURL)
          Reads the file into a single list of words.
static void main(String[] args)
          This provides a simple test method for DocumentPreprocessor.
 void setEncoding(String encoding)
          Set the character encoding.
 void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
           
 void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
          Sets the factory from which to produce a Tokenizer.
 void usePTBTokenizer()
           
 void useWhitespaceTokenizer()
          Use tokenizers which tokenize on whitespace.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(TokenizerFactory tokenizerFactory)

DocumentPreprocessor

public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.

Method Detail

setEncoding

public void setEncoding(String encoding)
Set the character encoding.

Parameters:
encoding - the name of the character encoding to use
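
The overloads that take a Reader leave decoding to the caller. The sketch below shows one way to open a file as a Reader with an explicit character encoding, using only the standard library. The class EncodingReaderSketch and its methods are hypothetical helpers, not part of DocumentPreprocessor, and how setEncoding is applied internally is not stated here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the library: builds a Reader that decodes
// a file with a named charset, suitable for passing to a Reader overload.
final class EncodingReaderSketch {

    // Opens the file and decodes its bytes with the given encoding name.
    static Reader open(Path file, String encoding) throws IOException {
        InputStream in = Files.newInputStream(file);
        return new BufferedReader(new InputStreamReader(in, Charset.forName(encoding)));
    }

    // Reads the whole Reader into a String and closes it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

A Reader built this way could presumably be handed to getWordsFromText(Reader) or getSentencesFromText(Reader), so that the caller's choice of encoding is explicit.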

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.

Parameters:
newTokenizerFactory - the factory used to produce Tokenizers

usePTBTokenizer

public void usePTBTokenizer()

useWhitespaceTokenizer

public void useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace.
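
To illustrate what tokenizing on whitespace means, here is a minimal sketch in plain Java. It is not the library's tokenizer: the class WhitespaceTokenizeSketch is hypothetical and returns Strings rather than Word objects. The point is that punctuation stays attached to the neighboring word, whereas PTB-style tokenization generally separates it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not the library's implementation: splits text on runs
// of whitespace and returns plain Strings instead of Word objects.
final class WhitespaceTokenizeSketch {

    // Splits the input on one or more whitespace characters, dropping empties.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation is not split off: "fox." remains a single token.
        System.out.println(tokenize("The quick  brown\tfox."));
    }
}
```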


getWordsFromText

public List<Word> getWordsFromText(String fileOrURL)
                            throws IOException
Reads the file into a single list of words.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a list of objects of type Word representing words
Throws:
IOException

getWordsFromText

public List<Word> getWordsFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of objects of type Word representing words

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String fileOrURL)
                                                   throws IOException
Reads the file and outputs a list of sentences.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a List of sentences
Throws:
IOException

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String fileOrURL,
                                                          boolean doPTBEscaping,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
                                                   throws IOException
Throws:
IOException

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of sentences

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String input,
                                                          Function<List<HasWord>,List<HasWord>> escaper,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
                                                   throws IOException
Produce a list of sentences from text.

Parameters:
input - the path of a text file or URL
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences have not already been segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
tagDelimiter -
Returns:
a List of sentences
Throws:
IOException

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(Reader input,
                                                          Function<List<HasWord>,List<HasWord>> escaper,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
Produce a list of sentences from a reader.

Parameters:
input - a Reader of text
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences have not already been segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
tagDelimiter -
Returns:
a list of sentences
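
The two modes of sentenceDelimiter, and the meaning of a null escaper, can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: the class SentenceGroupingSketch is hypothetical, it works on Strings rather than HasWord objects, it uses java.util.function.Function as a stand-in for the Function type in the signature, and it assumes the default sentence-final tokens are ".", "?" and "!" and that the delimiter token itself is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustration only, not the library's implementation.
final class SentenceGroupingSketch {

    // Assumed defaults; the library's actual sentence-final tokens may differ.
    static final Set<String> DEFAULT_FINAL = new HashSet<>(Arrays.asList(".", "?", "!"));

    // Groups tokens into sentences. A null delimiter means "split after
    // default sentence-final tokens"; a non-null delimiter means the input is
    // already segmented and that token marks each boundary.
    static List<List<String>> group(List<String> tokens,
                                    Function<List<String>, List<String>> escaper,
                                    String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (sentenceDelimiter != null && token.equals(sentenceDelimiter)) {
                // Pre-segmented input: close the sentence, drop the delimiter (assumption).
                flush(current, sentences, escaper);
                current = new ArrayList<>();
            } else {
                current.add(token);
                if (sentenceDelimiter == null && DEFAULT_FINAL.contains(token)) {
                    flush(current, sentences, escaper);
                    current = new ArrayList<>();
                }
            }
        }
        flush(current, sentences, escaper); // trailing tokens form a last sentence
        return sentences;
    }

    // Adds a non-empty sentence, applying the escaper unless it is null.
    private static void flush(List<String> sentence,
                              List<List<String>> out,
                              Function<List<String>, List<String>> escaper) {
        if (!sentence.isEmpty()) {
            out.add(escaper == null ? sentence : escaper.apply(sentence));
        }
    }
}
```

With a non-null delimiter, sentence-final punctuation no longer triggers a split, so one-sentence-per-line input keeps its original segmentation.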

getWordsFromString

public List<Word> getWordsFromString(String input)
Gets a list of words from a string.

Parameters:
input - string
Returns:
a List of objects of type Word representing words

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         String splitOnTag)
                                                  throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. It escapes sentences using a WordToSentenceProcessor. By default, it does PTB escaping as well.

Parameters:
fileOrURL - the path of an XML file or URL
splitOnTag - the tag which denotes text boundaries
Returns:
a List of sentences
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         String splitOnTag,
                                                         boolean doPTBEscaping)
                                                  throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. It escapes sentences using a WordToSentenceProcessor.

Parameters:
fileOrURL - the path of an XML file or URL
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
a List of sentences
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(Reader input,
                                                         String splitOnTag,
                                                         boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
input - a Reader of XML
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
a List of sentences
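
The idea of taking only the text between the begin and end of a selected tag can be sketched with a regular expression. This is an illustration, not how the library parses XML: the class SplitOnTagSketch is hypothetical, it does not handle nested occurrences of the same tag, and it returns raw text regions rather than tokenized sentences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only, not the library's implementation: extracts the text
// regions enclosed by a chosen tag, ignoring everything outside them.
final class SplitOnTagSketch {

    // Returns the trimmed content of each <splitOnTag ...>...</splitOnTag> element.
    static List<String> textInsideTag(String xml, String splitOnTag) {
        String name = Pattern.quote(splitOnTag);
        // Opening tag with optional attributes, lazy body, matching closing tag.
        Pattern p = Pattern.compile(
            "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>", Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        List<String> regions = new ArrayList<>();
        while (m.find()) {
            regions.add(m.group(1).trim());
        }
        return regions;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Skipped</title><p>First one.</p><p>Second one.</p></doc>";
        // Only the content of the <p> elements is kept.
        System.out.println(textInsideTag(xml, "p"));
    }
}
```

Each extracted region would then be tokenized and split into sentences, with PTB escaping applied when doPTBEscaping is true.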

getWordsFromHTML

public List<Word> getWordsFromHTML(String fileOrURL)
                            throws IOException
Throws:
IOException

getWordsFromHTML

public List<Word> getWordsFromHTML(Reader input)

getSentencesFromHTML

public List<List<? extends HasWord>> getSentencesFromHTML(String fileOrURL)
                                                   throws IOException
Throws:
IOException

getSentencesFromHTML

public List<List<? extends HasWord>> getSentencesFromHTML(Reader input)

main

public static void main(String[] args)
                 throws IOException
This provides a simple test method for DocumentPreprocessor. Usage: DocumentPreprocessor -file filename [-xml tag|-html] [-noSplitSentence]

Throws:
IOException


Stanford NLP Group