edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  extended by edu.stanford.nlp.process.DocumentPreprocessor

public class DocumentPreprocessor
extends Object

Fully customizable preprocessor for XML, HTML, and PLAIN text documents. Can take any of a number of input formats and return a List of tokenized strings.

Author:
Chris Cox, Jenny Finkel

Constructor Summary
DocumentPreprocessor()
          Constructs a preprocessor using the default tokenizer: PTBTokenizer.
DocumentPreprocessor(boolean suppressEscaping)
          Constructs a preprocessor using the default tokenizer: PTBTokenizer.
DocumentPreprocessor(TokenizerFactory<? extends HasWord> tokenizerFactory)
           
 
Method Summary
 List<List<? extends HasWord>> getSentencesFromHTML(Reader input)
           
 List<List<? extends HasWord>> getSentencesFromHTML(String fileOrURL)
           
 List<List<? extends HasWord>> getSentencesFromText(Reader input)
           
 List<List<? extends HasWord>> getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from a Reader.
 List<List<? extends HasWord>> getSentencesFromText(Reader input, String sentenceDelimiter)
           
 List<List<? extends HasWord>> getSentencesFromText(String fileOrURL)
          Reads a file or URL and outputs a list of sentences.
 List<List<? extends HasWord>> getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter)
           
 List<List<? extends HasWord>> getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from text.
 List<List<? extends HasWord>> getSentencesFromXML(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String splitOnTag, String sentenceDelimiter)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(Reader input, String splitOnTag, String sentenceDelimiter, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, Function<List<HasWord>,List<HasWord>> escaper, String splitOnTag)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, Function<List<HasWord>,List<HasWord>> escaper, String splitOnTag, String sentenceDelimiter)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, String splitOnTag)
          Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag.
 List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL, String splitOnTag, String sentenceDelimiter, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List<Word> getWordsFromHTML(Reader input)
           
 List<Word> getWordsFromHTML(String fileOrURL)
           
 List<Word> getWordsFromString(String input)
          Gets a list of words from a string.
 List<Word> getWordsFromText(Reader input)
           
 List<Word> getWordsFromText(String fileOrURL)
          Reads the file into a single list of words.
static void main(String[] args)
          This provides a simple test method for DocumentPreprocessor.
 void setEncoding(String encoding)
          Set the character encoding.
 void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
           
 void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
          Sets the factory from which to produce a Tokenizer.
 void usePTBTokenizer()
           
 void useWhitespaceTokenizer()
          Use tokenizers which tokenize on whitespace.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(TokenizerFactory<? extends HasWord> tokenizerFactory)

DocumentPreprocessor

public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.


DocumentPreprocessor

public DocumentPreprocessor(boolean suppressEscaping)
Constructs a preprocessor using the default tokenizer, PTBTokenizer, and sets the suppressEscaping flag.

Method Detail

setEncoding

public void setEncoding(String encoding)
Set the character encoding.

Parameters:
encoding - The character encoding used by Readers
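
As a rough illustration of what a character-encoding setting of this kind affects, the sketch below decodes the same bytes through a Reader under two different charsets, using only java.io and java.nio. The names EncodingSketch and readAll are hypothetical and are not part of DocumentPreprocessor; how the class applies the encoding internally is not described here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of DocumentPreprocessor.
final class EncodingSketch {

    /** Decodes the given bytes through a Reader that uses the named encoding. */
    static String readAll(byte[] bytes, String encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), Charset.forName(encoding))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            // An in-memory stream should not fail, but the Reader API declares it.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Reading UTF-8 bytes with a mismatched encoding such as ISO-8859-1 yields a different string, which is why the encoding should match the input before tokenization.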

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.


usePTBTokenizer

public void usePTBTokenizer()

useWhitespaceTokenizer

public void useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace.
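
To make "tokenize on whitespace" concrete, here is a minimal sketch in plain Java. It only illustrates the general idea; the class WhitespaceTokenizeSketch and its tokenize method are hypothetical and do not show how the library's whitespace tokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DocumentPreprocessor.
final class WhitespaceTokenizeSketch {

    /** Splits text on runs of whitespace; punctuation stays attached to its word. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // "".split(...) yields one empty string
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

In this sketch, punctuation is never separated from the word it touches, so "world!" remains a single token.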


getWordsFromText

public List<Word> getWordsFromText(String fileOrURL)
                            throws IOException
Reads the file into a single list of words.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a list of objects of type Word representing words
Throws:
IOException

getWordsFromText

public List<Word> getWordsFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of objects of type Word representing words

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String fileOrURL)
                                                   throws IOException
Reads a file or URL and outputs a list of sentences.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a list of sentences, each a List<? extends HasWord>
Throws:
IOException

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String fileOrURL,
                                                          boolean doPTBEscaping,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
                                                   throws IOException
Throws:
IOException

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of sentences

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(String input,
                                                          Function<List<HasWord>,List<HasWord>> escaper,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
                                                   throws IOException
Produce a list of sentences from text.

Parameters:
input - the path to a file, or a URL
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, sentences are not already segmented and are split using the default sentence delimiters; if non-null, sentences have already been segmented and are delimited by this token.
Returns:
a List of sentences
Throws:
IOException
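
The non-null sentenceDelimiter case described above can be pictured with the following sketch, which groups an already tokenized stream into sentences at each delimiter token. It is an illustration only: SentenceDelimiterSketch and splitOnDelimiter are hypothetical names, plain Strings stand in for HasWord objects, and dropping the delimiter token from the output is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DocumentPreprocessor.
final class SentenceDelimiterSketch {

    /**
     * Groups tokens into sentences, starting a new sentence after each
     * occurrence of sentenceDelimiter. The delimiter itself is dropped
     * (an assumption made for this sketch).
     */
    static List<List<String>> splitOnDelimiter(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals(sentenceDelimiter)) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing sentence without a final delimiter
        }
        return sentences;
    }
}
```

The null case, where default sentence-final punctuation decides the boundaries, is not covered by this sketch.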

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(Reader input,
                                                          String sentenceDelimiter)

getSentencesFromText

public List<List<? extends HasWord>> getSentencesFromText(Reader input,
                                                          Function<List<HasWord>,List<HasWord>> escaper,
                                                          String sentenceDelimiter,
                                                          int tagDelimiter)
Produce a list of sentences from a Reader.

Parameters:
input - The input Reader
escaper - A Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, sentences are not already segmented and are split using the default sentence delimiters; if non-null, sentences have already been segmented and are delimited by this token.
tagDelimiter - A character, the rightmost instance of which in a token is taken to separate the word from a POS tag. A negative number if there are no POS tags to separate off.
Returns:
A list of sentences
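
The tagDelimiter rule above (the rightmost instance of the character separates the word from its POS tag, and a negative value means there are no tags) can be sketched in plain Java as follows. TagDelimiterSketch and splitWordAndTag are hypothetical names, and the two-element array result is a choice made for this illustration, not the library's representation.

```java
// Hypothetical helper, not part of DocumentPreprocessor.
final class TagDelimiterSketch {

    /**
     * Returns {word, tag}. The split happens at the rightmost occurrence of
     * tagDelimiter; tag is null when tagDelimiter is negative or when the
     * character does not occur in the token.
     */
    static String[] splitWordAndTag(String token, int tagDelimiter) {
        if (tagDelimiter < 0) {
            return new String[] {token, null}; // no POS tags to separate off
        }
        int index = token.lastIndexOf(tagDelimiter);
        if (index < 0) {
            return new String[] {token, null};
        }
        return new String[] {token.substring(0, index), token.substring(index + 1)};
    }
}
```

Using the rightmost occurrence matters for tokens that contain the delimiter themselves: with '/' as the delimiter, "1/2/CD" splits into the word "1/2" and the tag "CD".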

getWordsFromString

public List<Word> getWordsFromString(String input)
Gets a list of words from a string.

Parameters:
input - string
Returns:
a List of objects of type Word representing words

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         String splitOnTag)
                                                  throws IOException
Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. It escapes sentences using a WordToSentenceProcessor. By default, it does PTBEscaping as well.

Parameters:
splitOnTag - the tag which denotes text boundaries
Returns:
A list of sentences
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         String splitOnTag,
                                                         boolean doPTBEscaping)
                                                  throws IOException
Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. It escapes sentences using a WordToSentenceProcessor. By default, it does PTBEscaping as well.

Parameters:
splitOnTag - the tag which denotes text boundaries
Returns:
A list of sentences
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         String splitOnTag,
                                                         String sentenceDelimiter,
                                                         boolean doPTBEscaping)
                                                  throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. It escapes sentences using a WordToSentenceProcessor.

Parameters:
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
A list of sentences contained in an XML file
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(Reader input,
                                                         String splitOnTag,
                                                         String sentenceDelimiter,
                                                         boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
splitOnTag - the tag which denotes text boundaries
sentenceDelimiter - The text that separates sentences
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
A list of sentences contained in an XML file

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         Function<List<HasWord>,List<HasWord>> escaper,
                                                         String splitOnTag)
                                                  throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
escaper - An escaper to use.
splitOnTag - the tag which denotes text boundaries
Returns:
A list of sentences contained in an XML file
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(String fileOrURL,
                                                         Function<List<HasWord>,List<HasWord>> escaper,
                                                         String splitOnTag,
                                                         String sentenceDelimiter)
                                                  throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
fileOrURL - The filename or URL to get input from
escaper - An escaper to use on each sentence.
splitOnTag - The XML element which denotes text boundaries to be processed. This is a regular expression which should match the element name(s) (i.e., specified without the angle brackets).
sentenceDelimiter - A String that will split sentences, including the special values "newline" or "onePerElement"
Returns:
A list of sentences contained in an XML file
Throws:
IOException

getSentencesFromXML

public List<List<? extends HasWord>> getSentencesFromXML(Reader input,
                                                         Function<List<HasWord>,List<HasWord>> escaper,
                                                         String splitOnTag,
                                                         String sentenceDelimiter)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
input - The Reader to get input from
escaper - An escaper to use on each sentence.
splitOnTag - The XML element which denotes text boundaries to be processed. This is a regular expression which should match the element name(s) (i.e., specified without the angle brackets).
sentenceDelimiter - A String that will split sentences, including the special values "newline" or "onePerElement"
Returns:
A list of sentences contained in an XML file

getWordsFromHTML

public List<Word> getWordsFromHTML(String fileOrURL)
                            throws IOException
Throws:
IOException

getWordsFromHTML

public List<Word> getWordsFromHTML(Reader input)

getSentencesFromHTML

public List<List<? extends HasWord>> getSentencesFromHTML(String fileOrURL)
                                                   throws IOException
Throws:
IOException

getSentencesFromHTML

public List<List<? extends HasWord>> getSentencesFromHTML(Reader input)

main

public static void main(String[] args)
                 throws IOException
This provides a simple test method for DocumentPreprocessor.
Usage: java DocumentPreprocessor -file filename [-xml tag|-html] [-noSplitSentence]

A filename is required. The code doesn't run as a filter currently.

tag is the element name of the XML from which to extract text. It can be a regular expression, which is matched against the element name with the matches() method, such as 'TITLE|P'. The -noSplitSentence flag suppresses the normal splitting into sentences using PTBTokenizer and WordToSentenceProcessor.

Parameters:
args - Command-line arguments
Throws:
IOException - If file isn't openable, etc.
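
Because the tag expression is applied with matches(), it has to match the whole element name and not just part of it. The sketch below shows that behavior with java.util.regex; SplitOnTagSketch and selects are hypothetical names, and the case-sensitive comparison is simply the default of java.util.regex, not a statement about how DocumentPreprocessor compiles the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of DocumentPreprocessor.
final class SplitOnTagSketch {

    /** True if the whole element name matches the tag regular expression. */
    static boolean selects(String tagRegex, String elementName) {
        return Pattern.compile(tagRegex).matcher(elementName).matches();
    }

    public static void main(String[] args) {
        String tagRegex = "TITLE|P";
        for (String name : new String[] {"TITLE", "P", "PRE", "SUBTITLE"}) {
            System.out.println(name + " selected: " + selects(tagRegex, name));
        }
    }
}
```

With 'TITLE|P', the names TITLE and P are selected, while PRE and SUBTITLE are not, since matches() does not accept a partial match.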


Stanford NLP Group