java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
Fully customizable preprocessor for XML, HTML, and plain text documents.
It accepts any of several input formats and returns a List of tokenized strings.
Constructor Summary | |
---|---|
DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer. |
|
DocumentPreprocessor(boolean suppressEscaping)
Constructs a preprocessor using the default tokenizer, PTBTokenizer, and sets the suppressEscaping flag. |
|
DocumentPreprocessor(TokenizerFactory<? extends HasWord> tokenizerFactory)
|
Method Summary | |
---|---|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromHTML(java.io.Reader input)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromHTML(java.lang.String fileOrURL)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.io.Reader input)
Gets sentences of text from a Reader. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.io.Reader input,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String sentenceDelimiter,
int tagDelimiter)
Produce a list of sentences from a Reader. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.io.Reader input,
java.lang.String sentenceDelimiter)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String fileOrURL)
Reads a file or URL and outputs a list of sentences. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String fileOrURL,
boolean doPTBEscaping,
java.lang.String sentenceDelimiter,
int tagDelimiter)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String input,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String sentenceDelimiter,
int tagDelimiter)
Produce a list of sentences from text. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.io.Reader input,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String splitOnTag,
java.lang.String sentenceDelimiter)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.io.Reader input,
java.lang.String splitOnTag,
java.lang.String sentenceDelimiter,
boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String splitOnTag)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String splitOnTag,
java.lang.String sentenceDelimiter)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
java.lang.String splitOnTag)
Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
java.lang.String splitOnTag,
boolean doPTBEscaping)
Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
java.lang.String splitOnTag,
java.lang.String sentenceDelimiter,
boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
java.util.List<Word> |
getWordsFromHTML(java.io.Reader input)
|
java.util.List<Word> |
getWordsFromHTML(java.lang.String fileOrURL)
|
java.util.List<Word> |
getWordsFromString(java.lang.String input)
Gets a list of words from a string. |
java.util.List<Word> |
getWordsFromText(java.io.Reader input)
|
java.util.List<Word> |
getWordsFromText(java.lang.String fileOrURL)
Reads the file into a single list of words. |
static void |
main(java.lang.String[] args)
This provides a simple test method for DocumentPreprocessor. |
void |
setEncoding(java.lang.String encoding)
Set the character encoding. |
void |
setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
|
void |
setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. |
void |
usePTBTokenizer()
|
void |
useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DocumentPreprocessor(TokenizerFactory<? extends HasWord> tokenizerFactory)
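public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.
public DocumentPreprocessor(boolean suppressEscaping)
Constructs a preprocessor using the default tokenizer, PTBTokenizer, and sets the suppressEscaping flag.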
Method Detail |
---|
public void setEncoding(java.lang.String encoding)
encoding
- The character encoding used by Readers
public void setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
public void usePTBTokenizer()
public void useWhitespaceTokenizer()
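As a rough illustration of what tokenizing on whitespace means, the sketch below splits text on runs of whitespace using only the JDK. The class `WhitespaceTokenizeSketch` and its `tokenize` method are hypothetical names chosen for this example; they are not part of this API, and the tokenizer installed by `useWhitespaceTokenizer()` may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a plain-JDK whitespace split, not the library's tokenizer. */
class WhitespaceTokenizeSketch {

    /** Splits text into tokens on runs of whitespace, dropping empty tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            // An empty or all-whitespace input yields one empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Dr., Smith's, dog, barked.]
        System.out.println(tokenize("Dr. Smith's dog  barked.\n"));
    }
}
```

In this sketch punctuation stays attached to the neighbouring word, which is the main practical difference from a tokenizer that separates punctuation into its own tokens.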
public java.util.List<Word> getWordsFromText(java.lang.String fileOrURL) throws java.io.IOException
fileOrURL
- the path of a text file or URL
java.io.IOException
public java.util.List<Word> getWordsFromText(java.io.Reader input)
input
- a Reader of text
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String fileOrURL) throws java.io.IOException
fileOrURL
- the path of a text file or URL
List<List<? extends HasWord>>
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String fileOrURL, boolean doPTBEscaping, java.lang.String sentenceDelimiter, int tagDelimiter) throws java.io.IOException
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.io.Reader input)
input
- a Reader of text
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String input, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String sentenceDelimiter, int tagDelimiter) throws java.io.IOException
input
- the path to the filename or URL
escaper
- a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter
- If null, the sentences have not yet been segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
java.io.IOException
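The two modes of `sentenceDelimiter` can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's implementation: tokens are modelled as `String` rather than `HasWord`, the default sentence-final tokens are assumed to be ".", "?" and "!", and the explicit delimiter token is assumed to be dropped from the output. `SentenceDelimiterSketch` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: models the null / non-null behaviour of sentenceDelimiter. */
class SentenceDelimiterSketch {

    // Assumed defaults for this sketch; the library's actual defaults may differ.
    static final Set<String> DEFAULT_FINAL = new HashSet<>(Arrays.asList(".", "?", "!"));

    static List<List<String>> split(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String tok : tokens) {
            if (sentenceDelimiter != null) {
                // Pre-segmented input: the delimiter token closes the sentence
                // and (by assumption) is not kept.
                if (tok.equals(sentenceDelimiter)) {
                    if (!current.isEmpty()) {
                        sentences.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(tok);
                }
            } else {
                // Unsegmented input: a default sentence-final token closes
                // the sentence and is kept as its last token.
                current.add(tok);
                if (DEFAULT_FINAL.contains(tok)) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a closing delimiter
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList("Hi", ".", "Bye", "!"), null));
        System.out.println(split(Arrays.asList("a", "b", "|", "c"), "|"));
    }
}
```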
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.io.Reader input, java.lang.String sentenceDelimiter)
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.io.Reader input, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String sentenceDelimiter, int tagDelimiter)
input
- The input Reader
escaper
- A Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter
- If null, the sentences have not yet been segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
tagDelimiter
- A character, the rightmost instance of which in a token is taken to separate the word from a POS tag. A negative number if there are no POS tags to separate off.
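The `tagDelimiter` rule (split at the rightmost occurrence, or not at all when the value is negative) can be sketched with the JDK's `String.lastIndexOf(int)`. The class `TagDelimiterSketch` and the method `splitWordAndTag` are hypothetical names for this illustration, and returning a two-element array is a choice made here, not something this API does.

```java
import java.util.Arrays;

/** Illustration only: separates a word from its POS tag at the rightmost delimiter. */
class TagDelimiterSketch {

    /**
     * Returns {word, tag}. The tag is null when tagDelimiter is negative
     * or when the delimiter does not occur in the token.
     */
    static String[] splitWordAndTag(String token, int tagDelimiter) {
        if (tagDelimiter < 0) {
            return new String[] {token, null}; // no POS tags to separate off
        }
        int idx = token.lastIndexOf(tagDelimiter); // rightmost instance
        if (idx < 0) {
            return new String[] {token, null};
        }
        return new String[] {token.substring(0, idx), token.substring(idx + 1)};
    }

    public static void main(String[] args) {
        // prints [1/2, CD]
        System.out.println(Arrays.toString(splitWordAndTag("1/2/CD", '/')));
        // prints [dog_NN, null]
        System.out.println(Arrays.toString(splitWordAndTag("dog_NN", -1)));
    }
}
```

Using the rightmost instance matters for tokens that contain the delimiter themselves, such as the fraction "1/2" tagged as "1/2/CD".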
public java.util.List<Word> getWordsFromString(java.lang.String input)
input
- string
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, java.lang.String splitOnTag) throws java.io.IOException
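Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. Sentences are determined using a WordToSentenceProcessor.
By default, it does PTBEscaping as well.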
splitOnTag
- the tag which denotes text boundaries
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, java.lang.String splitOnTag, boolean doPTBEscaping) throws java.io.IOException
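Returns a list of sentences contained in an XML file or URL, occurring between the begin and end of a selected tag. Sentences are determined using a WordToSentenceProcessor.
By default, it does PTBEscaping as well.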
splitOnTag
- the tag which denotes text boundaries
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, java.lang.String splitOnTag, java.lang.String sentenceDelimiter, boolean doPTBEscaping) throws java.io.IOException
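Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. Sentences are determined using a WordToSentenceProcessor.
splitOnTag
- the tag which denotes text boundaries
doPTBEscaping
- whether to escape PTB tokens using a PTBEscapingProcessor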
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.io.Reader input, java.lang.String splitOnTag, java.lang.String sentenceDelimiter, boolean doPTBEscaping)
splitOnTag
- the tag which denotes text boundaries
sentenceDelimiter
- The text that separates sentences
doPTBEscaping
- whether to escape PTB tokens using a PTBEscapingProcessor
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String splitOnTag) throws java.io.IOException
escaper
- An escaper to use.
splitOnTag
- the tag which denotes text boundaries
java.io.IOException
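An escaper is a function from a list of words to an escaped list of words, with null meaning that no escaping is done. The sketch below shows that shape with Penn Treebank style bracket escapes. It makes several assumptions: `java.util.function.Function` stands in for the `Function` type named in the signatures, `String` stands in for `HasWord`, and `EscaperSketch`, `BRACKET_ESCAPER` and `applyEscaper` are hypothetical names. It covers brackets only and is not the library's PTBEscapingProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: a list-to-list escaper in the Penn Treebank bracket style. */
class EscaperSketch {

    static final Map<String, String> PTB_BRACKETS = new HashMap<>();
    static {
        PTB_BRACKETS.put("(", "-LRB-");
        PTB_BRACKETS.put(")", "-RRB-");
        PTB_BRACKETS.put("[", "-LSB-");
        PTB_BRACKETS.put("]", "-RSB-");
        PTB_BRACKETS.put("{", "-LCB-");
        PTB_BRACKETS.put("}", "-RCB-");
    }

    /** Replaces bracket tokens with their escaped forms; other tokens pass through. */
    static final Function<List<String>, List<String>> BRACKET_ESCAPER = words -> {
        List<String> out = new ArrayList<>(words.size());
        for (String w : words) {
            out.add(PTB_BRACKETS.getOrDefault(w, w));
        }
        return out;
    };

    /** Applies the escaper, treating null as "no escaping", as the parameter docs describe. */
    static List<String> applyEscaper(Function<List<String>, List<String>> escaper,
                                     List<String> words) {
        return escaper == null ? words : escaper.apply(words);
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("see", "(", "below", ")");
        // prints [see, -LRB-, below, -RRB-]
        System.out.println(applyEscaper(BRACKET_ESCAPER, sentence));
        // prints [see, (, below, )]
        System.out.println(applyEscaper(null, sentence));
    }
}
```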
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String splitOnTag, java.lang.String sentenceDelimiter) throws java.io.IOException
fileOrURL
- The filename or URL to get input from
escaper
- An escaper to use on each sentence.
splitOnTag
- The XML element which denotes text boundaries to be processed. This is a regular expression which should match the element name(s) (i.e., specified without the angle brackets).
sentenceDelimiter
- A String that will split sentences, including the special values "newline" or "onePerElement"
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.io.Reader input, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String splitOnTag, java.lang.String sentenceDelimiter)
input
- The Reader to get input from
escaper
- An escaper to use on each sentence.
splitOnTag
- The XML element which denotes text boundaries to be processed. This is a regular expression which should match the element name(s) (i.e., specified without the angle brackets).
sentenceDelimiter
- A String that will split sentences, including the special values "newline" or "onePerElement"
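Because `splitOnTag` is a regular expression over element names, and the description of `main` says it is applied with the `matches()` method, the whole element name has to match rather than a substring. A minimal sketch of that check with `java.util.regex` follows; `SplitOnTagSketch` and `selects` are hypothetical names, and whether the library matches case-sensitively is not stated here, so the sketch simply uses the regex as given.

```java
import java.util.regex.Pattern;

/** Illustration only: full-match test of an element name against splitOnTag. */
class SplitOnTagSketch {

    /** True when the whole element name (no angle brackets) matches the regex. */
    static boolean selects(String splitOnTag, String elementName) {
        return Pattern.compile(splitOnTag).matcher(elementName).matches();
    }

    public static void main(String[] args) {
        System.out.println(selects("TITLE|P", "TITLE")); // prints true
        System.out.println(selects("TITLE|P", "P"));     // prints true
        System.out.println(selects("TITLE|P", "PRE"));   // prints false
    }
}
```

The last case shows why full matching matters: "PRE" begins with "P", but it is not selected by 'TITLE|P'.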
public java.util.List<Word> getWordsFromHTML(java.lang.String fileOrURL) throws java.io.IOException
java.io.IOException
public java.util.List<Word> getWordsFromHTML(java.io.Reader input)
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromHTML(java.lang.String fileOrURL) throws java.io.IOException
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromHTML(java.io.Reader input)
public static void main(java.lang.String[] args) throws java.io.IOException
This provides a simple test method for DocumentPreprocessor. A filename is required. The code doesn't currently run as a filter.
tag is the element name of the XML from which to extract text. It can be a regular expression, which is applied to the element name with the matches() method, such as 'TITLE|P'. The -noSplitSentence flag suppresses the normal splitting into sentences using PTBTokenizer and WordToSentenceProcessor.
args
- Command-line arguments
java.io.IOException
- If file isn't openable, etc.