java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
Fully customizable preprocessor for XML, HTML, and plain text documents.
Can take any of a number of input formats and return a List
of tokenized strings.
Constructor Summary | |
---|---|
DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer. |
|
DocumentPreprocessor(TokenizerFactory tokenizerFactory)
|
Method Summary | |
---|---|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromHTML(java.io.Reader input)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromHTML(java.lang.String fileOrURL)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.io.Reader input)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.io.Reader input,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String sentenceDelimiter,
int tagDelimiter)
Produce a list of sentences from a reader. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String fileOrURL)
Reads the file and outputs a list of sentences. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String fileOrURL,
boolean doPTBEscaping,
java.lang.String sentenceDelimiter,
int tagDelimiter)
|
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromText(java.lang.String input,
Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper,
java.lang.String sentenceDelimiter,
int tagDelimiter)
Produce a list of sentences from text. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.io.Reader input,
java.lang.String splitOnTag,
boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
java.lang.String splitOnTag)
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag. |
java.util.List<java.util.List<? extends HasWord>> |
getSentencesFromXML(java.lang.String fileOrURL,
java.lang.String splitOnTag,
boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag. |
java.util.List<Word> |
getWordsFromHTML(java.io.Reader input)
|
java.util.List<Word> |
getWordsFromHTML(java.lang.String fileOrURL)
|
java.util.List<Word> |
getWordsFromString(java.lang.String input)
Gets a list of words from a string. |
java.util.List<Word> |
getWordsFromText(java.io.Reader input)
|
java.util.List<Word> |
getWordsFromText(java.lang.String fileOrURL)
Reads the file into a single list of words. |
static void |
main(java.lang.String[] args)
This provides a simple test method for DocumentPreprocessor. |
void |
setEncoding(java.lang.String encoding)
Set the character encoding. |
void |
setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
|
void |
setTokenizerFactory(TokenizerFactory newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. |
void |
usePTBTokenizer()
|
void |
useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DocumentPreprocessor(TokenizerFactory tokenizerFactory)
public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.
Method Detail |
---|
public void setEncoding(java.lang.String encoding)
Set the character encoding.
Parameters:
encoding -
public void setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
public void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
Parameters:
newTokenizerFactory -
public void usePTBTokenizer()
public void useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace.
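To make the difference between the two tokenizer choices concrete, here is a minimal, standard-library-only sketch of what tokenizing on whitespace means. It is not the library's implementation; the class name `WhitespaceTokenSketch` and its `tokenize` method are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of edu.stanford.nlp.
class WhitespaceTokenSketch {

    // Splits on runs of whitespace. Punctuation stays attached to its word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // an all-blank input yields one empty piece
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [Dr., Smith, didn't, go.]
        System.out.println(tokenize("Dr. Smith didn't go."));
    }
}
```

A Penn Treebank-style tokenizer would typically go further, for example separating the final period and splitting the contraction, so the choice of tokenizer changes what each returned word looks like.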
public java.util.List<Word> getWordsFromText(java.lang.String fileOrURL) throws java.io.IOException
Reads the file into a single list of words.
Parameters:
fileOrURL - the path of a text file or URL
Throws:
java.io.IOException
public java.util.List<Word> getWordsFromText(java.io.Reader input)
Parameters:
input - a Reader of text
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String fileOrURL) throws java.io.IOException
Reads the file and outputs a list of sentences.
Parameters:
fileOrURL - the path of a text file or URL
Returns:
Sentence
Throws:
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String fileOrURL, boolean doPTBEscaping, java.lang.String sentenceDelimiter, int tagDelimiter) throws java.io.IOException
Throws:
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.io.Reader input)
Parameters:
input - a Reader of text
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.lang.String input, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String sentenceDelimiter, int tagDelimiter) throws java.io.IOException
Produce a list of sentences from text.
Parameters:
input - the path to the file, or a URL
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
tagDelimiter -
Throws:
java.io.IOException
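The `escaper` contract described above (a function from a list of words to an escaped list, with null meaning "no escaping") can be sketched with standard-library types. This sketch makes several assumptions: it uses `java.util.function.Function` and plain `String` tokens rather than the library's own `Function` and `HasWord` types, and `EscaperSketch`, `BRACKET_ESCAPER`, and `applyEscaper` are hypothetical names. The bracket replacements follow the common Penn Treebank convention of writing `(` as `-LRB-` and `)` as `-RRB-`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration; not part of edu.stanford.nlp.
class EscaperSketch {

    // Example escaper: rewrites round brackets as Penn Treebank-style tokens.
    static final Function<List<String>, List<String>> BRACKET_ESCAPER = words -> {
        List<String> escaped = new ArrayList<>(words.size());
        for (String w : words) {
            if (w.equals("(")) {
                escaped.add("-LRB-");
            } else if (w.equals(")")) {
                escaped.add("-RRB-");
            } else {
                escaped.add(w);
            }
        }
        return escaped;
    };

    // Mirrors the documented contract: a null escaper means no escaping is done.
    static List<String> applyEscaper(List<String> words,
                                     Function<List<String>, List<String>> escaper) {
        return escaper == null ? words : escaper.apply(words);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("see", "(", "above", ")");
        System.out.println(applyEscaper(words, BRACKET_ESCAPER)); // [see, -LRB-, above, -RRB-]
        System.out.println(applyEscaper(words, null));            // [see, (, above, )]
    }
}
```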
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromText(java.io.Reader input, Function<java.util.List<HasWord>,java.util.List<HasWord>> escaper, java.lang.String sentenceDelimiter, int tagDelimiter)
Produce a list of sentences from a reader.
Parameters:
input - the input
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters; if non-null, the sentences have already been segmented and are delimited with this token.
tagDelimiter -
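The two modes of `sentenceDelimiter` can be illustrated with a small standard-library sketch. It is only a sketch under stated assumptions, not the library's algorithm: the class `SentenceSplitSketch`, its `split` method, and the default set of sentence-final tokens (`.`, `?`, `!`) are all hypothetical, and dropping the delimiter token from the output is a choice made here for clarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of edu.stanford.nlp.
class SentenceSplitSketch {

    // Assumed defaults for this sketch only.
    static final Set<String> DEFAULT_FINAL = new HashSet<>(Arrays.asList(".", "?", "!"));

    static List<List<String>> split(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String tok : tokens) {
            if (sentenceDelimiter != null) {
                // Pre-segmented input: the delimiter token closes a sentence and is dropped.
                if (tok.equals(sentenceDelimiter)) {
                    if (!current.isEmpty()) {
                        sentences.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(tok);
                }
            } else {
                // Unsegmented input: a sentence-final token closes a sentence and is kept.
                current.add(tok);
                if (DEFAULT_FINAL.contains(tok)) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a closing delimiter
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints: [[Hello, world, .], [Bye, !]]
        System.out.println(split(Arrays.asList("Hello", "world", ".", "Bye", "!"), null));
        // Prints: [[one, two], [three]]
        System.out.println(split(Arrays.asList("one", "two", "|", "three"), "|"));
    }
}
```

With a non-null delimiter, punctuation such as `.` no longer ends a sentence, which is the behavior you want when the input has one sentence per line or per marker.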
public java.util.List<Word> getWordsFromString(java.lang.String input)
Gets a list of words from a string.
Parameters:
input - string
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, java.lang.String splitOnTag) throws java.io.IOException
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag.
WordToSentenceProcessor.
By default, it does PTBEscaping as well.
Parameters:
fileOrURL -
splitOnTag - the tag which denotes text boundaries
Throws:
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.lang.String fileOrURL, java.lang.String splitOnTag, boolean doPTBEscaping) throws java.io.IOException
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag.
WordToSentenceProcessor.
Parameters:
fileOrURL -
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Throws:
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromXML(java.io.Reader input, java.lang.String splitOnTag, boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the beginning and end of a selected tag.
Parameters:
input -
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
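The role of `splitOnTag` as a text boundary can be shown with a short standard-library sketch that collects the text found between the opening and closing forms of one tag. This is an assumption-laden illustration, not how the library parses XML: `SplitOnTagSketch` and `textBetween` are hypothetical names, the regular expression is a simplification that does not handle nested occurrences of the same tag, and the extracted text would still need tokenizing and sentence splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of edu.stanford.nlp.
class SplitOnTagSketch {

    // Returns the trimmed text of every <tag ...>...</tag> region, in document order.
    static List<String> textBetween(String xml, String tag) {
        String t = Pattern.quote(tag);
        // DOTALL lets a region span several lines; the reluctant (.*?) stops at the first close tag.
        Pattern p = Pattern.compile(
                "<" + t + "(?:\\s[^>]*)?>(.*?)</" + t + "\\s*>", Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        List<String> regions = new ArrayList<>();
        while (m.find()) {
            regions.add(m.group(1).trim());
        }
        return regions;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Skip me</title><p>First one.</p><p id=\"2\">Second one.</p></doc>";
        // Prints: [First one., Second one.]
        System.out.println(textBetween(xml, "p"));
    }
}
```

Text outside the selected tag, such as the `title` element in the usage above, is ignored, which matches the documented idea that only content between the beginning and end of the chosen tag contributes sentences.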
public java.util.List<Word> getWordsFromHTML(java.lang.String fileOrURL) throws java.io.IOException
Throws:
java.io.IOException
public java.util.List<Word> getWordsFromHTML(java.io.Reader input)
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromHTML(java.lang.String fileOrURL) throws java.io.IOException
Throws:
java.io.IOException
public java.util.List<java.util.List<? extends HasWord>> getSentencesFromHTML(java.io.Reader input)
public static void main(java.lang.String[] args) throws java.io.IOException
This provides a simple test method for DocumentPreprocessor.
Throws:
java.io.IOException