java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
extends Object
implements Iterable<List<HasWord>>
Produces a list of sentences from either a plain text or XML document.
Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, then whitespace tokenization is assumed.
Adding a new document type requires two steps:
NOTE: This implementation should not use external libraries since it is used in the parser.
Nested Class Summary | |
---|---|
static class | DocumentPreprocessor.DocType |
Constructor Summary | |
---|---|
DocumentPreprocessor(Reader input) | Constructs a preprocessor from an existing input stream. |
DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t) | |
DocumentPreprocessor(String docPath) | Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL. |
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t) | |
Method Summary | |
---|---|
void | close(): Closes the underlying reader. |
Iterator<List<HasWord>> | iterator(): Returns sentences until the document is exhausted. |
static void | main(String[] args): This provides a simple test method for DocumentPreprocessor. |
void | setElementDelimiter(String s): Only read text from between these XML tokens if in XML mode. |
void | setEncoding(String encoding): Set the character encoding. |
void | setEscaper(Function<List<HasWord>,List<HasWord>> e): Set an escaper. |
void | setSentenceDelimiter(String s): Make the processor assume that the document is already delimited by the supplied parameter. |
void | setSentenceFinalPuncWords(String[] sentenceFinalPuncWords): Sets the end-of-sentence delimiters. |
void | setTagDelimiter(String s): Split POS tags from tokens. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory): Sets the factory from which to produce a Tokenizer. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
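public DocumentPreprocessor(Reader input)
Constructs a preprocessor from an existing input stream.
Parameters:
input - An existing reader
public DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(String docPath)
Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL.
Parameters:
docPath -
public DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t)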
Method Detail |
---|
public void setEncoding(String encoding) throws IllegalCharsetNameException
Set the character encoding.
Parameters:
encoding - The character encoding used by Readers
Throws:
IllegalCharsetNameException - If the JVM does not support the named character set.
public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
Sets the end-of-sentence delimiters.
Parameters:
sentenceFinalPuncWords -
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
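As an illustration of what "assumed to be tokenized" implies for the input, the sketch below splits already-tokenized text on runs of whitespace using only the JDK. It is not the library's implementation; the class name WhitespaceSplitSketch and the method splitOnWhitespace are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of the Stanford NLP API. */
final class WhitespaceSplitSketch {

    /** Splits pre-tokenized text on runs of whitespace and drops empty strings. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input already has punctuation separated by spaces, so no
        // further tokenization logic is needed.
        System.out.println(splitOnWhitespace("The cat sat on the mat ."));
    }
}
```

With input prepared this way, each whitespace-separated piece is taken as one token, so punctuation must already stand apart from the words.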
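public void setEscaper(Function<List<HasWord>,List<HasWord>> e)
Set an escaper.
Parameters:
e - The escaper
public void setSentenceDelimiter(String s)
Make the processor assume that the document is already delimited by the supplied parameter.
Parameters:
s - The sentence delimiter
public void setTagDelimiter(String s)
Split POS tags from tokens.
Parameters:
s - POS tag delimiter
public void setElementDelimiter(String s)
Only read text from between these XML tokens if in XML mode.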
public Iterator<List<HasWord>> iterator()
Returns sentences until the document is exhausted.
Specified by:
iterator in interface Iterable<List<HasWord>>
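To make the idea of sentences delimited by sentence-final punctuation words concrete, here is a minimal plain-JDK sketch that groups a token list into sentences, closing a sentence whenever a token equals one of the supplied words. It is an assumption-laden illustration rather than the class's actual algorithm: the names SentenceSplitSketch and splitSentences are hypothetical, tokens are plain String values instead of HasWord, and the delimiter set {".", "?", "!"} is chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of the Stanford NLP API. */
final class SentenceSplitSketch {

    /**
     * Groups tokens into sentences. A sentence ends after any token that
     * equals one of the sentence-final words; trailing tokens without a
     * final delimiter form a last sentence.
     */
    static List<List<String>> splitSentences(List<String> tokens, String[] sentenceFinalPuncWords) {
        Set<String> finals = new HashSet<>(Arrays.asList(sentenceFinalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (finals.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("It", "rained", ".", "Did", "you", "see", "?", "No");
        // Assumed delimiter set for this example only.
        for (List<String> sentence : splitSentences(tokens, new String[] {".", "?", "!"})) {
            System.out.println(sentence);
        }
    }
}
```

Passing a different array changes where sentences end, which is the role the sentenceFinalPuncWords parameter plays in this sketch.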
public void close()
Closes the underlying reader.
public static void main(String[] args)
This provides a simple test method for DocumentPreprocessor. A filename is required. The code does not currently run as a filter.
tag is the element name of the XML from which to extract text. It can be a regular expression that is matched against the element name with the matches() method, such as 'TITLE|P'.
Parameters:
args - Command-line arguments
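Because the tag expression is applied with matches(), the whole element name has to match, not just a part of it. The sketch below shows that behavior with the JDK's String.matches; the class ElementNameMatchSketch and the method selectsElement are hypothetical names for this example, and the assumption is that the library's check behaves like a full-string regular expression match.

```java
/** Illustrative only: a hypothetical helper, not part of the Stanford NLP API. */
final class ElementNameMatchSketch {

    /** Returns true when the entire element name matches the tag expression. */
    static boolean selectsElement(String tagPattern, String elementName) {
        return elementName.matches(tagPattern);
    }

    public static void main(String[] args) {
        String tagPattern = "TITLE|P";
        // TITLE and P match in full; PRE and SUBTITLE only contain a match,
        // so a full-string match rejects them.
        for (String name : new String[] {"TITLE", "P", "PRE", "SUBTITLE"}) {
            System.out.println(name + " -> " + selectsElement(tagPattern, name));
        }
    }
}
```

Under this assumption, 'TITLE|P' selects TITLE and P elements only. Matching is also case-sensitive here, so an expression such as 'TITLE|P' would not select a lowercase p element.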