java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
Produces a list of sentences from either a plain text or XML document.
Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, then whitespace tokenization is assumed.
Adding a new document type requires two steps:
NOTE: This implementation should not use external libraries since it is used in the parser.
Nested Class Summary | |
---|---|
static class | DocumentPreprocessor.DocType |
Constructor Summary | |
---|---|
DocumentPreprocessor(java.io.Reader input) | Constructs a preprocessor from an existing input stream. |
DocumentPreprocessor(java.io.Reader input, DocumentPreprocessor.DocType t) | |
DocumentPreprocessor(java.lang.String docPath) | Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL. |
DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t) | |
Method Summary | |
---|---|
java.util.Iterator<java.util.List<HasWord>> | iterator(): Returns sentences until the document is exhausted. |
static void | main(java.lang.String[] args): This provides a simple test method for DocumentPreprocessor. |
void | setElementDelimiter(java.lang.String s): Only read text from inside these XML elements if in XML mode. |
void | setEncoding(java.lang.String encoding): Set the character encoding. |
void | setEscaper(Function<java.util.List<HasWord>,java.util.List<HasWord>> e): Set an escaper. |
void | setSentenceDelimiter(java.lang.String s): Make the processor assume that the document is already delimited by the supplied parameter. |
void | setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords): Sets the end-of-sentence delimiters. |
void | setTagDelimiter(java.lang.String s): Split tags from tokens. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory): Sets the factory from which to produce a Tokenizer. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DocumentPreprocessor(java.io.Reader input)
input - An existing reader
public DocumentPreprocessor(java.io.Reader input, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(java.lang.String docPath)
docPath
public DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t)
Method Detail |
---|
public void setEncoding(java.lang.String encoding) throws java.nio.charset.IllegalCharsetNameException
encoding - The character encoding used by Readers
java.nio.charset.IllegalCharsetNameException - If the JVM does not support the named character set.
public void setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
For newline tokenization, use the argument {"\n"}.
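As a rough illustration of what end-of-sentence delimiters do, the following self-contained sketch groups a token stream into sentences, closing a sentence whenever a token equals one of the supplied words. It does not call DocumentPreprocessor; the class name `SentenceFinalSplitSketch` and the method `split` are hypothetical, and keeping the delimiter token inside the sentence is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of edu.stanford.nlp.
class SentenceFinalSplitSketch {

    // Groups tokens into sentences; a sentence ends after any token found in finalWords.
    static List<List<String>> split(List<String> tokens, String[] finalWords) {
        Set<String> finals = new HashSet<>(Arrays.asList(finalWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token); // assumption: the delimiter token stays in its sentence
            if (finals.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens with no final delimiter
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("It", "works", ".", "Does", "it", "?", "Yes");
        System.out.println(split(tokens, new String[] {".", "?", "!"}));
    }
}
```

Passing `{"\n"}` as the delimiter array in this sketch would end a sentence at every newline token, which mirrors the newline case described above.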
sentenceFinalPuncWords
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
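To make the null-factory case concrete, the sketch below shows one plausible reading of "already tokenized": tokens are taken to be the whitespace-separated pieces of the input, with no further splitting of punctuation. This is a standalone illustration that does not use DocumentPreprocessor; the class `WhitespaceTokenizeSketch` and the method `tokenize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of edu.stanford.nlp.
class WhitespaceTokenizeSketch {

    // Splits a line on runs of whitespace and drops empty pieces.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation is only separated if the input already separates it.
        System.out.println(tokenize("  Pre-tokenized text , like this .  "));
    }
}
```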
public void setEscaper(Function<java.util.List<HasWord>,java.util.List<HasWord>> e)
e - The escaper
public void setSentenceDelimiter(java.lang.String s)
s - The sentence delimiter
public void setTagDelimiter(java.lang.String s)
Note that for strings that contain two or more instances of the tag delimiter, the last instance is treated as the split point.
The tag delimiter should not contain any characters that must be escaped in a Java regex.
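The last-instance rule can be sketched without the library. The example below splits a token such as `1/2/CD` at the final `/`, so the word keeps any earlier delimiter characters. The class `TagSplitSketch` and the method `splitTag` are hypothetical; this sketch uses `String.lastIndexOf` for clarity, whereas the regex caveat above suggests the actual class builds a regular expression from the delimiter.

```java
// Hypothetical sketch, not part of edu.stanford.nlp.
class TagSplitSketch {

    // Returns {word, tag}; tag is null when the delimiter does not occur in the token.
    static String[] splitTag(String token, String delimiter) {
        int idx = token.lastIndexOf(delimiter); // the last instance is the split point
        if (idx < 0) {
            return new String[] {token, null};
        }
        String word = token.substring(0, idx);
        String tag = token.substring(idx + delimiter.length());
        return new String[] {word, tag};
    }

    public static void main(String[] args) {
        String[] parts = splitTag("1/2/CD", "/");
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```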
s - POS tag delimiter
public void setElementDelimiter(java.lang.String s)
public java.util.Iterator<java.util.List<HasWord>> iterator()
iterator in interface java.lang.Iterable<java.util.List<HasWord>>
public static void main(java.lang.String[] args) throws java.io.IOException
A filename is required. The code doesn't run as a filter currently.
tag is the element name of the XML from which to extract text. It can be a regular expression which is called on the element with the matches() method, such as 'TITLE|P'.
args - Command-line arguments
java.io.IOException
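Because the tag pattern is applied with matches(), it has to match the entire element name, not just a part of it. The following standalone sketch shows that behavior with `java.util.regex.Pattern`; the class `ElementFilterSketch` and the method `selects` are hypothetical, and whether the real class does any case folding is not stated here, so the sketch matches case-sensitively.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of edu.stanford.nlp.
class ElementFilterSketch {

    // True when the whole element name matches the tag pattern.
    static boolean selects(String tagPattern, String elementName) {
        return Pattern.matches(tagPattern, elementName);
    }

    public static void main(String[] args) {
        String pattern = "TITLE|P";
        for (String name : new String[] {"TITLE", "P", "PRE", "title"}) {
            System.out.println(name + ": " + selects(pattern, name));
        }
    }
}
```

With the pattern `TITLE|P`, the names `TITLE` and `P` are selected, while `PRE` is not, since a full-string match is required.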