java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
extends Object
implements Iterable<List<HasWord>>
Produces a list of sentences from either a plain text or XML document.
Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, then whitespace tokenization is assumed.
Adding a new document type requires two steps:
NOTE: This implementation should not use external libraries since it is used in the parser.
Nested Class Summary | |
---|---|
static class | DocumentPreprocessor.DocType |
Constructor Summary |
---|
DocumentPreprocessor(Reader input): Constructs a preprocessor from an existing input stream. |
DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t) |
DocumentPreprocessor(String docPath): Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL. |
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t) |
Method Summary | |
---|---|
Iterator<List<HasWord>> | iterator(): Returns sentences until the document is exhausted. |
static void | main(String[] args): This provides a simple test method for DocumentPreprocessor. |
void | setElementDelimiter(String s): Only read text from between these XML tokens if in XML mode. |
void | setEncoding(String encoding): Set the character encoding. |
void | setEscaper(Function<List<HasWord>,List<HasWord>> e): Set an escaper. |
void | setSentenceDelimiter(String s): Make the processor assume that the document is already delimited by the supplied parameter. |
void | setSentenceFinalPuncWords(String[] sentenceFinalPuncWords): Sets the end-of-sentence delimiters. |
void | setTagDelimiter(String s): Split tags from tokens. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory): Sets the factory from which to produce a Tokenizer. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DocumentPreprocessor(Reader input)
Parameters:
input - An existing reader
public DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(String docPath)
Parameters:
docPath -
public DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t)
Method Detail |
---|
public void setEncoding(String encoding) throws IllegalCharsetNameException
Parameters:
encoding - The character encoding used by Readers
Throws:
IllegalCharsetNameException - If the JVM does not support the named character set.
public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
For newline tokenization, use the argument {"\n"}.
Parameters:
sentenceFinalPuncWords -
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
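
The interaction between pre-tokenized input and sentence-final punctuation words can be illustrated with a small standalone sketch. It is not the DocumentPreprocessor implementation and uses no Stanford classes: the class name `SentenceSplitSketch`, the method `split`, and the delimiter set {".", "?", "!"} are all assumptions made for the example. It treats the text as already tokenized (split on whitespace only) and ends a sentence after each token found in the delimiter set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the DocumentPreprocessor implementation.
final class SentenceSplitSketch {

    /** Splits on whitespace, then closes a sentence after each sentence-final token. */
    static List<List<String>> split(String text, Set<String> sentenceFinalPuncWords) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input splits into a single empty string
            }
            current.add(token);
            if (sentenceFinalPuncWords.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens with no final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Assumed delimiter set, chosen only for this example.
        Set<String> finalPunc = new HashSet<>(Arrays.asList(".", "?", "!"));
        for (List<String> sentence : split("It works . Does it ? Yes", finalPunc)) {
            System.out.println(sentence);
        }
    }
}
```

Because the sketch discards newlines while splitting on whitespace, it does not cover the {"\n"} newline case mentioned above.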
public void setEscaper(Function<List<HasWord>,List<HasWord>> e)
Parameters:
e - The escaper
public void setSentenceDelimiter(String s)
Parameters:
s - The sentence delimiter
public void setTagDelimiter(String s)
Note that for strings that contain two or more instances of the tag delimiter, the last instance is treated as the split point.
The tag delimiter should not contain any characters that must be escaped in a Java regex.
Parameters:
s - POS tag delimiter
public void setElementDelimiter(String s)
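
The last-instance rule described for setTagDelimiter can be shown with a short standalone sketch. This is an assumption-laden illustration, not the library's code: the class `TagSplitSketch` and method `splitLast` are hypothetical names, and the sketch uses `String.lastIndexOf` rather than a regex, so it is not subject to the escaping restriction noted above.

```java
// Hypothetical illustration only; not the DocumentPreprocessor implementation.
final class TagSplitSketch {

    /**
     * Splits a "word<delimiter>tag" string at the LAST occurrence of the delimiter.
     * Returns {word, tag}; the tag is null when the delimiter does not occur.
     */
    static String[] splitLast(String token, String delimiter) {
        int index = token.lastIndexOf(delimiter);
        if (index < 0) {
            return new String[] { token, null };
        }
        String word = token.substring(0, index);
        String tag = token.substring(index + delimiter.length());
        return new String[] { word, tag };
    }

    public static void main(String[] args) {
        // The word itself contains the delimiter, so only the last "/" splits.
        String[] parts = splitLast("1/2/CD", "/");
        System.out.println(parts[0] + " -> " + parts[1]); // prints "1/2 -> CD"
    }
}
```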
public Iterator<List<HasWord>> iterator()
Specified by:
iterator in interface Iterable<List<HasWord>>
public static void main(String[] args)
A filename is required. The code does not currently run as a filter.
tag is the name of the XML element from which to extract text. It may be a regular expression, which is applied to the element name with the matches() method, such as 'TITLE|P'.
Parameters:
args - Command-line arguments
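
Because the tag expression is applied with matches(), it must match the whole element name, not just part of it. The following standalone sketch shows that behaviour with the standard `java.util.regex` classes; the class `ElementMatchSketch` and method `selects` are hypothetical names introduced for the example and are not part of DocumentPreprocessor.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the DocumentPreprocessor implementation.
final class ElementMatchSketch {

    /** Returns true when the entire element name matches the tag expression. */
    static boolean selects(String tagRegex, String elementName) {
        return Pattern.compile(tagRegex).matcher(elementName).matches();
    }

    public static void main(String[] args) {
        String tagRegex = "TITLE|P";
        for (String name : new String[] { "TITLE", "P", "PRE", "SUBTITLE" }) {
            System.out.println(name + ": " + selects(tagRegex, name));
        }
    }
}
```

With 'TITLE|P', the names TITLE and P are selected, while PRE and SUBTITLE are not, since a partial match is not enough for matches().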