public class DocumentPreprocessor extends Object implements Iterable<List<HasWord>>
Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, then whitespace tokenization is assumed.
Adding a new document type requires two steps:
NOTE: This implementation should not use external libraries since it is used in the parser.
Modifier and Type | Class and Description |
---|---|
static class | DocumentPreprocessor.DocType |
Modifier and Type | Field and Description |
---|---|
static String[] | DEFAULT_SENTENCE_DELIMS |
Constructor and Description |
---|
DocumentPreprocessor(Reader input): Constructs a preprocessor from an existing input stream. |
DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t) |
DocumentPreprocessor(String docPath) |
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t) |
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t, String encoding): Constructs a preprocessor from a file at a path, which can be either a filesystem location, a classpath entry, or a URL. |
Modifier and Type | Method and Description |
---|---|
Iterator<List<HasWord>> | iterator(): Returns sentences until the document is exhausted. |
static void | main(String[] args): This provides a simple test method for DocumentPreprocessor. |
void | setElementDelimiter(String s): Only read text from inside these XML elements if in XML mode. |
void | setEscaper(java.util.function.Function<List<HasWord>,List<HasWord>> e): Set an escaper. |
void | setKeepEmptySentences(boolean keepEmptySentences): Set whether or not the tokenizer keeps empty sentences in whitespace mode. |
void | setSentenceDelimiter(String s): Make the processor assume that the document is already delimited by the supplied parameter. |
void | setSentenceFinalPuncWords(String[] sentenceFinalPuncWords): Sets the end-of-sentence delimiters. |
void | setTagDelimiter(String s): Split tags from tokens. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory): Sets the factory from which to produce a Tokenizer. |
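Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.lang.Iterable: forEach, spliterator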
public static final String[] DEFAULT_SENTENCE_DELIMS
public DocumentPreprocessor(Reader input)
input - An existing reader
public DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(String docPath)
public DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t, String encoding)
docPath - The path
encoding - The character encoding used by Readers
public void setKeepEmptySentences(boolean keepEmptySentences)
public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
For newline tokenization, use the argument {"\n"}.
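The effect of end-of-sentence delimiters can be pictured with a small standalone sketch. It is not the DocumentPreprocessor implementation: the class name `SentenceSplitSketch`, the delimiter set, and the choice to keep each delimiter token at the end of its sentence are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the DocumentPreprocessor implementation.
final class SentenceSplitSketch {

    // Groups tokens into sentences, closing a sentence after each delimiter token.
    // Assumption: the delimiter token stays in the sentence it ends.
    static List<List<String>> split(List<String> tokens, String[] sentenceFinalPuncWords) {
        Set<String> delims = new HashSet<>(Arrays.asList(sentenceFinalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (delims.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Trailing tokens without a final delimiter still form a sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical delimiter set chosen for this example.
        String[] delims = {".", "?", "!"};
        List<String> tokens = Arrays.asList("It", "works", ".", "Does", "it", "?", "Yes");
        // prints: [[It, works, .], [Does, it, ?], [Yes]]
        System.out.println(split(tokens, delims));
    }
}
```

Passing `{"\n"}` to the same sketch would end a sentence at every newline token, which is the idea behind the newline-tokenization argument mentioned above.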
sentenceFinalPuncWords -
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
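As a rough picture of what "already tokenized" input means, the sketch below splits a line on whitespace and does nothing else. The class `WhitespaceTokensSketch` is hypothetical and only illustrates the idea; it does not show how DocumentPreprocessor reads its input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the DocumentPreprocessor implementation.
final class WhitespaceTokensSketch {

    // Treats every whitespace-separated chunk as one token, with no further splitting.
    static List<String> tokens(String preTokenizedLine) {
        List<String> out = new ArrayList<>();
        for (String chunk : preTokenizedLine.trim().split("\\s+")) {
            if (!chunk.isEmpty()) { // a blank line yields no tokens
                out.add(chunk);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: [Do, n't, stop, .]
        System.out.println(tokens("Do n't stop ."));
    }
}
```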
public void setEscaper(java.util.function.Function<List<HasWord>,List<HasWord>> e)
e - The escaper
public void setSentenceDelimiter(String s)
s - The sentence delimiter
public void setTagDelimiter(String s)
Note that for strings that contain two or more instances of the tag delimiter, the last instance is treated as the split point.
The tag delimiter should not contain any characters that must be escaped in a Java regex.
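The last-instance rule can be sketched in a few lines of plain Java. The helper `splitTag` and the class `TagSplitSketch` are hypothetical, and the sketch uses a literal `lastIndexOf` search rather than a regex, so it does not reproduce the escaping restriction noted above.

```java
// Illustrative sketch only; not the DocumentPreprocessor implementation.
final class TagSplitSketch {

    // Splits "word<delimiter>tag" at the LAST occurrence of the delimiter.
    // Returns {word, tag}; tag is null when the delimiter does not occur.
    static String[] splitTag(String token, String delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (idx < 0) {
            return new String[] {token, null};
        }
        return new String[] {
            token.substring(0, idx),
            token.substring(idx + delimiter.length())
        };
    }

    public static void main(String[] args) {
        // The token contains the delimiter twice; only the last one separates the tag.
        String[] parts = splitTag("1/2/CD", "/");
        // prints: 1/2 -> CD
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```

A delimiter such as `/` or `_` has no special meaning in a Java regex, whereas one such as `|` or `.` would need escaping, which is the kind of delimiter the note above advises against.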
s - POS tag delimiter
public void setElementDelimiter(String s)
public Iterator<List<HasWord>> iterator()
public static void main(String[] args) throws IOException
A filename is required. The code doesn't run as a filter currently.
tag is the element name of the XML from which to extract text. It can be a regular expression which is called on the element with the matches() method, such as 'TITLE|P'.
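Because the expression is applied with matches(), it has to match the whole element name and not just part of it. The following sketch shows that behavior with the standard `java.util.regex` classes; the class `ElementMatchSketch` and the method `selects` are hypothetical names, and the element names are made up for the example.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the DocumentPreprocessor implementation.
final class ElementMatchSketch {

    // True only when the ENTIRE element name matches the expression,
    // which is what Matcher.matches() requires.
    static boolean selects(String elementRegex, String elementName) {
        return Pattern.compile(elementRegex).matcher(elementName).matches();
    }

    public static void main(String[] args) {
        System.out.println(selects("TITLE|P", "TITLE")); // prints: true
        System.out.println(selects("TITLE|P", "P"));     // prints: true
        System.out.println(selects("TITLE|P", "PRE"));   // prints: false (only a prefix matches)
    }
}
```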
args - Command-line arguments
Throws: IOException