L - label type
public class DocumentReader<L>
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
protected java.io.BufferedReader | in - Reader used to read in document text. |
protected boolean | keepOriginalText - Whether to keep source text in document along with tokenized words. |
protected TokenizerFactory<? extends HasWord> | tokenizerFactory - Tokenizer used to chop up document text into words. |
Constructor and Description |
---|
DocumentReader() - Constructs a new DocumentReader without an initial input source. |
DocumentReader(java.io.Reader in) - Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text. |
DocumentReader(java.io.Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText) - Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given Tokenizer. |
Modifier and Type | Method and Description |
---|---|
static java.io.BufferedReader | getBufferedReader(java.io.Reader in) - Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. |
boolean | getKeepOriginalText() - Returns whether created documents will store their source text along with tokenized words. |
java.io.Reader | getReader() - Returns the reader for the text input source of this DocumentReader. |
static java.io.Reader | getReader(java.io.File file) - Returns a Reader that reads in the given file. |
static java.io.Reader | getReader(java.io.InputStream in) - Returns a Reader that reads in the given InputStream. |
static java.io.Reader | getReader(java.lang.String text) - Returns a Reader that reads in the given text. |
static java.io.Reader | getReader(java.net.URL url) - Returns a Reader that reads in the given URL. |
TokenizerFactory<? extends HasWord> | getTokenizerFactory() - Returns the tokenizer used to chop up text into words for the documents. |
protected BasicDocument<L> | parseDocumentText(java.lang.String text) - Creates a new Document for the given text. |
BasicDocument<L> | readDocument() - Reads the next document's worth of text from the reader and turns it into a Document. |
protected java.lang.String | readNextDocumentText() - Reads the next document's worth of text from the reader. |
static java.lang.String | readText(java.io.Reader in) - Returns everything that can be read from the given Reader as a String. |
void | setKeepOriginalText(boolean keepOriginalText) - Sets whether created documents should store their source text along with tokenized words. |
void | setReader(java.io.Reader in) - Sets the reader from which to read and create documents. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory) - Sets the tokenizer used to chop up text into words for the documents. |
protected java.io.BufferedReader in
protected TokenizerFactory<? extends HasWord> tokenizerFactory
protected boolean keepOriginalText
public DocumentReader()
Constructs a new DocumentReader without an initial input source. Call setReader(java.io.Reader) before trying to read any documents. Uses a PTBTokenizer and keeps original text.
public DocumentReader(java.io.Reader in)
in - The Reader
public DocumentReader(java.io.Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText)
public java.io.Reader getReader()
public void setReader(java.io.Reader in)
public TokenizerFactory<? extends HasWord> getTokenizerFactory()
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)
public boolean getKeepOriginalText()
public void setKeepOriginalText(boolean keepOriginalText)
public BasicDocument<L> readDocument() throws java.io.IOException
Calls readNextDocumentText() and passes the result to
parseDocumentText(java.lang.String) to create the document.
Subclasses may wish to override either or both of those methods to handle
custom formats of document collections and individual documents
respectively. This method can also be overridden in its entirety to
provide custom reading and construction of documents from input text.
java.io.IOException
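The override points described for readDocument() follow a template-method shape, which can be sketched without the library. The classes below are hypothetical stand-ins, not the real DocumentReader: a document is simplified to a List<String>, whitespace splitting replaces the tokenizer, and returning null at end of input is an assumption of this sketch rather than documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that only mirrors the method names listed above.
class SketchDocumentReader {
    protected BufferedReader in;

    SketchDocumentReader(Reader in) {
        this.in = new BufferedReader(in);
    }

    // Template method: fetch the raw text, then turn it into a "document".
    // Assumption: null signals that no more documents are available.
    public List<String> readDocument() throws IOException {
        String text = readNextDocumentText();
        return text == null ? null : parseDocumentText(text);
    }

    // Default in this sketch: everything left in the reader is one document.
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    // Whitespace splitting stands in for a real tokenizer.
    protected List<String> parseDocumentText(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}

// Custom collection format: documents are separated by blank lines.
class BlankLineDocumentReader extends SketchDocumentReader {
    BlankLineDocumentReader(Reader in) {
        super(in);
    }

    @Override
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (sb.length() > 0) {
                    break;      // a blank line ends the current document
                }
                continue;       // skip blank lines before a document starts
            }
            sb.append(line).append('\n');
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}
```

Only readNextDocumentText() is overridden here, so the collection format changes while the parsing of each individual document stays the same.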
protected java.lang.String readNextDocumentText() throws java.io.IOException
java.io.IOException
protected BasicDocument<L> parseDocumentText(java.lang.String text)
public static java.io.BufferedReader getBufferedReader(java.io.Reader in)
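The wrap-or-return behavior described for getBufferedReader can be sketched with java.io alone. The class name BufferedReaderSketch is hypothetical, and null handling is left out because the summary above does not specify it.

```java
import java.io.BufferedReader;
import java.io.Reader;

// Hypothetical helper class; only the method name comes from the listing above.
class BufferedReaderSketch {
    static BufferedReader getBufferedReader(Reader in) {
        // Reuse an existing BufferedReader instead of adding a second buffer layer.
        if (in instanceof BufferedReader) {
            return (BufferedReader) in;
        }
        return new BufferedReader(in);
    }
}
```

Returning the same instance means a caller who already holds a BufferedReader keeps reading from the same buffer, so no buffered characters are stranded in an extra wrapper.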
public static java.lang.String readText(java.io.Reader in) throws java.io.IOException
java.io.IOException
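Reading everything from a Reader into a String, as readText is described to do, might look like the following. This is a minimal sketch under assumptions: the class name ReadTextSketch and the 4096-character buffer are illustrative, and the library's actual implementation may differ, for example by reading line by line.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class; only the method name comes from the listing above.
class ReadTextSketch {
    static String readText(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];   // arbitrary chunk size for this sketch
        int n;
        // read() returns -1 once the reader is exhausted.
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```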
public static java.io.Reader getReader(java.lang.String text)
public static java.io.Reader getReader(java.io.File file) throws java.io.FileNotFoundException
java.io.FileNotFoundException
public static java.io.Reader getReader(java.net.URL url) throws java.io.IOException
java.io.IOException
public static java.io.Reader getReader(java.io.InputStream in)
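The four static getReader overloads each adapt a different source to a java.io.Reader. One plausible way to build them from standard java.io classes is sketched below. The class name ReaderFactorySketch is hypothetical, and decoding as UTF-8 is an assumption of this sketch; the listing above does not say which character encoding the library uses.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class mirroring the overload signatures listed above.
class ReaderFactorySketch {
    // In-memory text: no I/O involved, so no checked exception.
    static Reader getReader(String text) {
        return new StringReader(text);
    }

    // File source: opening the file can fail with FileNotFoundException.
    static Reader getReader(File file) throws FileNotFoundException {
        return new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
    }

    // URL source: opening the connection can fail with IOException.
    static Reader getReader(URL url) throws IOException {
        return new InputStreamReader(url.openStream(), StandardCharsets.UTF_8);
    }

    // Byte stream source: decoded as UTF-8 (assumption of this sketch).
    static Reader getReader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

The checked exceptions line up with the signatures above: only the File and URL overloads declare one, since only they open a resource.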