DocumentReader (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.ling.DocumentReader<L>

Type Parameters:

L - label type
```
public class DocumentReader<L>
extends java.lang.Object
```
Basic mechanism for reading in Documents from various input sources. This default implementation can read from strings, files, URLs, and InputStreams, and uses a given TokenizerFactory to turn the text into words. When working with a new data format, write a new DocumentReader subclass to parse it and use it with the existing Document APIs, rather than writing new Document classes. Use the protected fields (in, tokenizerFactory, keepOriginalText) to read text and create documents appropriately. Subclasses should ideally provide constructors similar to those of this class, though only the constructor that takes a Reader is required.

Author:

Joseph Smarr (jsmarr@stanford.edu), Sarah Spikes (sdspikes@cs.stanford.edu) - templatized

Field Summary

Fields

| Modifier and Type | Field and Description |
| --- | --- |
| `protected java.io.BufferedReader` | `in` Reader used to read in document text. |
| `protected boolean` | `keepOriginalText` Whether to keep source text in document along with tokenized words. |
| `protected TokenizerFactory<? extends HasWord>` | `tokenizerFactory` Factory for the Tokenizer used to chop up document text into words. |

Constructor Summary

Constructors
Constructor and Description
`DocumentReader()` Constructs a new DocumentReader without an initial input source.
`DocumentReader(java.io.Reader in)` Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.
`DocumentReader(java.io.Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText)` Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given TokenizerFactory.

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `static java.io.BufferedReader` | `getBufferedReader(java.io.Reader in)` Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. |
| `boolean` | `getKeepOriginalText()` Returns whether created documents will store their source text along with tokenized words. |
| `java.io.Reader` | `getReader()` Returns the reader for the text input source of this DocumentReader. |
| `static java.io.Reader` | `getReader(java.io.File file)` Returns a Reader that reads in the given file. |
| `static java.io.Reader` | `getReader(java.io.InputStream in)` Returns a Reader that reads in the given InputStream. |
| `static java.io.Reader` | `getReader(java.lang.String text)` Returns a Reader that reads in the given text. |
| `static java.io.Reader` | `getReader(java.net.URL url)` Returns a Reader that reads in the given URL. |
| `TokenizerFactory<? extends HasWord>` | `getTokenizerFactory()` Returns the tokenizer factory used to chop up text into words for the documents. |
| `protected BasicDocument<L>` | `parseDocumentText(java.lang.String text)` Creates a new Document for the given text. |
| `BasicDocument<L>` | `readDocument()` Reads the next document's worth of text from the reader and turns it into a Document. |
| `protected java.lang.String` | `readNextDocumentText()` Reads the next document's worth of text from the reader. |
| `static java.lang.String` | `readText(java.io.Reader in)` Returns everything that can be read from the given Reader as a String. |
| `void` | `setKeepOriginalText(boolean keepOriginalText)` Sets whether created documents should store their source text along with tokenized words. |
| `void` | `setReader(java.io.Reader in)` Sets the reader from which to read and create documents. |
| `void` | `setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)` Sets the tokenizer factory used to chop up text into words for the documents. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - in
```
protected java.io.BufferedReader in
```
    Reader used to read in document text. The field is declared as a BufferedReader, so subclasses can call BufferedReader methods such as readLine() on it without a cast.
  - tokenizerFactory
```
protected TokenizerFactory<? extends HasWord> tokenizerFactory
```
    Factory for the Tokenizer used to chop up document text into words.
  - keepOriginalText
```
protected boolean keepOriginalText
```
    Whether to keep source text in document along with tokenized words.
- Constructor Detail
  - DocumentReader
```
public DocumentReader()
```
    Constructs a new DocumentReader without an initial input source. You must call setReader(java.io.Reader) before trying to read any documents. Uses a PTBTokenizerFactory and keeps the original text.
  - DocumentReader
```
public DocumentReader(java.io.Reader in)
```
    Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.
    
    Parameters:
    
    in - The Reader
  - DocumentReader
```
public DocumentReader(java.io.Reader in,
                      TokenizerFactory<? extends HasWord> tokenizerFactory,
                      boolean keepOriginalText)
```
    Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given TokenizerFactory. The default implementation internally buffers the reader if it is not already buffered, so there is no need to pre-wrap the reader in a BufferedReader. This class provides several static getReader methods for conveniently reading from many input sources.
- Method Detail
  - getReader
```
public java.io.Reader getReader()
```
    Returns the reader for the text input source of this DocumentReader.
  - setReader
```
public void setReader(java.io.Reader in)
```
    Sets the reader from which to read and create documents. The default implementation automatically buffers the Reader if it is not already buffered. Subclasses that don't want buffering may want to override this method to set the in field directly.
  - getTokenizerFactory
```
public TokenizerFactory<? extends HasWord> getTokenizerFactory()
```
    Returns the tokenizer factory used to chop up text into words for the documents.
  - setTokenizerFactory
```
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)
```
    Sets the tokenizer factory used to chop up text into words for the documents.
  - getKeepOriginalText
```
public boolean getKeepOriginalText()
```
    Returns whether created documents will store their source text along with tokenized words.
  - setKeepOriginalText
```
public void setKeepOriginalText(boolean keepOriginalText)
```
    Sets whether created documents should store their source text along with tokenized words.
  - readDocument
```
public BasicDocument<L> readDocument()
                              throws java.io.IOException
```
    Reads the next document's worth of text from the reader and turns it into a Document. Default implementation calls readNextDocumentText() and passes it to parseDocumentText(java.lang.String) to create the document. Subclasses may wish to override either or both of those methods to handle custom formats of document collections and individual documents respectively. This method can also be overridden in its entirety to provide custom reading and construction of documents from input text.
    
    Throws:
    
    java.io.IOException
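    
    The two-step flow described above (read the next chunk of text, then parse it) can be sketched without the library. This is only an illustration under stated assumptions: `SketchReader` and `LowercasingSketchReader` are hypothetical stand-ins, not CoreNLP classes; a whitespace split stands in for a real tokenizer; a `List<String>` stands in for `BasicDocument<L>`; and returning null from `readDocument()` when no text remains is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DocumentReader; not the CoreNLP class.
class SketchReader {
    protected BufferedReader in;

    SketchReader(Reader in) {
        // Buffer the reader only if it is not already buffered.
        this.in = (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    // Template method: fetch the next chunk of raw text, then parse it.
    // Returning null at end of input is an assumption of this sketch.
    public List<String> readDocument() throws IOException {
        String text = readNextDocumentText();
        return (text == null) ? null : parseDocumentText(text);
    }

    // Default hook: consume everything left in the reader as one document.
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    // Default hook: a whitespace split stands in for a real tokenizer.
    protected List<String> parseDocumentText(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}

// A subclass that overrides only the per-document parsing hook.
class LowercasingSketchReader extends SketchReader {
    LowercasingSketchReader(Reader in) {
        super(in);
    }

    @Override
    protected List<String> parseDocumentText(String text) {
        return super.parseDocumentText(text.toLowerCase());
    }
}
```

    Overriding a single hook keeps the reading loop in one place, which is the reason the documentation suggests overriding `readNextDocumentText()` or `parseDocumentText(java.lang.String)` before replacing `readDocument()` entirely.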
  - readNextDocumentText
```
protected java.lang.String readNextDocumentText()
                                         throws java.io.IOException
```
    Reads the next document's worth of text from the reader. Default implementation reads all the text. Subclasses wishing to read multiple documents from a single input source should read until the next document delimiter and return the text so far. Returns null if there is no more text to be read.
    
    Throws:
    
    java.io.IOException
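    
    For a source that holds several documents, the "read until the next delimiter" strategy might look like the following sketch. It is written as a standalone static helper rather than an override so that it runs without the library; the class name `DelimitedTextSketch` and the `---` delimiter line are hypothetical choices, not part of the API.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class DelimitedTextSketch {
    // Hypothetical delimiter line; real formats define their own.
    static final String DELIMITER = "---";

    // Returns the text up to the next delimiter line,
    // or null when the input is already exhausted.
    static String readNextDocumentText(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawLine = false;
        String line;
        while ((line = in.readLine()) != null) {
            sawLine = true;
            if (line.equals(DELIMITER)) {
                return sb.toString();
            }
            sb.append(line).append('\n');
        }
        // End of input: return trailing text if any line was read, else null.
        return sawLine ? sb.toString() : null;
    }
}
```

    Calling the helper repeatedly on the same BufferedReader yields one document's text per call, then null once nothing is left.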
  - parseDocumentText
```
protected BasicDocument<L> parseDocumentText(java.lang.String text)
```
    Creates a new Document for the given text. The default implementation tokenizes the text using the current tokenizer factory and stores the words in a new BasicDocument. The text is also stored as the original text in the BasicDocument if keepOriginalText is set. Subclasses may wish to extract additional information from the text and/or return another document subclass with additional metadata.
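    
    The effect of keepOriginalText can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: `ParseSketch` and `SimpleDoc` are hypothetical types, a whitespace split stands in for the configured tokenizer, and representing "text not kept" as null is a choice of this sketch rather than documented BasicDocument behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParseSketch {
    // Hypothetical minimal document: a word list plus optional source text.
    static final class SimpleDoc {
        final List<String> words;
        final String originalText; // null when the source text is not kept

        SimpleDoc(List<String> words, String originalText) {
            this.words = words;
            this.originalText = originalText;
        }
    }

    // Tokenizes the text and optionally retains the untokenized source.
    static SimpleDoc parseDocumentText(String text, boolean keepOriginalText) {
        String trimmed = text.trim();
        List<String> words = trimmed.isEmpty()
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        return new SimpleDoc(words, keepOriginalText ? text : null);
    }
}
```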
  - getBufferedReader
```
public static java.io.BufferedReader getBufferedReader(java.io.Reader in)
```
    Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. Subclasses should use this method before reading from in, for efficiency and/or to read entire lines at a time. Do this only once per reader: a BufferedReader reads ahead and stores more text than was requested, so if you discard it and wrap the original reader in a new one, the buffered text will be missing. In the default DocumentReader implementation, the Reader passed in at construction is already wrapped in a BufferedReader, so in can be used as a BufferedReader without calling this method.
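    
    The wrap-once rule is easy to demonstrate with plain `java.io` classes. The sketch below is not the library's code: `BufferingSketch` is a hypothetical class, and returning null for a null argument is an assumption. Its second method shows the pitfall, where a second wrapper around the same underlying reader finds nothing because the first wrapper already pulled the text into its buffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BufferingSketch {
    // Wrap only when needed; an existing BufferedReader is returned as-is.
    // Passing null through is an assumption of this sketch.
    static BufferedReader getBufferedReader(Reader in) {
        if (in == null) {
            return null;
        }
        return (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    // Demonstrates the pitfall: wrapping the same raw reader twice loses text.
    static String secondWrapperFirstLine(String text) throws IOException {
        Reader raw = new StringReader(text);
        BufferedReader first = new BufferedReader(raw);
        first.readLine(); // fills the buffer with far more than one line
        BufferedReader second = new BufferedReader(raw);
        // For inputs smaller than the buffer, raw is already drained here.
        return second.readLine();
    }
}
```

    With a short input such as two lines of text, the second wrapper returns null even though only one line was consumed through the first wrapper.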
  - readText
```
public static java.lang.String readText(java.io.Reader in)
                                 throws java.io.IOException
```
    Returns everything that can be read from the given Reader as a String. Returns null if the given Reader is null.
    
    Throws:
    
    java.io.IOException
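    
    One way to get the documented behavior (all remaining text as a String, null for a null Reader) is a simple read loop. This is a minimal sketch with a hypothetical class name, not the library's implementation; the buffer size is arbitrary.

```java
import java.io.IOException;
import java.io.Reader;

final class ReadTextSketch {
    // Drains the reader into a String; returns null if the reader is null.
    static String readText(Reader in) throws IOException {
        if (in == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```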
  - getReader
```
public static java.io.Reader getReader(java.lang.String text)
```
    Returns a Reader that reads in the given text.
  - getReader
```
public static java.io.Reader getReader(java.io.File file)
                                throws java.io.FileNotFoundException
```
    Returns a Reader that reads in the given file.
    
    Throws:
    
    java.io.FileNotFoundException
  - getReader
```
public static java.io.Reader getReader(java.net.URL url)
                                throws java.io.IOException
```
    Returns a Reader that reads in the given URL.
    
    Throws:
    
    java.io.IOException
  - getReader
```
public static java.io.Reader getReader(java.io.InputStream in)
```
    Returns a Reader that reads in the given InputStream.
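    
    The four static getReader overloads map naturally onto standard `java.io` classes. The sketch below shows one plausible mapping under stated assumptions: `ReaderSourcesSketch` is a hypothetical class, and decoding bytes as UTF-8 is a choice made here, since the documentation above does not say which character encoding the library uses.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class ReaderSourcesSketch {
    // In-memory text.
    static Reader getReader(String text) {
        return new StringReader(text);
    }

    // Byte stream, decoded as UTF-8 (an assumption of this sketch).
    static Reader getReader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // File on disk; a missing file surfaces as FileNotFoundException.
    static Reader getReader(File file) throws FileNotFoundException {
        return getReader(new FileInputStream(file));
    }

    // Remote or local URL; opening the stream may throw IOException.
    static Reader getReader(URL url) throws IOException {
        return getReader(url.openStream());
    }
}
```

    Because every overload returns a plain Reader, the result can be handed to the Reader-based constructor or to setReader(java.io.Reader) regardless of where the text came from.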
