edu.stanford.nlp.ling
Class DocumentReader<L>

java.lang.Object
  extended by edu.stanford.nlp.ling.DocumentReader<L>
Type Parameters:
L - label type

public class DocumentReader<L>
extends java.lang.Object

Basic mechanism for reading in Documents from various input sources. This default implementation can read from strings, files, URLs, and InputStreams and can use a given TokenizerFactory to turn the text into words. When working with a new data format, make a new DocumentReader subclass to parse it and then use it with the existing Document APIs (rather than having to make new Document classes). Use the protected fields (in, tokenizerFactory, keepOriginalText) to read text and create documents appropriately. Subclasses should ideally provide constructors similar to those of this class, though only the constructor that takes a Reader is required.

Author:
Joseph Smarr (jsmarr@stanford.edu), Sarah Spikes (sdspikes@cs.stanford.edu) - templatized

Field Summary
protected  java.io.BufferedReader in
          Reader used to read in document text.
protected  boolean keepOriginalText
          Whether to keep source text in document along with tokenized words.
protected  TokenizerFactory<? extends HasWord> tokenizerFactory
          TokenizerFactory whose tokenizers chop up document text into words.
 
Constructor Summary
DocumentReader()
          Constructs a new DocumentReader without an initial input source.
DocumentReader(java.io.Reader in)
          Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.
DocumentReader(java.io.Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText)
          Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given TokenizerFactory.
 
Method Summary
static java.io.BufferedReader getBufferedReader(java.io.Reader in)
          Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader.
 boolean getKeepOriginalText()
          Returns whether created documents will store their source text along with tokenized words.
 java.io.Reader getReader()
          Returns the reader for the text input source of this DocumentReader.
static java.io.Reader getReader(java.io.File file)
          Returns a Reader that reads in the given file.
static java.io.Reader getReader(java.io.InputStream in)
          Returns a Reader that reads in the given InputStream.
static java.io.Reader getReader(java.lang.String text)
          Returns a Reader that reads in the given text.
static java.io.Reader getReader(java.net.URL url)
          Returns a Reader that reads in the given URL.
 TokenizerFactory<? extends HasWord> getTokenizerFactory()
          Returns the tokenizer used to chop up text into words for the documents.
protected  BasicDocument<L> parseDocumentText(java.lang.String text)
          Creates a new Document for the given text.
 BasicDocument<L> readDocument()
          Reads the next document's worth of text from the reader and turns it into a Document.
protected  java.lang.String readNextDocumentText()
          Reads the next document's worth of text from the reader.
static java.lang.String readText(java.io.Reader in)
          Returns everything that can be read from the given Reader as a String.
 void setKeepOriginalText(boolean keepOriginalText)
          Sets whether created documents should store their source text along with tokenized words.
 void setReader(java.io.Reader in)
          Sets the reader from which to read and create documents.
 void setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)
          Sets the tokenizer used to chop up text into words for the documents.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

in

protected java.io.BufferedReader in
Reader used to read in document text. In the default implementation this is always a BufferedReader, as the declared type reflects, so subclasses can call BufferedReader methods such as readLine() on it without a cast.


tokenizerFactory

protected TokenizerFactory<? extends HasWord> tokenizerFactory
TokenizerFactory whose tokenizers chop up document text into words.


keepOriginalText

protected boolean keepOriginalText
Whether to keep source text in document along with tokenized words.

Constructor Detail

DocumentReader

public DocumentReader()
Constructs a new DocumentReader without an initial input source. You must call setReader(java.io.Reader) before trying to read any documents. Uses a PTBTokenizerFactory and keeps the original text.


DocumentReader

public DocumentReader(java.io.Reader in)
Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.

Parameters:
in - The Reader

DocumentReader

public DocumentReader(java.io.Reader in,
                      TokenizerFactory<? extends HasWord> tokenizerFactory,
                      boolean keepOriginalText)
Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given TokenizerFactory. The default implementation will internally buffer the reader if it is not already buffered, so there is no need to pre-wrap the reader with a BufferedReader. This class provides several static getReader methods for conveniently reading from strings, files, URLs, and InputStreams.

Method Detail

getReader

public java.io.Reader getReader()
Returns the reader for the text input source of this DocumentReader.


setReader

public void setReader(java.io.Reader in)
Sets the reader from which to read and create documents. The default implementation automatically buffers the Reader if it is not already buffered. Subclasses that do not want buffering may override this method to assign the in field directly.


getTokenizerFactory

public TokenizerFactory<? extends HasWord> getTokenizerFactory()
Returns the tokenizer used to chop up text into words for the documents.


setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)
Sets the tokenizer used to chop up text into words for the documents.


getKeepOriginalText

public boolean getKeepOriginalText()
Returns whether created documents will store their source text along with tokenized words.


setKeepOriginalText

public void setKeepOriginalText(boolean keepOriginalText)
Sets whether created documents should store their source text along with tokenized words.


readDocument

public BasicDocument<L> readDocument()
                              throws java.io.IOException
Reads the next document's worth of text from the reader and turns it into a Document. Default implementation calls readNextDocumentText() and passes it to parseDocumentText(java.lang.String) to create the document. Subclasses may wish to override either or both of those methods to handle custom formats of document collections and individual documents respectively. This method can also be overridden in its entirety to provide custom reading and construction of documents from input text.
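
The two-step flow described above can be illustrated with a minimal, library-free sketch. SketchReader is a hypothetical stand-in, not the actual DocumentReader source: a "document" is reduced to a list of tokens, whitespace splitting replaces the TokenizerFactory, and returning null when no text remains is an assumption about end-of-input behaviour.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for DocumentReader; a "document" is just a token list. */
class SketchReader {
    protected BufferedReader in;

    SketchReader(Reader in) {
        // Buffer the reader only if it is not already buffered.
        this.in = (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    /** Template method: fetch the next chunk of text, then parse it. */
    public List<String> readDocument() throws IOException {
        String text = readNextDocumentText();
        // Assumption: no remaining text means no document.
        return (text == null) ? null : parseDocumentText(text);
    }

    /** Default: read everything that is left; null when nothing remains. */
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    /** Default: whitespace tokenization (the real class uses a TokenizerFactory). */
    protected List<String> parseDocumentText(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}
```

A subclass that overrides only readNextDocumentText() changes how the input is split into documents, while one that overrides only parseDocumentText() changes how each document is built; readDocument() itself stays untouched in both cases.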

Throws:
java.io.IOException

readNextDocumentText

protected java.lang.String readNextDocumentText()
                                         throws java.io.IOException
Reads the next document's worth of text from the reader. Default implementation reads all the text. Subclasses wishing to read multiple documents from a single input source should read until the next document delimiter and return the text so far. Returns null if there is no more text to be read.
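
As a sketch of the multi-document case, the helper below reads up to the next delimiter line and returns null once the input is exhausted. DelimitedDocs and the choice of a whole-line delimiter are hypothetical; a real subclass would place equivalent logic in its readNextDocumentText() override and read from the in field.

```java
import java.io.BufferedReader;
import java.io.IOException;

/** Hypothetical helper, not part of the library. */
class DelimitedDocs {
    /**
     * Returns the text up to (not including) the next line equal to delimiter,
     * or null once the reader has nothing left.
     */
    static String readNextDocumentText(BufferedReader in, String delimiter) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAnyLine = false;
        String line;
        while ((line = in.readLine()) != null) {
            sawAnyLine = true;
            if (line.equals(delimiter)) {
                break; // end of this document; the delimiter itself is dropped
            }
            sb.append(line).append('\n');
        }
        return sawAnyLine ? sb.toString() : null;
    }
}
```

Calling the helper repeatedly on the same BufferedReader yields one document per call, which is why the reader must be buffered once and reused rather than re-wrapped between calls.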

Throws:
java.io.IOException

parseDocumentText

protected BasicDocument<L> parseDocumentText(java.lang.String text)
Creates a new Document for the given text. The default implementation tokenizes the text using the current tokenizerFactory and stores the words in a new BasicDocument. The text is also stored as the original text in the BasicDocument if keepOriginalText is true. Subclasses may wish to extract additional information from the text and/or return another document subclass with additional metadata.


getBufferedReader

public static java.io.BufferedReader getBufferedReader(java.io.Reader in)
Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. Subclasses should use this method before reading from in for efficiency and/or to read entire lines at a time. Note that this should only be done once per reader: a buffered reader reads ahead and holds the extra text in its buffer, so if you discard it and wrap the original reader again, the buffered text will be missing. In the default DocumentReader implementation, the Reader passed in at construction is already wrapped in a BufferedReader and stored in the in field, so you can use in directly without calling this method.
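
The wrap-once warning can be reproduced with the standard library alone. BufferingSketch is a hypothetical class: its getBufferedReader follows the behaviour described above but is not the library source, and secondWrapperSees exists only to demonstrate the pitfall.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical demonstration class, not part of the library. */
class BufferingSketch {
    /** Wrap-or-return: reuse an existing BufferedReader, otherwise create one. */
    static BufferedReader getBufferedReader(Reader in) {
        return (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    /** Returns what a second wrapper reads after a first wrapper has read one line. */
    static String secondWrapperSees(String text) throws IOException {
        Reader raw = new StringReader(text);
        BufferedReader first = getBufferedReader(raw);
        first.readLine(); // fills the buffer from raw, taking far more than one line
        // raw itself is not a BufferedReader, so this creates a NEW wrapper.
        BufferedReader second = getBufferedReader(raw);
        // For inputs shorter than the buffer, raw is already drained here.
        return second.readLine();
    }
}
```

For a short input such as two lines of text, the second wrapper returns null even though only one line was consumed through the first wrapper; the remaining line is held in the discarded wrapper's buffer.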


readText

public static java.lang.String readText(java.io.Reader in)
                                 throws java.io.IOException
Returns everything that can be read from the given Reader as a String. Returns null if the given Reader is null.
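
A minimal sketch of the same contract using only java.io is shown below. ReadTextSketch is hypothetical and is not the library's implementation; in particular, returning an empty string for an empty (non-null) reader is an assumption.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch of the readText contract. */
class ReadTextSketch {
    /** Reads all remaining characters from in; returns null if in is null. */
    static String readText(Reader in) throws IOException {
        if (in == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read(char[]) returns -1 once the end of the stream is reached.
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```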

Throws:
java.io.IOException

getReader

public static java.io.Reader getReader(java.lang.String text)
Returns a Reader that reads in the given text.


getReader

public static java.io.Reader getReader(java.io.File file)
                                throws java.io.FileNotFoundException
Returns a Reader that reads in the given file.

Throws:
java.io.FileNotFoundException

getReader

public static java.io.Reader getReader(java.net.URL url)
                                throws java.io.IOException
Returns a Reader that reads in the given URL.

Throws:
java.io.IOException

getReader

public static java.io.Reader getReader(java.io.InputStream in)
Returns a Reader that reads in the given InputStream.
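
The four static getReader overloads correspond to standard java.io constructions, sketched below. ReaderSources and its method names are hypothetical, and the explicit UTF-8 charset is an assumption: this page does not state which encoding the library uses for files, URLs, and streams.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical equivalents of the getReader overloads; UTF-8 is an assumption. */
class ReaderSources {
    static Reader fromString(String text) {
        return new StringReader(text);
    }

    static Reader fromFile(File file) throws FileNotFoundException {
        return new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
    }

    static Reader fromUrl(URL url) throws IOException {
        return new InputStreamReader(url.openStream(), StandardCharsets.UTF_8);
    }

    static Reader fromStream(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Small helper for the example: drains a reader into a String. */
    static String slurp(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```

Any of the resulting readers could then be handed to a DocumentReader constructor or to setReader, which buffer it as described above.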



Stanford NLP Group