edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  extended by edu.stanford.nlp.process.DocumentPreprocessor
All Implemented Interfaces:
Iterable<List<HasWord>>

public class DocumentPreprocessor
extends Object
implements Iterable<List<HasWord>>

Produces a list of sentences from either a plain text or XML document.

Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, the document is assumed to be tokenized already and is split on whitespace only.

Adding a new document type requires two steps:

  1. Add a new DocType.
  2. Create an iterator for the new DocType and modify the iterator() function to return the new iterator.
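The two steps above can be sketched in miniature. This is only an illustration of the pattern, not the class's actual source: DocTypeSketch, the CSV constant, and the use of String in place of List<HasWord> are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical miniature of the two-step pattern; not the Stanford implementation. */
final class DocTypeSketch implements Iterable<String> {
    /** Step 1: add a constant for the new document type (CSV is invented here). */
    enum DocType { PLAIN, XML, CSV }

    private final String text;
    private final DocType type;

    DocTypeSketch(String text, DocType type) {
        this.text = text;
        this.type = type;
    }

    /** Step 2: iterator() returns the iterator that matches the document type. */
    @Override
    public Iterator<String> iterator() {
        switch (type) {
            case CSV:
                // Iterator for the new type: one item per comma-separated field.
                return Arrays.asList(text.split("\\s*,\\s*")).iterator();
            case XML:
                throw new UnsupportedOperationException("XML is omitted from this sketch");
            default:
                // Plain text: one item per line.
                return Arrays.asList(text.split("\\r?\\n")).iterator();
        }
    }
}
```

Keeping the dispatch inside iterator() means callers keep using the same for-each loop regardless of the document type.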

NOTE: This implementation should not use external libraries since it is used in the parser.

Author:
Spence Green

Nested Class Summary
static class DocumentPreprocessor.DocType
           
 
Constructor Summary
DocumentPreprocessor(Reader input)
          Constructs a preprocessor from an existing input stream.
DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t)
           
DocumentPreprocessor(String docPath)
          Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL.
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t)
           
 
Method Summary
 Iterator<List<HasWord>> iterator()
          Returns sentences until the document is exhausted.
static void main(String[] args)
          This provides a simple test method for DocumentPreprocessor.
 void setElementDelimiter(String s)
          Only read text from between these XML tokens if in XML mode.
 void setEncoding(String encoding)
          Set the character encoding.
 void setEscaper(Function<List<HasWord>,List<HasWord>> e)
          Set an escaper.
 void setSentenceDelimiter(String s)
          Make the processor assume that the document is already delimited by the supplied parameter.
 void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
          Sets the end-of-sentence delimiters.
 void setTagDelimiter(String s)
          Split POS tags from tokens.
 void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
          Sets the factory from which to produce a Tokenizer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(Reader input)
Constructs a preprocessor from an existing input stream.

Parameters:
input - An existing reader

DocumentPreprocessor

public DocumentPreprocessor(Reader input,
                            DocumentPreprocessor.DocType t)

DocumentPreprocessor

public DocumentPreprocessor(String docPath)
Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL.

Parameters:
docPath - Path to the document, either a filesystem location or a URL

DocumentPreprocessor

public DocumentPreprocessor(String docPath,
                            DocumentPreprocessor.DocType t)

Method Detail

setEncoding

public void setEncoding(String encoding)
                 throws IllegalCharsetNameException
Set the character encoding.

Parameters:
encoding - The character encoding used by Readers
Throws:
IllegalCharsetNameException - If the JVM does not support the named character set.
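In java.nio.charset, an illegal charset name and a legal but unsupported charset are separate conditions. The sketch below shows how a caller might check an encoding name before passing it to setEncoding. EncodingCheckSketch and classify are hypothetical names, and the sketch makes no claim about how setEncoding itself validates its argument.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

/** Hypothetical pre-check helper; not part of DocumentPreprocessor. */
final class EncodingCheckSketch {
    private EncodingCheckSketch() {}

    /** Returns "supported", "unsupported", or "illegal name" for the given encoding. */
    static String classify(String encoding) {
        try {
            // isSupported returns false for a well-formed name the JVM does not provide.
            return Charset.isSupported(encoding) ? "supported" : "unsupported";
        } catch (IllegalCharsetNameException e) {
            // Thrown when the name itself contains characters that are not allowed.
            return "illegal name";
        }
    }
}
```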

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
Sets the end-of-sentence delimiters.

Parameters:
sentenceFinalPuncWords - The tokens that mark the end of a sentence
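As a rough illustration of what end-of-sentence delimiters do, the sketch below groups a token stream into sentences, closing a sentence whenever a delimiter token is seen. SentenceSplitSketch is a hypothetical helper, String stands in for HasWord, and the real class may treat edge cases (such as closing quotes after a period) differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; a simplified stand-in for sentence splitting on final punctuation. */
final class SentenceSplitSketch {
    private SentenceSplitSketch() {}

    static List<List<String>> split(List<String> tokens, String[] sentenceFinalPuncWords) {
        Set<String> finals = new HashSet<>(Arrays.asList(sentenceFinalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (finals.contains(token)) {
                // The delimiter token stays with the sentence it ends.
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // Trailing tokens without a final delimiter form the last sentence.
            sentences.add(current);
        }
        return sentences;
    }
}
```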

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.

NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
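For a document that is already tokenized, the remaining work is a split on whitespace. A minimal sketch of that behavior follows. WhitespaceSketch is a hypothetical helper that returns plain strings rather than HasWord objects, and it is not the code DocumentPreprocessor runs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Stanford NLP: splits pre-tokenized text on whitespace. */
final class WhitespaceSketch {
    private WhitespaceSketch() {}

    /** Returns the whitespace-separated tokens of the line, skipping empty strings. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```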


setEscaper

public void setEscaper(Function<List<HasWord>,List<HasWord>> e)
Set an escaper.

Parameters:
e - The escaper
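An escaper is a function from one sentence to another. The sketch below shows the shape of such a function, rewriting parentheses to the Penn Treebank tokens -LRB- and -RRB-. It makes two assumptions: it uses java.util.function.Function, whereas the Function type in the signature above may be the library's own interface, and it uses String in place of HasWord. EscaperSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical escaper over plain strings; illustrates the shape of the function only. */
final class EscaperSketch implements Function<List<String>, List<String>> {
    @Override
    public List<String> apply(List<String> sentence) {
        List<String> escaped = new ArrayList<>(sentence.size());
        for (String word : sentence) {
            if (word.equals("(")) {
                escaped.add("-LRB-");   // Penn Treebank token for a left round bracket
            } else if (word.equals(")")) {
                escaped.add("-RRB-");   // Penn Treebank token for a right round bracket
            } else {
                escaped.add(word);
            }
        }
        return escaped;
    }
}
```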

setSentenceDelimiter

public void setSentenceDelimiter(String s)
Make the processor assume that the document is already delimited by the supplied parameter.

Parameters:
s - The sentence delimiter
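When the document is already split into sentences, for example one sentence per line, sentence detection reduces to cutting the text at the delimiter. The sketch below illustrates that idea with plain strings. DelimitedSentencesSketch is a hypothetical helper, and treating the delimiter as a literal string rather than a pattern is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; splits pre-delimited text into whitespace-tokenized sentences. */
final class DelimitedSentencesSketch {
    private DelimitedSentencesSketch() {}

    static List<List<String>> split(String text, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        // Pattern.quote makes split() treat the delimiter literally, not as a regex.
        for (String chunk : text.split(Pattern.quote(sentenceDelimiter))) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(Arrays.asList(trimmed.split("\\s+")));
            }
        }
        return sentences;
    }
}
```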

setTagDelimiter

public void setTagDelimiter(String s)
Split POS tags from tokens.

Parameters:
s - POS tag delimiter
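For input such as dog/NN, a tag delimiter of "/" separates the word from its part-of-speech tag. The sketch below shows one way to do the split. TagSplitSketch is a hypothetical helper, and splitting at the last occurrence of the delimiter (so that 1/2/CD yields the word 1/2) is an assumption, not documented behavior.

```java
/** Hypothetical helper; separates a token such as "dog/NN" into word and tag. */
final class TagSplitSketch {
    private TagSplitSketch() {}

    /** Returns {word, tag}; tag is null when the token carries no usable delimiter. */
    static String[] split(String token, String delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (delimiter.isEmpty() || idx <= 0) {
            // No delimiter, or nothing before it: leave the token untagged.
            return new String[] { token, null };
        }
        String word = token.substring(0, idx);
        String tag = token.substring(idx + delimiter.length());
        return new String[] { word, tag };
    }
}
```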

setElementDelimiter

public void setElementDelimiter(String s)
Only read text from between these XML tokens if in XML mode. Otherwise, text is read from all tokens.


iterator

public Iterator<List<HasWord>> iterator()
Returns sentences until the document is exhausted. Calls close() if the end of the document is reached. Otherwise, the user is required to close the stream.

Specified by:
iterator in interface Iterable<List<HasWord>>
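The close-on-exhaustion contract described above can be illustrated with a small line iterator. ClosingLineIterator is a hypothetical class written for this example and returns lines rather than sentences. It closes the underlying reader only when the end of input is reached, so a caller that stops early still has to close the stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical iterator that closes its reader once the input is exhausted. */
final class ClosingLineIterator implements Iterator<String> {
    private final BufferedReader reader;
    private String next;

    ClosingLineIterator(Reader input) {
        this.reader = new BufferedReader(input);
        advance();
    }

    /** Reads one line ahead; closes the reader when there is nothing left. */
    private void advance() {
        try {
            next = reader.readLine();
            if (next == null) {
                reader.close(); // end of document reached: release the stream
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String line = next;
        advance();
        return line;
    }
}
```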

main

public static void main(String[] args)
This provides a simple test method for DocumentPreprocessor.
Usage: java DocumentPreprocessor -file filename [-xml tag] [-suppressEscaping] [-noTokenization]

A filename is required. The code doesn't run as a filter currently.

tag is the element name of the XML from which to extract text. It can be a regular expression which is called on the element with the matches() method, such as 'TITLE|P'.

Parameters:
args - Command-line arguments
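Because the tag is tested with matches(), the pattern must match the whole element name, not just part of it. The sketch below demonstrates this with the 'TITLE|P' example. ElementMatchSketch and selects are hypothetical names, and the sketch shows only the regular-expression semantics, not how the class walks the XML.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; shows whole-name matching of an element against a tag pattern. */
final class ElementMatchSketch {
    private ElementMatchSketch() {}

    /** Returns true when the entire element name matches the tag pattern. */
    static boolean selects(String tagPattern, String elementName) {
        // matches() anchors at both ends, unlike find(), which accepts substrings.
        return Pattern.compile(tagPattern).matcher(elementName).matches();
    }
}
```

With the pattern TITLE|P, the elements TITLE and P are selected, while PRE and SUBTITLE are not, since neither name matches the pattern in full.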


Stanford NLP Group