edu.stanford.nlp.process
Class WordToSentenceProcessor

java.lang.Object
  |
  +--edu.stanford.nlp.process.WordToSentenceProcessor
All Implemented Interfaces:
Processor

public class WordToSentenceProcessor
extends Object
implements Processor

Transforms a Document of Words into a Document of Sentences by grouping the Words.

Author:
Joseph Smarr (jsmarr@stanford.edu), Christopher Manning, Teg Grenager (grenager@stanford.edu)

Constructor Summary
WordToSentenceProcessor()
          Creates a WordToSentenceProcessor using a sensible default list of tokens to split on.
WordToSentenceProcessor(Set boundaryTokens)
          Creates a WordToSentenceProcessor that splits on the given set of sentence boundary tokens.
WordToSentenceProcessor(Set boundaryTokens, Set boundaryFollowers)
          Creates a WordToSentenceProcessor with the given set of sentence boundary tokens and the given set of tokens that commonly follow sentence boundaries.
WordToSentenceProcessor(Set boundaryTokens, Set boundaryFollowers, Set boundaryToDiscard)
          Creates a WordToSentenceProcessor with the given set of sentence boundary tokens, the given set of tokens that commonly follow sentence boundaries, and the given set of tokens that are sentence boundaries but should be discarded.
 
Method Summary
static void main(String[] args)
          Reads text from a file or URL and prints it out as sentences.
 Document process(Document words)
          Returns a new Document where each element is a Sentence built from a run of Words in the input Document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordToSentenceProcessor

public WordToSentenceProcessor()
Creates a WordToSentenceProcessor using a sensible default list of tokens to split on.


WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens)
Creates a WordToSentenceProcessor that splits on the given set of sentence boundary tokens.


WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens,
                               Set boundaryFollowers)
Creates a WordToSentenceProcessor with the given set of sentence boundary tokens and the given set of tokens that commonly follow sentence boundaries.


WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens,
                               Set boundaryFollowers,
                               Set boundaryToDiscard)
Creates a WordToSentenceProcessor with the given set of sentence boundary tokens, the given set of tokens that commonly follow sentence boundaries, and the given set of tokens that are sentence boundaries but should be discarded.
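
The roles of the three sets can be illustrated with a small standalone sketch. It does not use or reproduce this class; BoundarySetsSketch, its split method, and the "*NL*" newline marker are hypothetical names chosen for illustration. It assumes that a boundary follower (such as a closing parenthesis) is attached to the sentence that was just ended, and that a token in the discard set ends the current sentence without being kept. The actual behavior of this class may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundarySetsSketch {

    // Illustrative only: groups plain String tokens, not Word objects.
    public static List<List<String>> split(List<String> tokens,
                                           Set<String> boundaryTokens,
                                           Set<String> boundaryFollowers,
                                           Set<String> boundaryToDiscard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean pendingClose = false; // true once a boundary token has been seen

        for (String token : tokens) {
            if (pendingClose) {
                if (boundaryFollowers.contains(token)) {
                    // Assumption: followers stay with the sentence being closed.
                    current.add(token);
                    continue;
                }
                // First non-follower token after a boundary: close the sentence.
                sentences.add(current);
                current = new ArrayList<>();
                pendingClose = false;
            }
            if (boundaryToDiscard.contains(token)) {
                // Assumption: ends the sentence, but the token itself is dropped.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            current.add(token);
            if (boundaryTokens.contains(token)) {
                pendingClose = true;
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // end of input also ends a sentence
        }
        return sentences;
    }

    public static void main(String[] args) {
        Set<String> boundaries = new HashSet<>(Arrays.asList(".", "?", "!"));
        Set<String> followers = new HashSet<>(Arrays.asList(")", "\""));
        Set<String> discard = new HashSet<>(Arrays.asList("*NL*")); // hypothetical marker
        List<String> tokens = Arrays.asList(
                "He", "said", "(", "yes", ".", ")", "*NL*", "Fine", "!");
        for (List<String> sentence : split(tokens, boundaries, followers, discard)) {
            System.out.println(String.join(" ", sentence));
        }
    }
}
```

In this sketch the closing parenthesis stays with the first sentence and the "*NL*" marker appears in neither sentence.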

Method Detail

process

public Document process(Document words)
Returns a new Document where each element is a Sentence built from a run of Words in the input Document. Specifically, it reads through each word in the input document and breaks off a sentence after finding a valid sentence boundary token or reaching the end of the document. Note that for this to work, the words in the input document must have been produced by a tokenizer that makes sentence boundary tokens their own tokens (e.g., PTBTokenizer).
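
The grouping described above, and the reason the tokenizer must emit boundary tokens separately, can be sketched without the library. The class SentenceSplitSketch and its members below are hypothetical, it works on plain strings rather than Word and Sentence objects, and its boundary set is an assumed stand-in for the real defaults, which this page does not list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceSplitSketch {

    // Assumed stand-in for the default boundary list (not taken from the library).
    public static final Set<String> BOUNDARIES =
            new HashSet<>(Arrays.asList(".", "?", "!"));

    // Walks the tokens in order and closes a sentence after each boundary token.
    public static List<List<String>> split(List<String> tokens, Set<String> boundaries) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (boundaries.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing words with no final boundary token
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Punctuation tokenized separately: three sentences are found.
        List<String> separated =
                Arrays.asList("It", "works", ".", "Does", "it", "?", "Yes");
        System.out.println(split(separated, BOUNDARIES).size()); // prints 3

        // Punctuation glued to words: no token matches, so nothing is split.
        List<String> glued = Arrays.asList("It", "works.", "Does", "it?", "Yes");
        System.out.println(split(glued, BOUNDARIES).size()); // prints 1
    }
}
```

The second call shows the failure mode the note warns about: a token such as "works." never equals the boundary token ".", so the whole input comes back as a single sentence.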

Specified by:
process in interface Processor
See Also:
#sentenceBoundaryTokens, Sentence, PTBTokenizer

main

public static void main(String[] args)
Reads text from a file or URL and prints it out as sentences. It can be used to test sentence division.
Usage: java edu.stanford.nlp.process.WordToSentenceProcessor fileOrUrl

Parameters:
args - Command line argument: a file or URL


Stanford NLP Group