edu.stanford.nlp.process
Class WordToSentenceProcessor<IN>

java.lang.Object
  extended by edu.stanford.nlp.process.WordToSentenceProcessor<IN>
Type Parameters:
IN - The type of the tokens in the sentences
All Implemented Interfaces:
ListProcessor<IN,List<IN>>

public class WordToSentenceProcessor<IN>
extends Object
implements ListProcessor<IN,List<IN>>

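Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to already be adequately tokenized; this class just divides the list into sentences, perhaps discarding some separator tokens, based on the setting of three sets: the sentence boundary tokens, the tokens that commonly follow a sentence boundary, and the sentence boundary tokens that should be discarded (see the three-argument constructor).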

See DocumentPreprocessor for a class with a main method that will call this and cut a text file up into sentences.

Author:
Joseph Smarr (jsmarr@stanford.edu), Christopher Manning, Teg Grenager (grenager@stanford.edu), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)

Constructor Summary
WordToSentenceProcessor()
          Create a WordToSentenceProcessor using a sensible default list of tokens to split on.
WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern)
           
WordToSentenceProcessor(Set<String> boundaryTokens)
          Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding).
WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers)
          Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens.
WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers, Set<String> boundaryToDiscard)
          Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded.
 
Method Summary
 void addHtmlSentenceBoundaryToDiscard(Set<String> set)
           
 List<List<IN>> process(List<? extends IN> words)
          Returns a List of Sentences where each element is built from a run of Words in the input Document.
<L,F> Document<L,F,List<IN>>
processDocument(Document<L,F,IN> in)
           
 void setSentenceBoundaryToDiscard(Set<String> set)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordToSentenceProcessor

public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on. The default set is: {".","?","!"}.


WordToSentenceProcessor

public WordToSentenceProcessor(Set<String> boundaryTokens)
Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding). The allowed set of boundary followers is: {")","]","\"","\'", "''", "-RRB-", "-RSB-", "-RCB-"}.

Parameters:
boundaryTokens - The set of boundary tokens

WordToSentenceProcessor

public WordToSentenceProcessor(Set<String> boundaryTokens,
                               Set<String> boundaryFollowers)
Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens. The default set of discarded separator tokens is: {"\n"}.


WordToSentenceProcessor

public WordToSentenceProcessor(Set<String> boundaryTokens,
                               Set<String> boundaryFollowers,
                               Set<String> boundaryToDiscard)
Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded.


WordToSentenceProcessor

public WordToSentenceProcessor(Pattern regionBeginPattern,
                               Pattern regionEndPattern)


Method Detail

setSentenceBoundaryToDiscard

public void setSentenceBoundaryToDiscard(Set<String> set)

addHtmlSentenceBoundaryToDiscard

public void addHtmlSentenceBoundaryToDiscard(Set<String> set)

process

public List<List<IN>> process(List<? extends IN> words)
Returns a List of Sentences where each element is built from a run of Words in the input Document. Specifically, reads through each word in the input document and breaks off a sentence after finding a valid sentence boundary token or end of file. Note that for this to work, the words in the input document must have been tokenized with a tokenizer that makes sentence boundary tokens their own tokens (e.g., PTBTokenizer).

Specified by:
process in interface ListProcessor<IN,List<IN>>
Parameters:
words - A list of already tokenized words (must implement HasWord or be a String)
Returns:
A list of sentences, each a List of the input tokens
See Also:
WordToSentenceProcessor(Set, Set, Set), Sentence
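
The grouping described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class name SentenceSplitSketch, the method split, and the exact handling of follower and discarded tokens are assumptions made for illustration, and tokens are plain Strings rather than HasWord objects. It only shows how the three sets (boundary tokens, boundary followers, boundary tokens to discard) might interact while scanning a token list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Stanford NLP source code.
final class SentenceSplitSketch {

    /** Groups tokens into sentences using the three token sets. */
    static List<List<String>> split(List<String> tokens,
                                    Set<String> boundaryTokens,
                                    Set<String> boundaryFollowers,
                                    Set<String> boundaryToDiscard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        List<String> last = null; // most recently closed sentence, if any

        for (String token : tokens) {
            if (boundaryFollowers.contains(token) && last != null && current.isEmpty()) {
                // Assumption: a follower seen right after a boundary
                // is appended to the sentence that was just closed.
                last.add(token);
                continue;
            }
            boolean endOfSentence;
            if (boundaryToDiscard.contains(token)) {
                // Separator: ends the sentence but is not kept.
                endOfSentence = true;
            } else {
                current.add(token);
                endOfSentence = boundaryTokens.contains(token);
            }
            if (endOfSentence && !current.isEmpty()) {
                sentences.add(current);
                last = current;
                current = new ArrayList<>();
            }
        }
        // End of input closes any sentence still in progress.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("He", "said", "``", "Go", ".", "''", "\n", "She", "left");
        Set<String> boundaries = Set.of(".", "?", "!");
        Set<String> followers =
            Set.of(")", "]", "\"", "'", "''", "-RRB-", "-RSB-", "-RCB-");
        Set<String> discard = Set.of("\n");
        System.out.println(split(tokens, boundaries, followers, discard));
    }
}
```

In this sketch the closing quote token after the period stays with the first sentence, the newline token is dropped, and the trailing tokens form a second sentence because the end of the input closes it. The sets passed in main mirror the defaults documented for the constructors above.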

processDocument

public <L,F> Document<L,F,List<IN>> processDocument(Document<L,F,IN> in)


Stanford NLP Group