edu.stanford.nlp.process
Class WordToSentenceProcessor<IN>

java.lang.Object
  extended by edu.stanford.nlp.process.WordToSentenceProcessor<IN>
Type Parameters:
IN - The type of the tokens in the sentences
All Implemented Interfaces:
ListProcessor<IN,List<IN>>

public class WordToSentenceProcessor<IN>
extends Object
implements ListProcessor<IN,List<IN>>

Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to already be adequately tokenized; this class just divides the list into sentences, perhaps discarding some separator tokens, based on the settings of three sets: the sentence boundary tokens (tokens such as "." that end a sentence and are kept in it), the sentence boundary followers (tokens such as ")" that may follow a boundary token while still belonging to the preceding sentence), and the sentence boundary tokens to discard (separator tokens such as "\n" that are dropped from the output).

See DocumentPreprocessor for a class with a main method that will call this and cut a text file up into sentences.
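
A typical use, not spelled out above, is to tokenize raw text and then group the resulting tokens into sentences. The sketch below is an illustration under that assumption; it uses PTBTokenizer output and the default settings of this class.

  import java.io.StringReader;
  import java.util.List;
  import edu.stanford.nlp.ling.Word;
  import edu.stanford.nlp.process.PTBTokenizer;
  import edu.stanford.nlp.process.WordToSentenceProcessor;

  public class SentenceSplitDemo {
    public static void main(String[] args) {
      String text = "Dr. Smith arrived. He was late!";
      // Tokenize so that sentence-final punctuation becomes its own token.
      List<Word> tokens =
          PTBTokenizer.newPTBTokenizer(new StringReader(text)).tokenize();
      WordToSentenceProcessor<Word> splitter = new WordToSentenceProcessor<Word>();
      List<List<Word>> sentences = splitter.process(tokens);
      // With the default settings this should yield two sentences; "Dr." stays
      // inside the first sentence because its abbreviation period is not a
      // separate token.
      System.out.println(sentences.size() + " sentences: " + sentences);
    }
  }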

Author:
Joseph Smarr (jsmarr@stanford.edu), Christopher Manning, Teg Grenager (grenager@stanford.edu), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)

Constructor Summary
WordToSentenceProcessor()
          Create a WordToSentenceProcessor using a sensible default list of tokens to split on.
WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern)
           
WordToSentenceProcessor(String boundaryTokenRegex)
          Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding).
WordToSentenceProcessor(String boundaryTokenRegex, Set<String> boundaryFollowers)
          Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens.
WordToSentenceProcessor(String boundaryTokenRegex, Set<String> boundaryFollowers, Set<String> boundaryToDiscard)
          Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of sentence boundary tokens that should be discarded.
 
Method Summary
 void addHtmlSentenceBoundaryToDiscard(Set<String> set)
           
 boolean isOneSentence()
           
 List<List<IN>> process(List<? extends IN> words)
          Take a List (including a Sentence) of input, and return a List that has been processed in some way.
<L,F> Document<L,F,List<IN>> processDocument(Document<L,F,IN> in)
           
 void setOneSentence(boolean oneSentence)
           
 void setSentenceBoundaryToDiscard(Set<String> regexSet)
           
 List<List<IN>> wordsToSentences(List<? extends IN> words)
          Returns a List of Lists where each element is built from a run of Words in the input Document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordToSentenceProcessor

public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!
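
As a minimal sketch of the default behaviour on pre-tokenized String input (the token values are illustrative, the combined token "?!" must already have been emitted as a single token by the upstream tokenizer, and java.util imports are assumed):

  WordToSentenceProcessor<String> splitter = new WordToSentenceProcessor<String>();
  List<List<String>> sentences = splitter.process(
      Arrays.asList("Really", "?!", "Yes", "."));
  // Expected: [[Really, ?!], [Yes, .]] -- "?!" matches the default boundary set.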


WordToSentenceProcessor

public WordToSentenceProcessor(String boundaryTokenRegex)
Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding). The allowed set of boundary followers is: {")","]","\"","\'", "''", "-RRB-", "-RSB-", "-RCB-"}.

Parameters:
boundaryTokenRegex - The set of boundary tokens
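
For instance, to additionally treat the ideographic full stop as a sentence boundary while keeping the default follower set, one might pass a pattern such as the following (the regex is an illustration, not a value defined by this class):

  WordToSentenceProcessor<String> splitter =
      new WordToSentenceProcessor<String>("\\.|[!?]+|。");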

WordToSentenceProcessor

public WordToSentenceProcessor(String boundaryTokenRegex,
                               Set<String> boundaryFollowers)
Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens. The default set of discarded separator tokens is: {"\n"}.


WordToSentenceProcessor

public WordToSentenceProcessor(String boundaryTokenRegex,
                               Set<String> boundaryFollowers,
                               Set<String> boundaryToDiscard)
Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of sentence boundary tokens that should be discarded.
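
An illustrative configuration of all three sets; the specific values are examples rather than defaults of this class (java.util imports assumed):

  Set<String> followers = new HashSet<String>(
      Arrays.asList(")", "]", "\"", "''", "-RRB-", "-RSB-", "-RCB-"));
  Set<String> discard = new HashSet<String>(
      Arrays.asList("\n", "<p>", "<br>"));
  WordToSentenceProcessor<String> splitter =
      new WordToSentenceProcessor<String>("\\.|[!?]+", followers, discard);
  // Separator tokens in the discard set are dropped from the output sentences.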


WordToSentenceProcessor

public WordToSentenceProcessor(Pattern regionBeginPattern,
                               Pattern regionEndPattern)
Method Detail

setSentenceBoundaryToDiscard

public void setSentenceBoundaryToDiscard(Set<String> regexSet)

isOneSentence

public boolean isOneSentence()

setOneSentence

public void setOneSentence(boolean oneSentence)
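
The effect of one-sentence mode is not documented above; presumably, when it is enabled, process returns the entire input as a single sentence. A hedged sketch under that assumption (java.util imports assumed):

  WordToSentenceProcessor<String> splitter = new WordToSentenceProcessor<String>();
  splitter.setOneSentence(true);
  List<List<String>> sentences = splitter.process(
      Arrays.asList("One", ".", "Two", "."));
  // If one-sentence mode suppresses splitting, this is one sentence of four tokens.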

addHtmlSentenceBoundaryToDiscard

public void addHtmlSentenceBoundaryToDiscard(Set<String> set)

process

public List<List<IN>> process(List<? extends IN> words)
Description copied from interface: ListProcessor
Take a List (including a Sentence) of input, and return a List that has been processed in some way.

Specified by:
process in interface ListProcessor<IN,List<IN>>

wordsToSentences

public List<List<IN>> wordsToSentences(List<? extends IN> words)
Returns a List of Lists where each element is built from a run of Words in the input Document. Specifically, reads through each word in the input document and breaks off a sentence after finding a valid sentence boundary token or end of file. Note that for this to work, the words in the input document must have been tokenized with a tokenizer that makes sentence boundary tokens their own tokens (e.g., PTBTokenizer).

Parameters:
words - A list of already tokenized words (must implement HasWord or be a String)
Returns:
A list of Sentence
See Also:
WordToSentenceProcessor(String, Set, Set, Pattern, Pattern)
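
For illustration, the method can also be called directly on manually constructed Word tokens (the tokens below are made up; java.util and edu.stanford.nlp.ling.Word imports assumed):

  List<Word> tokens = Arrays.asList(
      new Word("It"), new Word("works"), new Word("."),
      new Word("Does"), new Word("it"), new Word("?"));
  List<List<Word>> sentences =
      new WordToSentenceProcessor<Word>().wordsToSentences(tokens);
  // Expected: two sentences, each ending with its boundary token.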

processDocument

public <L,F> Document<L,F,List<IN>> processDocument(Document<L,F,IN> in)

