java.lang.Object
  edu.stanford.nlp.process.WordToSentenceProcessor<IN>
Type Parameters:
IN - The type of the tokens in the sentences
public class WordToSentenceProcessor<IN>
Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to already be adequately tokenized, and this class just divides the list into sentences, perhaps discarding some separator tokens based on the setting of the following three sets:
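- Sentence boundary tokens: tokens accepted as ending a sentence (see boundaryTokenRegex).
- Boundary followers: tokens that commonly follow a sentence boundary (see boundaryFollowers).
- Sentence boundaries to discard: tokens that mark a sentence boundary and are themselves discarded (see boundaryToDiscard), for example a '<p>' tag. If two of these follow each other, they are coalesced: no empty Sentence is output. The end-of-file is not represented in this Set, but the code behaves as if it were a member.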
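The coalescing rule for discarded boundaries can be illustrated with a small, self-contained sketch. It does not use or reproduce the WordToSentenceProcessor code: the class name `DiscardSplitSketch` and its `split` method are hypothetical, and tokens are assumed to be plain Strings compared by exact match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; this is NOT the CoreNLP implementation.
class DiscardSplitSketch {

    // Splits tokens on discard tokens. The discard tokens themselves are dropped,
    // and consecutive discard tokens never produce an empty sentence.
    static List<List<String>> split(List<String> tokens, Set<String> toDiscard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (toDiscard.contains(token)) {
                // Close the current sentence only if it has content (coalescing).
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        // The end of input is treated as if it were a discarded boundary.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", "world", "<p>", "<p>", "Bye");
        // prints [[Hello, world], [Bye]]
        System.out.println(split(tokens, Collections.singleton("<p>")));
    }
}
```

The two adjacent `<p>` tokens yield a single break, and the trailing tokens are closed off at the end of input without needing an explicit boundary token.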
Field Summary | |
---|---|
static java.util.Set<java.lang.String> | DEFAULT_BOUNDARY_FOLLOWERS |
static java.util.Set<java.lang.String> | DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD |
Constructor Summary | |
---|---|
WordToSentenceProcessor() | Create a WordToSentenceProcessor using a sensible default list of tokens to split on for English/Latin writing systems. |
WordToSentenceProcessor(java.lang.String boundaryTokenRegex) | Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary-following tokens and a default set of sentence boundary tokens to discard (based on English and Penn Treebank encoding). |
WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryFollowers, java.util.Set<java.lang.String> boundaryToDiscard) | Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and the set of tokens that are sentence boundaries and should be discarded. |
Method Summary | | |
---|---|---|
void | addHtmlSentenceBoundaryToDiscard(java.util.Set<java.lang.String> set) | |
boolean | allowEmptySentences() | |
boolean | isOneSentence() | |
java.util.List<java.util.List<IN>> | process(java.util.List<? extends IN> words) | Take a List (including a Sentence) of input, and return a List that has been processed in some way. |
<L,F> Document<L,F,java.util.List<IN>> | processDocument(Document<L,F,IN> in) | |
void | setAllowEmptySentences(boolean allowEmptySentences) | |
void | setOneSentence(boolean oneSentence) | |
void | setSentenceBoundaryToDiscard(java.util.Set<java.lang.String> regexSet) | |
java.util.List<java.util.List<IN>> | wordsToSentences(java.util.List<? extends IN> words) | Returns a List of Lists where each element is built from a run of Words in the input Document. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.util.Set<java.lang.String> DEFAULT_BOUNDARY_FOLLOWERS
public static final java.util.Set<java.lang.String> DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
Constructor Detail |
---|
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!.
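As a rough illustration, a default set like this can be written as a single regular expression of the kind a boundaryTokenRegex argument would take. The pattern `\.|[!?]+` below is an assumption chosen to match the description (a lone period, or any non-empty run of `!` and `?`), and is not confirmed as the value the class uses; `DefaultBoundarySketch` is a hypothetical helper.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; the pattern is an assumption, not taken from the library.
class DefaultBoundarySketch {

    // A single period, or any non-empty run of '!' and '?' characters.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");

    // True when the whole token is a sentence boundary under the assumed pattern.
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }
}
```

Because `matches()` requires the whole token to match, a token such as `..` or `Hi!` is not treated as a boundary under this pattern, while `?!` and `!!!?!?!?!!!?!!?!!!` are.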
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex)
Parameters:
boundaryTokenRegex - The set of boundary tokens
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryFollowers, java.util.Set<java.lang.String> boundaryToDiscard)
Method Detail |
---|
public void setSentenceBoundaryToDiscard(java.util.Set<java.lang.String> regexSet)
public boolean isOneSentence()
public void setOneSentence(boolean oneSentence)
public boolean allowEmptySentences()
public void setAllowEmptySentences(boolean allowEmptySentences)
public void addHtmlSentenceBoundaryToDiscard(java.util.Set<java.lang.String> set)
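public java.util.List<java.util.List<IN>> process(java.util.List<? extends IN> words)
Description copied from interface: ListProcessor
Take a List (including a Sentence) of input, and return a List that has been processed in some way.
Specified by:
process in interface ListProcessor<IN,java.util.List<IN>>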
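public java.util.List<java.util.List<IN>> wordsToSentences(java.util.List<? extends IN> words)
Returns a List of Lists where each element is built from a run of Words in the input Document. The words are expected to have been tokenized already (for example, by PTBTokenizer).
Parameters:
words - A list of already tokenized words (must implement HasWord or be a String)
See Also:
WordToSentenceProcessor(String, Set, Set, Pattern, Pattern)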
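To make the grouping concrete, here is a simplified sketch of how a boundary-token pattern and a set of boundary followers might interact when a token list is split. It is written under stated assumptions and is not the library's algorithm: `WordsToSentencesSketch` is a hypothetical class, tokens are plain Strings, followers are compared by exact match, and discard tokens are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Simplified sketch; this is NOT the CoreNLP implementation.
class WordsToSentencesSketch {

    static List<List<String>> wordsToSentences(List<String> words,
                                               Pattern boundaryToken,
                                               Set<String> boundaryFollowers) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean afterBoundary = false; // true once the current sentence has seen a boundary token
        for (String word : words) {
            // Any token that is not a follower starts a new sentence after a boundary.
            if (afterBoundary && !boundaryFollowers.contains(word)) {
                sentences.add(current);
                current = new ArrayList<>();
                afterBoundary = false;
            }
            current.add(word);
            if (boundaryToken.matcher(word).matches()) {
                afterBoundary = true;
            }
        }
        // The end of input closes the last sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("(", "Hi", "!", ")", "Bye", ".");
        // prints [[(, Hi, !, )], [Bye, .]]
        System.out.println(wordsToSentences(
            words, Pattern.compile("\\.|[!?]+"), Collections.singleton(")")));
    }
}
```

In this sketch the closing parenthesis stays attached to the sentence that the `!` ended, which is the role the boundary followers set plays in the constructor description.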
public <L,F> Document<L,F,java.util.List<IN>> processDocument(Document<L,F,IN> in)