java.lang.Object
  edu.stanford.nlp.process.WordToSentenceProcessor<IN>
IN - The type of the tokens in the sentences
public class WordToSentenceProcessor<IN>
Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to already be adequately tokenized, and this class just divides the list into sentences, perhaps discarding some separator tokens based on the setting of the following three sets: the sentence boundary tokens, the tokens that commonly follow sentence boundaries, and the sentence boundary tokens to discard.
A typical example of a boundary token to discard is a '<p>' tag. If two of these follow each other, they are
coalesced: no empty Sentence is output. The end-of-file is not
represented in this Set, but the code behaves as if it were a member.
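The discard-and-coalesce behavior described above can be illustrated with a small, library-independent sketch. This is not the Stanford implementation: the class `SentenceGroupingSketch` and its `split` method are hypothetical names, boundary tokens are assumed to stay in the sentence they end, and the set of boundary followers is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in only; NOT the WordToSentenceProcessor source.
final class SentenceGroupingSketch {

  /**
   * Groups tokens into sentences.
   * boundaryTokens: kept in the sentence and end it (assumed semantics).
   * toDiscard: end a sentence and are dropped; runs of them are coalesced.
   */
  static List<List<String>> split(List<String> tokens,
                                  Set<String> boundaryTokens,
                                  Set<String> toDiscard) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (String tok : tokens) {
      if (toDiscard.contains(tok)) {
        // The separator is thrown away. Flush only a non-empty sentence,
        // so consecutive separators never produce an empty sentence.
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
      } else {
        current.add(tok);
        if (boundaryTokens.contains(tok)) {
          sentences.add(current);
          current = new ArrayList<>();
        }
      }
    }
    // End of input is treated as if it were a discarded separator.
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "Hello", "world", ".", "<p>", "<p>", "No", "final", "period");
    Set<String> boundary = new HashSet<>(Arrays.asList(".", "?", "!"));
    Set<String> discard = new HashSet<>(Arrays.asList("<p>"));
    // prints [[Hello, world, .], [No, final, period]]
    System.out.println(split(tokens, boundary, discard));
  }
}
```

In the sample input the two adjacent '<p>' tokens produce no empty sentence, and the trailing words form a sentence even though no boundary token follows them.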
Constructor Summary | |
---|---|
WordToSentenceProcessor() | Create a WordToSentenceProcessor using a sensible default list of tokens to split on. |
WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern) | |
WordToSentenceProcessor(String boundaryTokens) | Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding). |
WordToSentenceProcessor(String boundaryTokens, Set<String> boundaryFollowers) | Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, and the set of discarded separator tokens. |
WordToSentenceProcessor(String boundaryTokens, Set<String> boundaryFollowers, Set<String> boundaryToDiscard) | Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded. |
Method Summary | | |
---|---|---|
void | addHtmlSentenceBoundaryToDiscard(Set<String> set) | |
boolean | isOneSentence() | |
List<List<IN>> | process(List<? extends IN> words) | Take a List (including a Sentence) of input, and return a List that has been processed in some way. |
<L,F> Document<L,F,List<IN>> | processDocument(Document<L,F,IN> in) | |
void | setOneSentence(boolean oneSentence) | |
void | setSentenceBoundaryToDiscard(Set<String> set) | |
List<List<IN>> | wordsToSentences(List<? extends IN> words) | Returns a List of Lists where each element is built from a run of Words in the input Document. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!
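One way to read the default set is as a regular expression over whole tokens. The sketch below assumes the pattern `\.|[!?]+`; that exact pattern, the class `DefaultBoundarySketch`, and the helper `isBoundary` are illustrative assumptions and are not taken from the library source.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the pattern below is an assumed rendering of the
// documented default set, not a value copied from the library.
final class DefaultBoundarySketch {

  // A single "." or any non-empty run made up only of "!" and "?".
  static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");

  // True when the entire token matches the assumed boundary pattern.
  static boolean isBoundary(String token) {
    return BOUNDARY.matcher(token).matches();
  }

  public static void main(String[] args) {
    System.out.println(isBoundary("."));                   // true
    System.out.println(isBoundary("!!!?!?!?!!!?!!?!!!"));  // true
    System.out.println(isBoundary("..."));                 // false
    System.out.println(isBoundary("word"));                // false
  }
}
```

Because `matches()` requires the whole token to match, a token such as "..." is not treated as a boundary under this assumed pattern.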
public WordToSentenceProcessor(String boundaryTokens)
boundaryTokens - The set of boundary tokens
public WordToSentenceProcessor(String boundaryTokens, Set<String> boundaryFollowers)
public WordToSentenceProcessor(String boundaryTokens, Set<String> boundaryFollowers, Set<String> boundaryToDiscard)
public WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern)
Method Detail |
---|
public void setSentenceBoundaryToDiscard(Set<String> set)
public boolean isOneSentence()
public void setOneSentence(boolean oneSentence)
public void addHtmlSentenceBoundaryToDiscard(Set<String> set)
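public List<List<IN>> process(List<? extends IN> words)
Take a List (including a Sentence) of input, and return a List that has been processed in some way. (Description from ListProcessor.)
Specified by: process in interface ListProcessor<IN,List<IN>>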
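public List<List<IN>> wordsToSentences(List<? extends IN> words)
Returns a List of Lists where each element is built from a run of Words in the input Document. The words must already have been tokenized (e.g., with PTBTokenizer).
words - A list of already tokenized words (must implement HasWord or be a String)
See also: WordToSentenceProcessor(String, Set, Set, Pattern, Pattern)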
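The requirement that each word be a String or implement HasWord can be pictured as a small lookup step that yields the text compared against the boundary sets. The following is a sketch under assumptions: the nested `HasWord` interface is a local stand-in that models only a `word()` accessor, and `TokenTextSketch`, `textOf`, and the choice of exception are hypothetical.

```java
// Illustrative sketch; not the library's own token handling.
final class TokenTextSketch {

  // Local stand-in for the library's HasWord interface; only word() is modeled.
  interface HasWord {
    String word();
  }

  // Returns the text of a token, accepting either of the two supported kinds.
  static String textOf(Object token) {
    if (token instanceof String) {
      return (String) token;
    }
    if (token instanceof HasWord) {
      return ((HasWord) token).word();
    }
    // Assumed failure mode for unsupported token types.
    throw new IllegalArgumentException(
        "Token must be a String or implement HasWord: " + token);
  }

  public static void main(String[] args) {
    HasWord wrapped = () -> "there";
    System.out.println(textOf("Hi"));     // Hi
    System.out.println(textOf(wrapped));  // there
  }
}
```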
public <L,F> Document<L,F,List<IN>> processDocument(Document<L,F,IN> in)