java.lang.Object
  edu.stanford.nlp.process.AbstractListProcessor<IN,List<IN>,L,F>
    edu.stanford.nlp.process.WordToSentenceProcessor<IN,L,F>

Type Parameters:
L - The type of the labels
F - The type of the features

public class WordToSentenceProcessor<IN,L,F>
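Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to already be adequately tokenized; this class only divides the list into sentences, possibly discarding some separator tokens, based on the setting of the following three sets:

- Sentence boundary tokens: tokens that are regarded as ending a sentence.
- Sentence boundary followers: tokens that commonly follow a sentence boundary.
- Sentence boundary tokens to discard: tokens that separate sentences and are thrown away, for example a '<p>' tag. If two of these follow each other, they are coalesced: no empty Sentence is output. The end-of-file is not represented in this Set, but the code behaves as if it were a member.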
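The interplay of the three sets can be illustrated with a small, self-contained sketch. This is not the library's implementation: the class name `SentenceSplitSketch` and the method `split` are hypothetical, the tokens are plain Strings, and the handling of followers (attaching them to the sentence that was just closed) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not the Stanford NLP code. */
final class SentenceSplitSketch {

    static List<List<String>> split(List<String> words,
                                    Set<String> boundaryTokens,
                                    Set<String> boundaryFollowers,
                                    Set<String> boundaryToDiscard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        // True right after a boundary token has closed a sentence.
        boolean attachFollowers = false;

        for (String word : words) {
            if (boundaryToDiscard.contains(word)) {
                // Separator: close any open sentence and drop the token itself.
                // Consecutive separators find nothing open, so no empty sentence appears.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                attachFollowers = false;
            } else if (attachFollowers && current.isEmpty() && boundaryFollowers.contains(word)) {
                // Assumption: a follower stays with the sentence that was just closed.
                sentences.get(sentences.size() - 1).add(word);
            } else {
                current.add(word);
                attachFollowers = boundaryTokens.contains(word);
                if (attachFollowers) {
                    // Boundary token is kept and ends the sentence.
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        // End of input behaves like a discarded separator.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Hello", "world", ".", ")", "<p>", "<p>", "Is", "it", "?", "Yes");
        Set<String> boundary = new HashSet<>(Arrays.asList(".", "?", "!"));
        Set<String> followers = new HashSet<>(Arrays.asList(")"));
        Set<String> discard = new HashSet<>(Arrays.asList("<p>"));
        // prints [[Hello, world, ., )], [Is, it, ?], [Yes]]
        System.out.println(split(words, boundary, followers, discard));
    }
}
```

In this sketch the two adjacent '<p>' tokens produce no empty sentence, and the trailing "Yes" is emitted as a sentence because the end of input is treated like a separator.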
Constructor Summary | |
---|---|
WordToSentenceProcessor() | Create a WordToSentenceProcessor using a sensible default list of tokens to split on. |
WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern) | |
WordToSentenceProcessor(Set<String> boundaryTokens) | Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens (based on English and Penn Treebank encoding). |
WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers) | Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, and the set of discarded separator tokens. |
WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers, Set<String> boundaryToDiscard) | Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded. |
Method Summary | |
---|---|
static void | main(String[] args): Prints some text as sentences. |
List<List<IN>> | process(List<IN> words): Returns a List of Sentences where each element is built from a run of Words in the input Document. |
Methods inherited from class edu.stanford.nlp.process.AbstractListProcessor |
---|
processDocument, processLists |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on. The default set is: {".","?","!"}.

public WordToSentenceProcessor(Set<String> boundaryTokens)
Parameters:
boundaryTokens - The set of boundary tokens

public WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers)

public WordToSentenceProcessor(Set<String> boundaryTokens, Set<String> boundaryFollowers, Set<String> boundaryToDiscard)

public WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern)
Method Detail |
---|
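public List<List<IN>> process(List<IN> words)
Returns a List of Sentences where each element is built from a run of Words in the input Document. The words must already be tokenized (e.g., by PTBTokenizer).
Parameters:
words - A list of already tokenized words (must implement HasWord or be a String)
See Also:
WordToSentenceProcessor(Set, Set, Set), Sentence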
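The requirement that each input element be a String or implement HasWord suggests that the processor needs the text of every token in order to compare it against its sets. A minimal sketch of that lookup step follows. The names `WordTextSketch`, `HasWordLike`, `wordOf`, and `wordsOf` are hypothetical; `HasWordLike` is a local stand-in that models only a `word()` accessor, and throwing an exception for other types is an assumption rather than documented library behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not the Stanford NLP code. */
final class WordTextSketch {

    /** Local stand-in for a HasWord-style token; only word() is modeled. */
    interface HasWordLike {
        String word();
    }

    /** Returns the text of a token that is either a String or a HasWordLike. */
    static String wordOf(Object token) {
        if (token instanceof String) {
            return (String) token;
        }
        if (token instanceof HasWordLike) {
            return ((HasWordLike) token).word();
        }
        // Assumption: unsupported token types are rejected outright.
        throw new IllegalArgumentException(
                "Expected a String or HasWordLike token, got: " + token);
    }

    /** Maps a mixed list of supported tokens to their word texts, in order. */
    static List<String> wordsOf(List<?> tokens) {
        List<String> result = new ArrayList<>();
        for (Object token : tokens) {
            result.add(wordOf(token));
        }
        return result;
    }

    public static void main(String[] args) {
        HasWordLike period = () -> ".";
        // prints [Hello, .]
        System.out.println(wordsOf(Arrays.asList("Hello", period)));
    }
}
```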

public static void main(String[] args)
Parameters:
args - Command line argument: files or URLs