public class ChineseDocumentToSentenceProcessor extends Object implements Serializable
| Constructor and Description |
|---|
| ChineseDocumentToSentenceProcessor() |
| ChineseDocumentToSentenceProcessor(String normalizationTableFile) |
| Modifier and Type | Method and Description |
|---|---|
| static List<String> | fromHTML(String inputString) Strip off HTML tags before processing. |
| static List<String> | fromPlainText(String contentString) |
| static List<String> | fromPlainText(String contentString, boolean segmented) |
| static void | main(String[] args) Usage: java ChineseDocumentToSentenceProcessor [-segmentIBM] -file filename [-encoding encoding] |
| String | normalization(String in) This should now become disused, and other people should call ChineseUtils directly! CDM June 2006. |
public ChineseDocumentToSentenceProcessor()
public ChineseDocumentToSentenceProcessor(String normalizationTableFile)
normalizationTableFile - A file listing character pairs for
normalization. Currently the normalization table must be in UTF-8.
If this parameter is null, the default normalization
of the zero-argument constructor is used.
public String normalization(String in)
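The constructor parameter above names a UTF-8 file of character pairs, but the file layout is not specified here. As a rough sketch only, assuming a hypothetical layout of one whitespace-separated pair per line, such a table could be loaded and applied as follows. The class NormalizationTableSketch and its methods loadTable and normalize are illustrative names, not part of this API, and the code is not the implementation behind normalization(String).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the implementation behind normalization(String).
final class NormalizationTableSketch {

  // Assumed (hypothetical) line layout: "<from> <to>", one pair per line.
  // Lines of a real file could be read with
  // Files.readAllLines(path, StandardCharsets.UTF_8).
  static Map<Character, Character> loadTable(List<String> lines) {
    Map<Character, Character> table = new HashMap<>();
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      // Skip blank or malformed lines rather than failing.
      if (parts.length == 2 && parts[0].length() == 1 && parts[1].length() == 1) {
        table.put(parts[0].charAt(0), parts[1].charAt(0));
      }
    }
    return table;
  }

  // Replace every character that has a table entry; copy the rest unchanged.
  // Works on single UTF-16 chars only, which is enough for this sketch.
  static String normalize(String in, Map<Character, Character> table) {
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); i++) {
      char c = in.charAt(i);
      Character mapped = table.get(c);
      out.append(mapped != null ? mapped.charValue() : c);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Full-width "A" (U+FF21) and full-width "1" (U+FF11) mapped to ASCII.
    List<String> lines = Arrays.asList("\uFF21 A", "\uFF11 1");
    Map<Character, Character> table = loadTable(lines);
    System.out.println(normalize("\uFF21\uFF11\u73ED", table));
  }
}
```

Non-ASCII characters are written as \u escapes so that the sketch compiles regardless of the source file encoding.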
public static void main(String[] args) throws IOException
The -segmentIBM option is for IBM GALE-specific splitting of an XML element into sentences.
Throws: IOException
public static List<String> fromHTML(String inputString) throws IOException
inputString - Chinese document text which contains HTML tags
Throws: IOException
public static List<String> fromPlainText(String contentString) throws IOException
contentString - Chinese document text
Throws: IOException
public static List<String> fromPlainText(String contentString, boolean segmented) throws IOException
Throws: IOException
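The static methods above return a document as a list of sentence strings, and fromHTML strips tags first. The splitting rules themselves are not documented here, so the following is only a minimal sketch of the general idea. It assumes that sentences end at 。, ！ or ？ and that tags can be removed with a simple regular expression. DocumentToSentencesSketch, stripTags and splitSentences are hypothetical names, and the segmented flag is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the implementation of fromHTML or fromPlainText.
final class DocumentToSentencesSketch {

  // Assumed sentence-final marks: ideographic full stop (U+3002),
  // full-width exclamation mark (U+FF01), full-width question mark (U+FF1F).
  private static final String SENTENCE_END = "\u3002\uFF01\uFF1F";

  // Naive tag removal: drops anything between '<' and '>'.
  // It does not decode entities or treat scripts and comments specially.
  static String stripTags(String html) {
    return html.replaceAll("<[^>]*>", "");
  }

  // Cut the text after each sentence-final mark; the mark stays with its sentence.
  static List<String> splitSentences(String text) {
    List<String> sentences = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      current.append(c);
      if (SENTENCE_END.indexOf(c) >= 0) {
        flush(current, sentences);
      }
    }
    flush(current, sentences); // trailing text without a final mark
    return sentences;
  }

  // Add the buffered text as a sentence unless it is blank, then reset the buffer.
  private static void flush(StringBuilder current, List<String> sentences) {
    String sentence = current.toString().trim();
    if (!sentence.isEmpty()) {
      sentences.add(sentence);
    }
    current.setLength(0);
  }

  public static void main(String[] args) {
    // Two short paragraphs, each holding one sentence.
    String html = "<p>\u4F60\u597D\u3002</p><p>\u4F60\u597D\u5417\uFF1F</p>";
    for (String sentence : splitSentences(stripTags(html))) {
      System.out.println(sentence);
    }
  }
}
```

Keeping tag stripping and sentence splitting as separate steps mirrors the split between fromHTML and fromPlainText in the method list, although whether the real class is organized this way is an assumption.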