public class ChineseDocumentToSentenceProcessor extends Object implements Serializable
Constructor and Description |
---|
ChineseDocumentToSentenceProcessor() |
ChineseDocumentToSentenceProcessor(String normalizationTableFile) |
Modifier and Type | Method and Description |
---|---|
static List<String> | fromHTML(String inputString) Strip off HTML tags before processing. |
static List<String> | fromPlainText(String contentString) |
static List<String> | fromPlainText(String contentString, boolean segmented) |
static void | main(String[] args) usage: java ChineseDocumentToSentenceProcessor [-segmentIBM] -file filename [-encoding encoding] |
String | normalization(String in) This should now become disused, and other people should call ChineseUtils directly! CDM June 2006. |
public ChineseDocumentToSentenceProcessor()
public ChineseDocumentToSentenceProcessor(String normalizationTableFile)
normalizationTableFile - A file listing character pairs for normalization. Currently the normalization table must be in UTF-8. If this parameter is null, the default normalization of the zero-argument constructor is used.
public String normalization(String in)
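To illustrate what a character-pair normalization does, here is a minimal sketch. It is not this class's implementation. The `NormalizationSketch` class, its `normalize` helper, and the in-memory table are hypothetical. The format of the real normalization table file and the direction of its mappings are not described here, so the fullwidth-to-ASCII pairs below are assumptions chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of ChineseDocumentToSentenceProcessor.
class NormalizationSketch {

    // Replaces every character that has an entry in the table and
    // leaves all other characters unchanged.
    static String normalize(String in, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            char mapped = table.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed example pairs: fullwidth Latin letter and digit to ASCII.
        Map<Character, Character> table = new HashMap<>();
        table.put('\uFF21', 'A'); // fullwidth A
        table.put('\uFF11', '1'); // fullwidth 1
        System.out.println(normalize("\uFF21\uFF11中文", table));
    }
}
```

As the method summary notes, new code is expected to call ChineseUtils directly rather than rely on `normalization(String)`.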
public static void main(String[] args) throws IOException
The -segmentIBM option is for IBM GALE-specific splitting of an XML element into sentences.
IOException
public static List<String> fromHTML(String inputString) throws IOException
inputString - Chinese document text which contains HTML tags
IOException
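The tag-stripping step that `fromHTML` performs before processing can be pictured with the sketch below. It is a crude stand-in using a regular expression, not the method's actual code; the real method may treat entities, comments, or script content differently. `HtmlStripSketch` and `stripTags` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of ChineseDocumentToSentenceProcessor.
class HtmlStripSketch {

    // Matches anything that looks like a tag: '<', any non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes tag-like spans and keeps the text between them.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>你好。<b>世界</b>！</p>"));
    }
}
```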
public static List<String> fromPlainText(String contentString) throws IOException
contentString - Chinese document text
IOException
public static List<String> fromPlainText(String contentString, boolean segmented) throws IOException
IOException
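The `fromPlainText` methods return one `String` per sentence. As a rough picture of that kind of output, the sketch below splits text after Chinese sentence-final punctuation. It is an assumption-laden illustration, not the library's algorithm: the class `SentenceSplitSketch`, the `split` helper, and the choice of `。！？` as the only boundaries are all hypothetical, and the effect of the `segmented` flag is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of ChineseDocumentToSentenceProcessor.
class SentenceSplitSketch {

    // Assumed sentence-final punctuation marks.
    private static final String SENTENCE_END = "。！？";

    // Splits text after each sentence-final mark; trailing text without
    // a final mark becomes the last sentence.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if (SENTENCE_END.indexOf(c) >= 0) {
                addIfNotBlank(sentences, current);
                current.setLength(0);
            }
        }
        addIfNotBlank(sentences, current);
        return sentences;
    }

    // Adds the trimmed buffer contents unless they are empty.
    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        for (String s : split("今天天气很好。你吃饭了吗？走吧")) {
            System.out.println(s);
        }
    }
}
```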