A tool which accepts raw AnCora-3.0 Spanish XML files and produces
normalized / pre-processed PTB-style treebanks for use with CoreNLP
tools.
It replaces what would otherwise be an awkward and complicated string
of command-line invocations. The corpus it produces is the standard
treebank used to train the CoreNLP Spanish models.
The preprocessing steps performed here include:
- Expansion and automatic tagging of multi-word tokens (see
  MultiWordPreprocessor,
  SpanishTreeNormalizer.normalizeForMultiWord(Tree, TreeFactory))
- Heuristic parsing of expanded multi-word tokens (see
  MultiWordTreeExpander)
- Splitting of elided forms (al, del, conmigo, etc.) and clitic
  pronouns from verb forms (see
  SpanishTreeNormalizer.expandElisions(Tree),
  SpanishTreeNormalizer.expandCliticPronouns(Tree))
- Miscellaneous cleanup of parse trees, spelling fixes, parsing
  error corrections (see SpanishTreeNormalizer)
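To make the elision-splitting step concrete, here is a minimal sketch of the idea at the token level. It is not the SpanishTreeNormalizer implementation, which operates on parse trees. The class name ElisionSketch, the lookup table, and the chosen expansions are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of elision splitting; not part of CoreNLP. */
final class ElisionSketch {

  // Illustrative table only (an assumption); the real normalizer's rules
  // and coverage may differ.
  private static final Map<String, List<String>> ELISIONS = new HashMap<>();
  static {
    ELISIONS.put("al", Arrays.asList("a", "el"));
    ELISIONS.put("del", Arrays.asList("de", "el"));
    // "con" plus "mi" with an accented i, written as a Unicode escape.
    ELISIONS.put("conmigo", Arrays.asList("con", "m\u00ed"));
  }

  /**
   * Returns the expanded parts of an elided token, or a single-element
   * list holding the token itself when it is not in the table.
   * Lookup is case-insensitive; expanded parts are always lowercase.
   */
  static List<String> expand(String token) {
    List<String> parts = ELISIONS.get(token.toLowerCase(Locale.ROOT));
    return parts != null ? parts : Collections.singletonList(token);
  }

  public static void main(String[] args) {
    for (String token : new String[] {"Voy", "al", "mercado", "del", "pueblo"}) {
      System.out.println(token + " -> " + expand(token));
    }
  }
}
```

A table lookup keeps the sketch short. Clitic pronouns attached to verb forms cannot be handled this way, since they need morphological analysis rather than a fixed list.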
Apart from raw corpus data, this processor depends upon unigram
part-of-speech tag data. If not provided explicitly to the
processor, the data will be collected from the given files. (You can
pre-compute POS data from AnCora XML using AnCoraPOSStats.)
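As a rough picture of what unigram part-of-speech tag data could look like, the sketch below counts how often each word appears with each tag and reports the most frequent tag per word. This is an assumption about the general shape of such data, not the format AnCoraPOSStats actually reads or writes. The class UnigramTagSketch, its methods, and the simplified tag names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of unigram tag statistics; not part of CoreNLP. */
final class UnigramTagSketch {

  // word -> (tag -> number of times the word was seen with that tag)
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  /** Records one observation of a word with a tag. */
  void observe(String word, String tag) {
    counts.computeIfAbsent(word, w -> new HashMap<>()).merge(tag, 1, Integer::sum);
  }

  /**
   * Returns the tag seen most often with the word, or null if the word
   * was never observed. Ties are broken arbitrarily.
   */
  String mostFrequentTag(String word) {
    Map<String, Integer> tagCounts = counts.get(word);
    if (tagCounts == null) {
      return null;
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> entry : tagCounts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    UnigramTagSketch stats = new UnigramTagSketch();
    // Simplified, made-up tags; AnCora uses a richer tag set.
    stats.observe("como", "CONJ");
    stats.observe("como", "CONJ");
    stats.observe("como", "VERB");
    System.out.println("como -> " + stats.mostFrequentTag("como"));
  }
}
```

Statistics of this kind give a context-free guess for the tag of a word, which is the sort of information needed when tagging the pieces of an expanded multi-word token.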
For invocation options, execute the class with no arguments.