edu.stanford.nlp.international.process
Class TreebankPreprocessor
java.lang.Object
edu.stanford.nlp.international.process.TreebankPreprocessor
public final class TreebankPreprocessor
- extends java.lang.Object
A data preparation pipeline for treebanks
A simple framework for preparing various kinds of treebank data. It was originally written to prepare
Penn Arabic Treebank (PATB) trees for parsing, and grew out of the need to prepare multiple data sets
in a uniform manner for experiments that require several tools. The design objectives are:
- Support multiple data input and output types
- Allow parameterization of data sets via a plain text file
- Support rapid, cheap lexical engineering
- Produce, as the end result of processing, a folder with all data sets and a manifest of how the data was prepared
These objectives are realized through three features:
- ConfigParser -- reads the plain text configuration file and creates configuration parameter objects for each data set
- Dataset interface -- generic interface for loading, processing, and writing data sets
- Mapper interface -- generic interface for applying transformations to strings (usually words and POS tags)
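To make the idea of a string-transformation interface concrete, here is a minimal sketch. It does not reproduce the library's actual Mapper interface, whose method signatures are not given on this page: StringMapper, BracketEscapeMapper, and MapperDemo are hypothetical names introduced only for illustration, and the two-argument map method is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a string-transformation interface; not the library's Mapper. */
interface StringMapper {
    /** Returns the transformed form of element; parent carries context such as a POS tag. */
    String map(String parent, String element);
}

/** Example mapper: rewrites round brackets as Penn Treebank style escapes. */
final class BracketEscapeMapper implements StringMapper {
    private final Map<String, String> table = new HashMap<>();

    BracketEscapeMapper() {
        table.put("(", "-LRB-");
        table.put(")", "-RRB-");
    }

    @Override
    public String map(String parent, String element) {
        // The parent tag is ignored in this sketch; unknown tokens pass through unchanged.
        return table.getOrDefault(element, element);
    }
}

final class MapperDemo {
    public static void main(String[] args) {
        StringMapper mapper = new BracketEscapeMapper();
        System.out.println(mapper.map("PUNC", "("));
        System.out.println(mapper.map("NN", "word"));
    }
}
```

Keeping each transformation behind a small interface like this is one way to support the "rapid, cheap lexical engineering" objective: a new normalization can be swapped in without touching the code that loads or writes the trees.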
The process for preparing arbitrary data set X is as follows:
- Add parameters to ConfigParser as necessary
- Implement the Dataset interface for the new data set (or use one of the existing classes)
- Implement Mapper classes as needed
- Specify the data set parameters in a plain text file
- Run TreebankPreprocessor using the plain text file as the argument
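The step of turning a plain text file into one parameter object per data set can be sketched as follows. The file syntax used here (KEY=VALUE lines, a blank line between data sets, # for comments) and the keys NAME and PATH are assumptions made for illustration; this page does not specify the format that ConfigParser actually accepts, and ConfigSketch is a hypothetical class, not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-data-set configuration parsing; not the library's ConfigParser. */
final class ConfigSketch {
    private ConfigSketch() {}

    /** Groups KEY=VALUE lines into one parameter map per data set (assumed syntax). */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> datasets = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                // A blank line closes the current data set, if one was started.
                if (!current.isEmpty()) {
                    datasets.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            if (line.startsWith("#")) {
                continue; // comment line (assumed convention)
            }
            int eq = line.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected KEY=VALUE: " + line);
            }
            current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        if (!current.isEmpty()) {
            datasets.add(current); // the last data set needs no trailing blank line
        }
        return datasets;
    }

    public static void main(String[] args) {
        // Illustrative input only; the names and paths are made up.
        List<String> lines = Arrays.asList(
            "NAME=train", "PATH=data/train", "", "NAME=dev", "PATH=data/dev");
        for (Map<String, String> params : parse(lines)) {
            System.out.println(params);
        }
    }
}
```

Each resulting map would then be handed to a Dataset implementation, which loads, transforms, and writes its data according to those parameters.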
- Author: Spence Green
Field Summary
static java.util.Map<java.lang.String,java.lang.Integer> optionArgDefs
Method Summary
static void main(java.lang.String[] args) -- Execute with no arguments for usage.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
optionArgDefs
public static final java.util.Map<java.lang.String,java.lang.Integer> optionArgDefs
main
public static void main(java.lang.String[] args)
- Execute with no arguments for usage.
Stanford NLP Group