ColumnDataClassifier (Stanford CoreNLP API)

java.lang.Object
- edu.stanford.nlp.classify.ColumnDataClassifier

public class ColumnDataClassifier
extends Object

ColumnDataClassifier provides a command-line interface for doing context-free (independent) classification of a series of data items, where each data item is represented by a line of a file, as a list of String variables, in tab-separated columns. Some features will interpret these variables as numbers, but the code is mainly oriented towards generating features for string classification. To designate a real-valued feature, use the realValued option described below. The classifier can be either a Bernoulli Naive Bayes model or a loglinear discriminative (i.e., maxent) model.

You can also use ColumnDataClassifier programmatically. Its main advantage over simply building your own LinearClassifier is that it provides easy conversion of data items into features, using the same properties as the command-line version.

Input files are expected to be one data item per line with two or more columns indicating the class of the item and one or more predictive features. Columns are separated by tab characters. Tab and newline characters cannot occur inside field values (there is no escaping mechanism); any other characters are legal in field values.
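
As a rough illustration of the column layout described above, a line can be split on tab characters with plain JDK calls. This is only a sketch, not ColumnDataClassifier's own reader; the class name `TsvLineSketch` and the sample line are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of CoreNLP.
final class TsvLineSketch {

    /** Splits one input line into its tab-separated columns, keeping trailing empty fields. */
    static String[] columns(String line) {
        // A limit of -1 keeps trailing empty columns, which plain split("\t") would drop.
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Column 0 holds the class, column 1 the String datum.
        String[] cols = columns("spam\tWin a FREE prize now");
        System.out.println(cols.length + " columns: " + Arrays.toString(cols));
    }
}
```

Because there is no escaping mechanism, a split like this never needs to handle quoted or escaped tabs.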

Typical usage:

java edu.stanford.nlp.classify.ColumnDataClassifier -prop propFile

java -mx300m edu.stanford.nlp.classify.ColumnDataClassifier -trainFile trainFile -testFile testFile [-useNGrams|...] > output

(Note that for large data sets, you may wish to specify the amount of memory available to Java, such as in the second example above.)

In the simplest case, there are just two tab-separated columns in the training input: the first for the class, and the second for the String datum which has that class. In more complex uses, each datum can be multidimensional, and there are many columns of data attributes.

To illustrate simple uses and the behavior of Naive Bayes and maximum entropy classifiers, the classify package source directory includes example files (names starting with "easy"). They correspond to the examples on slides 46-49 of the Manning and Klein maxent classifier tutorial, available at http://nlp.stanford.edu/downloads/classifier.shtml. Other examples appear in the examples directory of the distributed classifier.

In many instances, parameters can either be given on the command line or provided using a Properties file (specified on the command-line with -prop propFile). Option names are the same as property names with a preceding dash. Boolean properties can simply be turned on via the command line. Parameters of types int, String, and double take a following argument after the option. Command-line parameters can only define features for the first column describing the datum. If you have multidimensional data, you need to use a properties file. Property names, as below, are either global (things like the testFile name) or are seen as properties that define features for the first data column (we count columns from 0 - unlike the Unix cut command!). To specify features for a particular data column, precede a feature by a column number and then a period (for example, 3.wordShape=chris4). If no number is specified, then the default interpretation is column 0. Note that in properties files you must give a value to boolean properties (e.g., 2.useString=true); just giving the property name (as 2.useString) isn't sufficient.
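
The naming convention above can be sketched with `java.util.Properties`. This is a minimal illustration under stated assumptions, not the parsing code ColumnDataClassifier itself uses: the class `ColumnPropertySketch` and its methods are hypothetical, and the sketch does not distinguish global properties (such as testFile) from column 0 feature properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP.
final class ColumnPropertySketch {

    /** True if key starts with digits followed by a period, e.g. "3.wordShape". */
    static boolean hasColumnPrefix(String key) {
        int dot = key.indexOf('.');
        return dot > 0 && key.substring(0, dot).chars().allMatch(Character::isDigit);
    }

    /** Column number named by the prefix, or 0 when no prefix is given. */
    static int columnOf(String key) {
        return hasColumnPrefix(key) ? Integer.parseInt(key.substring(0, key.indexOf('.'))) : 0;
    }

    /** Property name with any column prefix removed. */
    static String featureOf(String key) {
        return hasColumnPrefix(key) ? key.substring(key.indexOf('.') + 1) : key;
    }

    /** Loads properties-file text from a String. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load("useNGrams=true\n2.useString\n3.wordShape=chris4\n");
        for (String key : props.stringPropertyNames()) {
            System.out.println("column " + columnOf(key) + ", feature " + featureOf(key)
                    + ", value '" + props.getProperty(key) + "'");
        }
    }
}
```

Note that `java.util.Properties` loads a bare key such as `2.useString` with an empty value rather than `true`, which fits the requirement to write `2.useString=true` in a properties file.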

The following properties are recognized:

Property Name	Type	Default Value	Description	FeatName
loadClassifier	String	n/a	Path of serialized classifier file to load
serializeTo	String	n/a	Path to serialize classifier to
printTo	String	n/a	Path to print a text representation of the linear classifier to
trainFile	String	n/a	Path of file to use as training data
testFile	String	n/a	Path of file to use as test data
encoding	String	utf-8	Character encoding of training and test file, e.g., utf-8, GB18030, or iso-8859-1
displayedColumn	int	1	Column number that will be printed out to stdout in the output next to the gold class and the chosen class. This is just an aide memoire. If the value is negative, nothing is printed.
displayAllAnswers	boolean	false	If true, print all classes and their probability, sorted by probability, rather than just the highest scoring and correct classes.
goldAnswerColumn	int	0	Column number that contains the correct class for each data item (again, columns are numbered from 0 up).
groupingColumn	int	-1	Column for grouping multiple data items for the purpose of computing ranking accuracy. This is appropriate when only one datum in a group can be correct, and the intention is to choose the highest probability one, rather than accepting all above a threshold. Multiple items in the same group must be contiguous in the test file (otherwise it would be necessary to cache probabilities and groups for the entire test file to check matches). If it is negative, no grouping column is used, and no ranking accuracy is reported.
rankingScoreColumn	int	-1	If this parameter is non-negative and a groupingColumn is defined, then an average ranking score will be calculated by scoring the chosen candidate from a group according to its value in this column (for instance, the values of this column can be set to a mean reciprocal rank of 1.0 for the best answer, 0.5 for the second best and so on, or the value of this column can be a similarity score reflecting the similarity of the answer to the true answer).
rankingAccuracyClass	String	null	If this is defined and groupingColumn is non-negative, then the system will compute a ranking accuracy under the assumption that there is (at most) one assignment of this class for each group, and ranking accuracy counts the classifier as right if that datum is the one with highest probability according to the model.
useString	boolean	false	Gives you a feature for whole string s	S-str
useClassFeature	boolean	false	Include a feature for the class (as a class marginal)	CLASS
binnedLengths	String	null	If non-null, treat as a sequence of comma separated integer bounds, where items above the previous bound (if any) up to the next bound (inclusive) are binned (e.g., "1,5,15,30,60"). The feature represents the length of the String in this column.	Len-range
binnedLengthsStatistics	boolean	false	If true, print to stderr contingency table of statistics for binnedLengths.
binnedValues	String	null	If non-null, treat as a sequence of comma separated double bounds, where data items above the previous bound up to the next bound (inclusive) are binned. If a value in this column isn't a legal `double`, then the value is treated as `binnedValuesNaN`.	Val-range
binnedValuesNaN	double	-1.0	If the value of a numeric binnedValues field is not a number, it will be given this value.
binnedValuesStatistics	boolean	false	If true, print to stderr a contingency table of statistics for binnedValues.
countChars	String	null	If non-null, count the number of occurrences of each character in the String, and make a feature for each character, binned according to `countCharsBins`	Char-ch-range
countCharsBins	String	"0,1"	Treat as a sequence of comma separated integer bounds, where character counts above the previous bound up to and including the next bound are binned. For instance, a value of "0,2" will give 3 bins, dividing a character count into bins of 0, 1-or-2, and 3-or-more occurrences.
splitWordsRegexp	String	null	If defined, use this as a regular expression on which to split the whole string (as in the String.split() function, which will return the things between delimiters, and discard the delimiters). The resulting split-up "words" will be used in classifier features iff one of the other "useSplit" options is turned on.
splitWordsTokenizerRegexp	String	null	If defined, use this as a regular expression to cut initial pieces off a String. Either this regular expression or `splitWordsIgnoreRegexp` should always match the start of the String, and the size of the token is the number of characters matched. So, for example, one can group letter and number characters but do nothing else with a regular expression like `([A-Za-z]+|[0-9]+|.)`, where the last disjunct will match any other single character. (If neither regular expression matches, the first character of the string is treated as a one-character word, and then matching is tried again, but in this case a warning message is printed.) Note that, for Java regular expressions with disjunctions like this, the match is the first matching disjunct, not the longest matching disjunct, so patterns with common prefixes need to be ordered from most specific (longest) to least specific (shortest). The resulting split-up "words" will be used in classifier features iff one of the other "useSplit" options is turned on. Note that as usual for Java String processing, backslashes must be doubled in the regular expressions that you write.
splitWordsIgnoreRegexp	String	\\s+	If non-empty, this regexp is used to determine character sequences which should not be returned as tokens when using `splitWordsTokenizerRegexp` or `splitWordsRegexp`. With the former, the program first attempts to match this regular expression at the start of the string (with `lookingAt()`); if it matches, those characters are discarded, and if it doesn't match then `splitWordsTokenizerRegexp` is tried. With `splitWordsRegexp`, this is used to filter the tokens resulting from the splitting (with `matches()`). By default this regular expression is set to match all whitespace tokens (i.e., \\s+). Set it to an empty string to get all tokens returned.
splitWordsWithPTBTokenizer	boolean	false	If true, and neither `splitWordsRegexp` nor `splitWordsTokenizerRegexp` is defined, then the string is tokenized using the `PTBTokenizer`.
useSplitWords	boolean	false	Make features from the "words" that are returned by dividing the string on splitWordsRegexp or splitWordsTokenizerRegexp. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	SW-str
useLowercaseSplitWords	boolean	false	Make features from the "words" that are returned by dividing the string on splitWordsRegexp or splitWordsTokenizerRegexp and then lowercasing the result. Requires splitWordsRegexp or splitWordsTokenizerRegexp. Note that this can be specified independently of useSplitWords. You can put either or both original cased and lowercased words in as features.	SW-str
useSplitWordPairs	boolean	false	Make features from the pairs of adjacent "words" that are returned by dividing the string into splitWords. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	SWP-str1-str2
useAllSplitWordPairs	boolean	false	Make features from all pairs of "words" that are returned by dividing the string into splitWords. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	ASWP-str1-str2
useAllSplitWordTriples	boolean	false	Make features from all triples of "words" that are returned by dividing the string into splitWords. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	ASWT-str1-str2-str3
useSplitWordNGrams	boolean	false	Make features of adjacent word n-grams of lengths between minWordNGramLeng and maxWordNGramLeng inclusive. Note that these are word sequences, not character n-grams.	SW#-str1-str2-strN
maxWordNGramLeng	int	-1	If this number is positive, word n-grams above this size will not be used in the model
minWordNGramLeng	int	1	Must be positive. Word n-grams below this size will not be used in the model.
wordNGramBoundaryRegexp	String	null	If this is defined and the regexp matches, then the ngram stops
useSplitFirstLastWords	boolean	false	Make a feature from each of the first and last "words" that are returned as splitWords. This is equivalent to having word bigrams with boundary tokens at each end of the sequence (they get a special feature). Requires splitWordsRegexp or splitWordsTokenizerRegexp.	SFW-str, SLW-str
useSplitNGrams	boolean	false	Make features from letter n-grams - internal as well as edge all treated the same - after the data string has been split into tokens. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	S#-str
useSplitPrefixSuffixNGrams	boolean	false	Make features from prefixes and suffixes of each token, after splitting string with splitWordsRegexp. Requires splitWordsRegexp or splitWordsTokenizerRegexp.	S#B-str, S#E-str
useNGrams	boolean	false	Make features from letter n-grams - internal as well as edge all treated the same.	#-str
usePrefixSuffixNGrams	boolean	false	Make features from prefix and suffix substrings of the string.	#B-str, #E-str
lowercase	boolean	false	Make the input string lowercase so all features work uncased
lowercaseNGrams	boolean	false	Make features from letter n-grams all lowercase (for all of useNGrams, usePrefixSuffixNGrams, useSplitNGrams, and useSplitPrefixSuffixNGrams)
maxNGramLeng	int	-1	If this number is positive, n-grams above this size will not be used in the model
minNGramLeng	int	2	Must be positive. n-grams below this size will not be used in the model
partialNGramRegexp	String	null	If this is defined and the regexp matches, then n-grams are made only from the matching text (if no capturing groups are defined) or from the first capturing group of the regexp, if there is one. This substring is used for both useNGrams and usePrefixSuffixNGrams.
realValued	boolean	false	Treat this column as real-valued and do not perform any transforms on the feature value.	Value
logTransform	boolean	false	Treat this column as real-valued and use the log of the value as the feature value.	Log
logitTransform	boolean	false	Treat this column as real-valued and use the logit of the value as the feature value.	Logit
sqrtTransform	boolean	false	Treat this column as real-valued and use the square root of the value as the feature value.	Sqrt
filename	boolean	false	Treat this column as a filename (path) and then use the contents of that file (assumed to be plain text) in the calculation of features according to other flag specifications.
wordShape	String	none	Either "none" for no wordShape use, or the name of a word shape function recognized by `WordShapeClassifier.lookupShaper(String)`, such as "dan1" or "chris4". WordShape functions equivalence-class strings based on the pattern of letter, digit, and symbol characters that they contain. The details depend on the particular function chosen.	SHAPE-str
splitWordShape	String	none	Either "none" for no wordShape or the name of a word shape function recognized by `WordShapeClassifier.lookupShaper(String)`. This is applied to each "word" found by splitWordsRegexp or splitWordsTokenizerRegexp.	SSHAPE-str
featureMinimumSupport	int	0	A feature, that is, an (observed,class) pair, will only be included in the model providing it is seen a minimum of this number of times in the training data.
biasedHyperplane	String	null	If non-null, a sequence of comma-separated pairs of className prob. An item will only be classified to a certain class className if its probability of class membership exceeds the given conditional probability prob; otherwise it will be assigned to a different class. If this list of classes is exhaustive, and no condition is satisfied, then the most probable class is chosen.
printFeatures	String	null	Print out the features and their values for each instance to a file based on this name.
printClassifier	String	null	Style in which to print the classifier. One of: HighWeight, HighMagnitude, AllWeights, WeightHistogram, WeightDistribution. See LinearClassifier class for details.
printClassifierParam	int	100	A parameter to the printing style, which may give, for example the number of parameters to print (for HighWeight or HighMagnitude).
justify	boolean	false	For each test data item, print justification (weights) for active features used in classification.
exitAfterTrainingFeaturization	boolean	false	If true, the program exits after reading the training data (trainFile) and before building a classifier. This is useful in conjunction with printFeatures, if one only wants to convert data to features for use with another classifier.
intern	boolean	false	If true, (String) intern all of the (final) feature names. Recommended (this saves memory, but slows down feature generation in training).
cacheNGrams	boolean	false	If true, record the NGram features that correspond to a String (under the current option settings) and reuse them rather than recalculating if the String is seen again. Disrecommended (speeds training but can require enormous amounts of memory).
useNB	boolean	false	Use a Naive Bayes generative classifier (over all features) rather than a discriminative logistic regression classifier. (Set `useClassFeature` to true to get a prior term.)
useBinary	boolean	false	Use the binary classifier (i.e. use LogisticClassifierFactory, rather than LinearClassifierFactory) to get classifier
l1reg	double	0.0	If set to be larger than 0, uses L1 regularization
useAdaptL1	boolean	false	If true, uses adaptive L1 regularization to find value of l1reg that gives the desired number of features set by limitFeatures
l1regmin	double	0.0	Minimum L1 in search
l1regmax	double	500.0	Maximum L1 in search
featureWeightThreshold	double	0.0	Threshold of model weight at which feature is kept. "Unimportant" low weight features are discarded. (Currently only implemented for adaptL1.)
limitFeaturesLabels	String	null	If set, only include features for these labels in the desired number of features
limitFeatures	int	0	If set to be larger than 0, uses adaptive L1 regularization to find value of l1reg that gives the desired number of features
prior	String/int	quadratic	Type of prior (regularization penalty on weights). Possible values are null, "no", "quadratic", "huber", "quartic", "cosh", or "adapt". See `LogPrior` for more information.
useSum	boolean	false	Do optimization via summed conditional likelihood, rather than the product. (This is expensive, non-standard, and somewhat unstable, but can be quite effective: see Klein and Manning 2002 EMNLP paper.)
tolerance	double	1e-4	Convergence tolerance in parameter optimization
sigma	double	1.0	A parameter to several of the smoothing (i.e., regularization) methods, usually giving a degree of smoothing as a standard deviation (with small positive values being stronger smoothing, and bigger values weaker smoothing). However, for Naive Bayes models it is the amount of add-sigma smoothing, so a bigger number is more smoothing.
epsilon	double	0.01	Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear
useQN	boolean	true	Use Quasi-Newton optimization if true, otherwise use Conjugate Gradient optimization. Recommended.
QNsize	int	15	Number of previous iterations of Quasi-Newton to store (this increases memory use, but speeds convergence by letting the Quasi-Newton optimization more effectively approximate the second derivative).
featureFormat	boolean	false	Assumes the input file isn't text strings but already featurized. One column is treated as the class column (as defined by `goldAnswerColumn`), and all other columns are treated as features of the instance. (If answers are not present, set `goldAnswerColumn` to a negative number.)
trainFromSVMLight	boolean	false	Assumes the trainFile is in SVMLight format (see SVMLight web page for more information)
testFromSVMLight	boolean	false	Assumes the testFile is in SVMLight format
printSVMLightFormatTo	String	null	If non-null, print the featurized training data to an SVMLight format file (usually used with exitAfterTrainingFeaturization). This is just an option to write out data in a particular format. After that, you're on your own using some other piece of software that reads SVMlight format files.
crossValidationFolds	int	-1	If positive, the training data is divided into this many folds and cross-validation is done on the training data (prior to testing on test data, if it is also specified).
shuffleTrainingData	boolean	false	If true, the training data is shuffled prior to training and cross-validation. This is vital in cross-validation if the training data is otherwise sorted by class.
shuffleSeed	long	0	If non-zero, and the training data is being shuffled, this is used as the seed for the Random. Otherwise, System.nanoTime() is used.
csvFormat	boolean	false	If true, reads train and test file in csv format, with support for quoted fields.
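
The binning rule described for `countCharsBins` and `binnedLengths`, and the front-of-string tokenizing described for `splitWordsTokenizerRegexp` and `splitWordsIgnoreRegexp`, can be sketched with JDK classes alone. This is an illustration of the documented behavior under stated assumptions, not the library's implementation: the class `SplitAndBinSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: mirrors the documented behavior only.
final class SplitAndBinSketch {

    /** Bin i covers values above bounds[i-1] up to and including bounds[i]; the last bin is open-ended. */
    static int binIndex(int value, int[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (value <= bounds[i]) {
                return i;
            }
        }
        return bounds.length;
    }

    /** Cuts tokens off the front of s: skip what ignore matches, otherwise take what tokenizer matches. */
    static List<String> tokenize(String s, Pattern tokenizer, Pattern ignore) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            Matcher skip = ignore.matcher(s).region(pos, s.length());
            if (skip.lookingAt() && skip.end() > pos) {
                pos = skip.end(); // discard ignored characters
                continue;
            }
            Matcher tok = tokenizer.matcher(s).region(pos, s.length());
            if (tok.lookingAt() && tok.end() > pos) {
                tokens.add(tok.group());
                pos = tok.end();
            } else {
                // Neither pattern matched: emit one character (the documented tool also warns here).
                tokens.add(s.substring(pos, pos + 1));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern whitespace = Pattern.compile("\\s+");
        System.out.println(binIndex(3, new int[] {0, 2}));
        System.out.println(tokenize("abc123 x-y", Pattern.compile("([A-Za-z]+|[0-9]+|.)"), whitespace));
        // The first matching disjunct wins, so the order of alternatives matters.
        System.out.println(tokenize("ab", Pattern.compile("a|ab"), whitespace));
        System.out.println(tokenize("ab", Pattern.compile("ab|a"), whitespace));
    }
}
```

With bounds "0,2" the sketch yields three bins (0, 1-or-2, and 3-or-more), and the last two calls show why alternatives sharing a prefix should be ordered longest first.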

Author: Christopher Manning, Anna Rafferty, Angel Chang (add options for using l1reg)

Constructor Summary

Constructors
Constructor and Description

ColumnDataClassifier(Properties props)
Construct a ColumnDataClassifier.

ColumnDataClassifier(String filename)
Construct a ColumnDataClassifier.

Method Summary

Modifier and Type	Method and Description
`String`	`classOf(Datum<String,String> example)`
`Pair<Double,Double>`	`crossValidate(GeneralDataset<String,String> dataset, List<String[]> lineInfos)` Run cross-validation on a dataset, and return accuracy and macro-F1 scores.
`static void`	`main(String[] args)` Runs the ColumnDataClassifier from the command-line.
`Classifier<String,String>`	`makeClassifier(GeneralDataset<String,String> train)` Creates a classifier from training data.
`Datum<String,String>`	`makeDatumFromLine(String line)` Entry point for taking a String (formatted as a line of a TSV file) and translating it into a Datum of features.
`Datum<String,String>`	`makeDatumFromStrings(String[] strings)` Takes a String[] of elements and translates them into a Datum of features.
`Pair<GeneralDataset<String,String>,List<String[]>>`	`readAndReturnTrainingExamples(String fileName)` Read a set of training examples from a file, and return the data in a featurized form and in String form.
`Pair<GeneralDataset<String,String>,List<String[]>>`	`readTestExamples(String filename)` Read a data set from a file at test time, and return it.
`GeneralDataset<String,String>`	`readTrainingExamples(String fileName)` Read a set of training examples from a file, and return the data in a featurized form.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ColumnDataClassifier
```
public ColumnDataClassifier(String filename)
```
    Construct a ColumnDataClassifier.
    
    Parameters:
    
    filename - The file with properties which specifies all aspects of behavior. See the class documentation for details of the properties.
  - ColumnDataClassifier
```
public ColumnDataClassifier(Properties props)
```
    Construct a ColumnDataClassifier.
    
    Parameters:
    
    props - The properties object specifies all aspects of its behavior. See the class documentation for details of the properties.
- Method Detail
  - makeDatumFromLine
```
public Datum<String,String> makeDatumFromLine(String line)
```
    Entry point for taking a String (formatted as a line of a TSV file) and translating it into a Datum of features. If real-valued features are used, this method returns an RVFDatum; otherwise, categorical features are used.
    
    Parameters:
    
    line - Line of file
    
    Returns:
    
    A Datum (may be an RVFDatum; never null)
  - makeDatumFromStrings
```
public Datum<String,String> makeDatumFromStrings(String[] strings)
```
    Takes a String[] of elements and translates them into a Datum of features. If real-valued features are used, this method accesses makeRVFDatumFromLine and returns an RVFDatum; otherwise, categorical features are used.
    
    Parameters:
    
    strings - The elements that features are made from (the columns of a TSV/CSV file)
    
    Returns:
    
    A Datum (may be an RVFDatum; never null)
  - readTrainingExamples
```
public GeneralDataset<String,String> readTrainingExamples(String fileName)
```
    Read a set of training examples from a file, and return the data in a featurized form. If feature selection is asked for, the returned featurized form is after feature selection has been applied.
    
    Parameters:
    
    fileName - File with supervised training examples.
    
    Returns:
    
    A GeneralDataset, where the labels and features are Strings.
  - readAndReturnTrainingExamples
```
public Pair<GeneralDataset<String,String>,List<String[]>> readAndReturnTrainingExamples(String fileName)
```
    Read a set of training examples from a file, and return the data in a featurized form and in String form. If feature selection is asked for, the returned featurized form is after feature selection has been applied.
    
    Parameters:
    
    fileName - File with supervised training examples.
    
    Returns:
    
    A Pair of a GeneralDataset, where the labels and features are Strings and a List of the input examples
  - readTestExamples
```
public Pair<GeneralDataset<String,String>,List<String[]>> readTestExamples(String filename)
```
    Read a data set from a file at test time, and return it.
    
    Parameters:
    
    filename - The file to read the examples from.
    
    Returns:
    
    A Pair. The first item of the pair is the featurized data set, ready for passing to the classifier. The second item of the pair is simply each line of the file split into tab-separated columns. This is at present necessary for the built-in evaluation, which uses the gold class from here, and may also be helpful when wanting to print extra output about the classification process.
  - makeClassifier
```
public Classifier<String,String> makeClassifier(GeneralDataset<String,String> train)
```
    Creates a classifier from training data.
    
    Parameters:
    
    train - training data
    
    Returns:
    
    trained classifier
  - main
```
public static void main(String[] args)
                 throws IOException
```
    Runs the ColumnDataClassifier from the command-line. Usage:
    java edu.stanford.nlp.classify.ColumnDataClassifier -trainFile trainFile -testFile testFile [-useNGrams|-useString|-sigma sigma|...]
    or
    java edu.stanford.nlp.classify.ColumnDataClassifier -prop propFile
    
    Parameters:
    
    args - Command line arguments, as described in the class documentation
    
    Throws:
    
    IOException - If IO problems
  - crossValidate
```
public Pair<Double,Double> crossValidate(GeneralDataset<String,String> dataset,
                                         List<String[]> lineInfos)
```
    Run cross-validation on a dataset, and return accuracy and macro-F1 scores. The number of folds is given by the crossValidationFolds property.
    
    Parameters:
    
    dataset - The dataset of examples to cross-validate on.
    
    lineInfos - The String form of the items in the dataset. (Must be present.)
    
    Returns:
    
    Accuracy and macro F1
  - classOf
```
public String classOf(Datum<String,String> example)
```
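
For readers unfamiliar with the two scores that `crossValidate` returns, the following sketch computes accuracy and macro-averaged F1 from parallel lists of gold and predicted labels. It is not the library's evaluation code: `MacroF1Sketch` is a hypothetical class, and averaging over every label seen in either list is an assumption, since the page does not say which labels are averaged.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of CoreNLP.
final class MacroF1Sketch {

    /** Fraction of positions where the predicted label equals the gold label. */
    static double accuracy(List<String> gold, List<String> pred) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(pred.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    /** Unweighted mean of per-label F1 over all labels seen in gold or pred (an assumption). */
    static double macroF1(List<String> gold, List<String> pred) {
        Set<String> labels = new TreeSet<>(gold);
        labels.addAll(pred);
        if (labels.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String label : labels) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < gold.size(); i++) {
                boolean isGold = gold.get(i).equals(label);
                boolean isPred = pred.get(i).equals(label);
                if (isGold && isPred) tp++;
                else if (isPred) fp++;
                else if (isGold) fn++;
            }
            double denom = 2.0 * tp + fp + fn;
            sum += denom == 0 ? 0.0 : 2.0 * tp / denom; // F1 = 2TP / (2TP + FP + FN)
        }
        return sum / labels.size();
    }

    public static void main(String[] args) {
        List<String> gold = List.of("a", "a", "b", "b");
        List<String> pred = List.of("a", "b", "b", "b");
        System.out.println("accuracy = " + accuracy(gold, pred));
        System.out.println("macro-F1 = " + macroF1(gold, pred));
    }
}
```

Macro averaging weights every class equally, so it penalizes poor performance on rare classes more than accuracy does.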
