java.lang.Object
  edu.stanford.nlp.classify.ColumnDataClassifier
public class ColumnDataClassifier
ColumnDataClassifier provides a command-line interface for doing context-free (independent) classification of a series of data items, where each data item is represented by a line of a file, as a list of String variables, in tab-separated columns. Some features will interpret these variables as numbers, but the code is mainly oriented towards generating features for string classification. To designate a real-valued feature, use the realValued option described below. The classifier can be either a Bernoulli Naive Bayes model or a loglinear discriminative (i.e., maxent) model.
Input files are expected to be one data item per line with two or more columns indicating the class of the item and one or more predictive features. Columns are separated by tab characters. Tab and newline characters cannot occur inside field values (there is no escaping mechanism); any other characters are legal in field values. Typical usage:
java edu.stanford.nlp.classify.ColumnDataClassifier -prop propFile
or
java -mx300m edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile trainFile -testFile testFile -useNGrams|... > output
Example data can be found in the examples directory of the distributed classifier.
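The input format described above can be sketched in a few lines of plain Java. This is only an illustration, not the classifier's own parsing code: the class `TsvLineSketch`, its methods, and the sample line are hypothetical, and the label is assumed to sit in column 0 (the documented default for goldAnswerColumn).

```java
import java.util.Arrays;

/** Hypothetical helper, not part of ColumnDataClassifier: splits one input line into columns. */
final class TsvLineSketch {

    /** Splits a line on tab characters, keeping empty trailing fields. */
    static String[] columns(String line) {
        // The -1 limit stops String.split from discarding trailing empty columns.
        return line.split("\t", -1);
    }

    /** Returns the class label, assuming it sits in column 0 (the goldAnswerColumn default). */
    static String goldClass(String line) {
        return columns(line)[0];
    }

    public static void main(String[] args) {
        // Made-up data item: class label, a tab, then the string to classify.
        String line = "spam\tWin a free prize now";
        System.out.println(Arrays.toString(columns(line)));
        System.out.println(goldClass(line));
    }
}
```

Because there is no escaping mechanism, a plain split on the tab character is enough to recover the fields.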
In many instances, parameters can either be given on the command line or provided using a Properties file (specified on the command line with -prop propFile). Option names are the same as property names with a preceding dash. Boolean properties can simply be turned on via the command line. Parameters of types int, String, and double take a following argument after the option. Command-line parameters can only define features for the first column describing the datum. If you have multidimensional data, you need to use a properties file. Property names, as below, are either global (things like the testFile name) or are seen as properties that define features for the first data column (we count columns from 0, unlike the Unix cut command!). To specify features for a particular data column, precede a feature by a column number and then a period (for example, 3.wordShape=chris4). If no number is specified, then the default interpretation is column 0. Note that in properties files you must give a value to boolean properties (e.g., 2.useString=true); just giving the property name (as 2.useString) isn't sufficient.
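As a rough illustration of the column-prefix convention, the sketch below loads a small properties file with `java.util.Properties` and splits each key into a column number and a property name. The class `ColumnPropertySketch`, its methods, and the file name `train.tsv` are hypothetical; this is not how the classifier itself reads its configuration, and the sketch does not distinguish global properties from column 0 properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch: splits property keys into (column, property name). */
final class ColumnPropertySketch {

    /** Length of a leading "N." column prefix, or 0 if the key has none. */
    private static int prefixLength(String key) {
        int dot = key.indexOf('.');
        if (dot > 0 && key.substring(0, dot).chars().allMatch(Character::isDigit)) {
            return dot + 1;
        }
        return 0;
    }

    /** Column a key applies to: the number before the first period, else 0. */
    static int columnOf(String key) {
        int len = prefixLength(key);
        return len == 0 ? 0 : Integer.parseInt(key.substring(0, len - 1));
    }

    /** Property name with any leading column prefix removed. */
    static String nameOf(String key) {
        return key.substring(prefixLength(key));
    }

    public static void main(String[] args) throws IOException {
        // Boolean properties need an explicit value in a properties file.
        String propFile = "trainFile=train.tsv\n"
                + "2.useString=true\n"
                + "3.wordShape=chris4\n"
                + "useNGrams=true\n";
        Properties props = new Properties();
        props.load(new StringReader(propFile));
        for (String key : props.stringPropertyNames()) {
            System.out.println("column " + columnOf(key) + ": "
                    + nameOf(key) + " = " + props.getProperty(key));
        }
    }
}
```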
The following properties are recognized:
Property Name | Type | Default Value | Description | FeatName |
---|---|---|---|---|
loadClassifier | String | n/a | Path of serialized classifier file to load | |
serializeTo | String | n/a | Path to serialize classifier to | |
printTo | String | n/a | Path to print a text representation of the linear classifier to | |
trainFile | String | n/a | Path of file to use as training data | |
testFile | String | n/a | Path of file to use as test data | |
displayedColumn | int | 1 | Column number that will be printed out to stdout in the output next to the gold class and the chosen class. This is just an aide memoire. If the value is negative, nothing is printed. | |
goldAnswerColumn | int | 0 | Column number that contains the correct class for each data item (again, columns are numbered from 0 up). | |
groupingColumn | int | -1 | Column for grouping multiple data items for the purpose of computing ranking accuracy. This is appropriate when only one datum in a group can be correct, and the intention is to choose the highest probability one, rather than accepting all above a threshold. Multiple items in the same group must be contiguous in the test file (otherwise it would be necessary to cache probabilities and groups for the entire test file to check matches). If it is negative, no grouping column is used, and no ranking accuracy is reported. | |
rankingScoreColumn | int | -1 | If this parameter is non-negative and a groupingColumn is defined, then an average ranking score will be calculated by scoring the chosen candidate from a group according to its value in this column (for instance, the values of this column can be set to a mean reciprocal rank of 1.0 for the best answer, 0.5 for the second best and so on, or the value of this column can be a similarity score reflecting the similarity of the answer to the true answer). | |
rankingAccuracyClass | String | null | If this is defined and groupingColumn is non-negative, then the system will compute a ranking accuracy under the assumption that there is (at most) one assignment of this class for each group, and ranking accuracy counts the classifier as right if that datum is the one with highest probability according to the model. | |
useString | boolean | false | Gives you a feature for whole string s | S-str |
useClassFeature | boolean | false | Include a feature for the class (as a class marginal) | CLASS |
binnedLengths | String | null | If non-null, treat as a sequence of comma separated integer bounds, where items above the previous bound (if any) up to the next bound (inclusive) are binned (e.g., "1,5,15,30,60"). The feature represents the length of the String in this column. | Len-range |
binnedLengthsStatistics | boolean | false | If true, print to stderr contingency table of statistics for binnedLengths. | |
binnedValues | String | null | If non-null, treat as a sequence of comma separated double bounds, where data items above the previous bound up to the next bound (inclusive) are binned. If a value in this column isn't a legal double, then the value is treated as binnedValuesNaN. | Val-range |
binnedValuesNaN | double | -1.0 | If the value of a numeric binnedValues field is not a number, it will be given this value. | |
binnedValuesStatistics | boolean | false | If true, print to stderr a contingency table of statistics for binnedValues. | |
countChars | String | null | If non-null, count the number of occurrences of each character in the String, and make a feature for each character, binned according to countCharsBins | Char-ch-range |
countCharsBins | String | "0,1" | Treat as a sequence of comma separated integer bounds, where character counts above the previous bound up to and including the next bound are binned. For instance, a value of "0,2" will give 3 bins, dividing a character count into bins of 0, 1-or-2, and 3-or-more occurrences. | |
splitWordsRegexp | String | null | If defined, use this as a regular expression on which to split the whole string (as in the String.split() function, which will return the things between delimiters, and discard the delimiters). The resulting split-up "words" will be used in classifier features iff one of the other "useSplit" options is turned on. | |
splitWordsTokenizerRegexp | String | null | If defined, use this as a regular expression to cut initial pieces off a String. This regular expression should always match the String, and the size of the token is the number of characters matched. So, for example, one can group letter and number characters but do nothing else with a regular expression like ([A-Za-z]+|[0-9]+|.). (If the regular expression doesn't match, the first character of the string is treated as a one character word, and then matching is tried again, but in this case a warning message is printed.) Note that, for Java regular expressions with disjunctions like this, the match is the first matching disjunction, not the longest matching disjunction, so patterns with common prefixes need to be ordered from most specific (longest) to least specific (shortest). The resulting split-up "words" will be used in classifier features iff one of the other "useSplit" options is turned on. Note that, as usual for Java String processing, backslashes must be doubled in the regular expressions that you write. | |
splitWordsIgnoreRegexp | String | null | If defined, this regexp is used to determine character sequences which should not be returned as tokens when using the splitWordsTokenizerRegexp. Typically, these might be whitespace tokens (i.e., \\s+). | |
useSplitWords | boolean | false | Make features from the "words" that are returned by dividing the string on splitWordsRegexp or splitWordsTokenizerRegexp. Requires splitWordsRegexp or splitWordsTokenizerRegexp. | SW-str |
useLowercaseSplitWords | boolean | false | Make features from the "words" that are returned by dividing the string on splitWordsRegexp or splitWordsTokenizerRegexp and then lowercasing the result. Requires splitWordsRegexp or splitWordsTokenizerRegexp. Note that this can be specified independently of useSplitWords. You can put either or both original cased and lowercased words in as features. | SW-str |
useSplitWordPairs | boolean | false | Make features from the pairs of adjacent "words" that are returned by dividing the string into splitWords. Requires splitWordsRegexp or splitWordsTokenizerRegexp. | SWP-str1-str2 |
maxWordNGramLeng | int | -1 | If this number is positive, word n-grams above this size will not be used in the model | |
minWordNGramLeng | int | 1 | Must be positive. Word n-grams below this size will not be used in the model | |
wordNGramBoundaryRegexp | String | null | If this is defined and the regexp matches, then the ngram stops | |
useSplitFirstLastWords | boolean | false | Make features from the first and last "words" that are returned as splitWords. Requires splitWordsRegexp or splitWordsTokenizerRegexp. | SFW-str, SLW-str |
useSplitNGrams | boolean | false | Make features from letter n-grams - internal as well as edge all treated the same - after the data string has been split into tokens. Requires splitWordsRegexp or splitWordsTokenizerRegexp. | S#-str |
useSplitPrefixSuffixNGrams | boolean | false | Make features from prefixes and suffixes after splitting with splitWordsRegexp. Requires splitWordsRegexp or splitWordsTokenizerRegexp. | S#B-str, S#E-str |
useNGrams | boolean | false | Make features from letter n-grams - internal as well as edge all treated the same. | #-str |
usePrefixSuffixNGrams | boolean | false | Make features from prefix and suffix strings. | #B-str, #E-str |
lowercase | boolean | false | Make the input string lowercase so all features work unicase | |
lowercaseNGrams | boolean | false | Make features from letter n-grams all lowercase (for both useNGrams and usePrefixSuffixNGrams) | |
maxNGramLeng | int | -1 | If this number is positive, n-grams above this size will not be used in the model | |
minNGramLeng | int | 2 | Must be positive. N-grams below this size will not be used in the model | |
partialNGramRegexp | String | null | If this is defined and the regexp matches, then n-grams are made only from the matching text (if no capturing groups are defined) or from the first capturing group of the regexp, if there is one. This substring is used for both useNGrams and usePrefixSuffixNGrams. | |
realValued | boolean | false | Treat this column as real-valued and do not perform any transforms on the feature value. | Value |
logTransform | boolean | false | Treat this column as real-valued and use the log of the value as the feature value. | Log |
logitTransform | boolean | false | Treat this column as real-valued and use the logit of the value as the feature value. | Logit |
sqrtTransform | boolean | false | Treat this column as real-valued and use the square root of the value as the feature value. | Sqrt |
wordShape | String | none | Either "none" for no wordShape use, or the name of a word shape function recognized by WordShapeClassifier.lookupShaper(String), such as "dan1" or "chris4". WordShape functions equivalence-class strings based on the pattern of letter, digit, and symbol characters that they contain. The details depend on the particular function chosen. | SHAPE-str |
splitWordShape | String | none | Either "none" for no wordShape or the name of a word shape function recognized by WordShapeClassifier.lookupShaper(String). This is applied to each "word" found by splitWordsRegexp or splitWordsTokenizerRegexp. | SSHAPE-str |
featureMinimumSupport | int | 0 | A feature, that is, an (observed,class) pair, will only be included in the model providing it is seen a minimum of this number of times in the training data. | |
biasedHyperplane | String | null | If non-null, a sequence of comma-separated pairs of className prob. An item will only be classified to a certain class className if its probability of class membership exceeds the given conditional probability prob; otherwise it will be assigned to a different class. If this list of classes is exhaustive, and no condition is satisfied, then the most probable class is chosen. | |
printFeatures | String | null | Print out the features of the classifier to a file based on this name. | |
printClassifier | String | null | Style in which to print the classifier. One of: HighWeight, HighMagnitude, AllWeights, WeightHistogram, WeightDistribution. See LinearClassifier class for details. | |
printClassifierParam | int | 100 | A parameter to the printing style, which may give, for example the number of parameters to print (for HighWeight or HighMagnitude). | |
justify | boolean | false | For each test data item, print justification (weights) for active features used in classification. | |
exitAfterTrainingFeaturization | boolean | false | If true, the program exits after reading the training data (trainFile) and before building a classifier. This is useful in conjunction with printFeatures, if one only wants to convert data to features for use with another classifier. | |
intern | boolean | false | If true, (String) intern all of the (final) feature names. Recommended (this saves memory, but slows down feature generation in training). | |
cacheNGrams | boolean | false | If true, record the NGram features that correspond to a String (under the current option settings) and reuse them rather than recalculating if the String is seen again. Not recommended (speeds training but can require enormous amounts of memory). | |
useNB | boolean | false | Use a Naive Bayes generative classifier (over all features) rather than a discriminative logistic regression classifier. (Set useClassFeature to true to get a prior term.) | |
useBinary | boolean | false | Use the binary classifier (i.e. use LogisticClassifierFactory, rather than LinearClassifierFactory) to get classifier | |
l1reg | double | 0.0 | If set to be larger than 0, uses L1 regularization | |
useAdaptL1 | boolean | false | If true, uses adaptive L1 regularization to find value of l1reg that gives the desired number of features set by limitFeatures | |
l1regmin | double | 0.0 | Minimum L1 in search | |
l1regmax | double | 500.0 | Maximum L1 in search | |
featureWeightThreshold | double | 0.0 | Threshold of model weight at which feature is kept. "Unimportant" low weight features are discarded. (Currently only implemented for adaptL1.) | |
limitFeaturesLabels | String | null | If set, only include features for these labels in the desired number of features | |
limitFeatures | int | 0 | If set to be larger than 0, uses adaptive L1 regularization to find value of l1reg that gives the desired number of features | |
prior | String/int | quadratic | Type of prior (regularization penalty on weights). Possible values are null, "no", "quadratic", "huber", "quartic", "cosh", or "adapt". See LogPrior for more information. | |
useSum | boolean | false | Do optimization via summed conditional likelihood, rather than the product. (This is expensive, non-standard, and somewhat unstable, but can be quite effective: see Klein and Manning 2002 EMNLP paper.) | |
tolerance | double | 1e-4 | Convergence tolerance in parameter optimization | |
sigma | double | 1.0 | A parameter to several of the smoothing (i.e., regularization) methods, usually giving a degree of smoothing as a standard deviation (with small positive values being stronger smoothing, and bigger values weaker smoothing) | |
epsilon | double | 0.01 | Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear | |
useQN | boolean | true | Use Quasi-Newton optimization if true, otherwise use Conjugate Gradient optimization. Recommended. | |
QNsize | int | 15 | Number of previous iterations of Quasi-Newton to store (this increases memory use, but speeds convergence by letting the Quasi-Newton optimization more effectively approximate the second derivative). | |
featureFormat | boolean | false | Assumes the input file isn't text strings but is already featurized. One column is treated as the class column (as defined by goldAnswerColumn), and all other columns are treated as features of the instance. (If answers are not present, set goldAnswerColumn to a negative number.) | |
trainFromSVMLight | boolean | false | Assumes the trainFile is in SVMLight format (see SVMLight webpage for more information) | |
testFromSVMLight | boolean | false | Assumes the testFile is in SVMLight format | |
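Two of the behaviors described in the table are easier to see in code: the inclusive-upper-bound binning used by countCharsBins (and, by the same description, binnedLengths), and the cut-from-the-front tokenization of splitWordsTokenizerRegexp. The following is a minimal sketch written from the descriptions above, not the library's implementation; the class `FeatureOptionSketch` and all of its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of two behaviors from the property table; not the library's code. */
final class FeatureOptionSketch {

    /** Parses a comma-separated list of integer bounds such as "0,2". */
    static int[] parseBounds(String spec) {
        String[] parts = spec.split(",");
        int[] bounds = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bounds[i] = Integer.parseInt(parts[i].trim());
        }
        return bounds;
    }

    /** Index of the bin holding value: each bound is the inclusive upper edge of a bin. */
    static int binIndex(int value, int[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (value <= bounds[i]) {
                return i;
            }
        }
        return bounds.length; // above the last bound
    }

    /** Repeatedly cuts the piece matched by the regexp off the front of the remaining text. */
    static List<String> tokenize(String text, String tokenizerRegexp) {
        Matcher m = Pattern.compile(tokenizerRegexp).matcher(text);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (m.lookingAt() && m.end() > pos) {
                tokens.add(m.group());
                pos = m.end();
            } else {
                // No match here: fall back to a one-character word (the documented
                // behavior also prints a warning, which this sketch omits).
                tokens.add(text.substring(pos, pos + 1));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        int[] bounds = parseBounds("0,2");
        for (int count = 0; count <= 3; count++) {
            System.out.println("count " + count + " -> bin " + binIndex(count, bounds));
        }
        System.out.println(tokenize("abc123!", "([A-Za-z]+|[0-9]+|.)"));
        // Alternation order matters: the first alternative that matches wins.
        System.out.println(tokenize("abcd", "(ab|abcd)"));
        System.out.println(tokenize("abcd", "(abcd|ab)"));
    }
}
```

With bounds "0,2" the counts 0, 1, 2, and 3 fall into bins 0, 1, 1, and 2, which is the three-bin split described for countCharsBins. With the pattern (ab|abcd) the string abcd is cut into ab, c, d, whereas (abcd|ab) yields the single token abcd, which is why alternatives sharing a prefix should be listed longest first.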
Constructor Summary | |
---|---|
ColumnDataClassifier(java.util.Properties props) | Construct a ColumnDataClassifier. |
ColumnDataClassifier(java.lang.String filename) | Construct a ColumnDataClassifier. |
Method Summary | |
---|---|
static void | main(java.lang.String[] args): Runs the ColumnDataClassifier from the command line. |
Classifier<java.lang.String,java.lang.String> | makeClassifier(GeneralDataset<java.lang.String,java.lang.String> train): Creates a classifier from training data. |
Classifier<java.lang.String,java.lang.String> | makeClassifierAdaptL1(GeneralDataset<java.lang.String,java.lang.String> train): Creates a classifier from training data. |
Datum<java.lang.String,java.lang.String> | makeDatumFromLine(java.lang.String line, int lineNo): Entry point for taking a String (formatted as a line of a TSV file) and translating it into a Datum of features. |
static java.lang.String[] | makeSimpleLineInfo(java.lang.String line, int lineNo) |
Pair<GeneralDataset<java.lang.String,java.lang.String>,java.util.List<java.lang.String[]>> | readTestExamples(java.lang.String filename) |
GeneralDataset<java.lang.String,java.lang.String> | readTrainingExamples(java.lang.String fileName) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public ColumnDataClassifier(java.lang.String filename)
filename - The file with properties which specifies all aspects of behavior. See the class documentation for details of the properties.
public ColumnDataClassifier(java.util.Properties props)
props - The properties object specifies all aspects of its behavior. See the class documentation for details of the properties.
Method Detail |
---|
public Datum<java.lang.String,java.lang.String> makeDatumFromLine(java.lang.String line, int lineNo)
line - Line of file
lineNo - The line number. This is just used in error messages if there is an input format problem. You can make it 0.
public static java.lang.String[] makeSimpleLineInfo(java.lang.String line, int lineNo)
public GeneralDataset<java.lang.String,java.lang.String> readTrainingExamples(java.lang.String fileName)
public Pair<GeneralDataset<java.lang.String,java.lang.String>,java.util.List<java.lang.String[]>> readTestExamples(java.lang.String filename)
public Classifier<java.lang.String,java.lang.String> makeClassifierAdaptL1(GeneralDataset<java.lang.String,java.lang.String> train)
train - training data
public Classifier<java.lang.String,java.lang.String> makeClassifier(GeneralDataset<java.lang.String,java.lang.String> train)
train - training data
public static void main(java.lang.String[] args) throws java.io.IOException
Runs the ColumnDataClassifier from the command line. Usage:
java edu.stanford.nlp.classify.ColumnDataClassifier -trainFile trainFile -testFile testFile [-useNGrams|-useString|-sigma sigma|...]
or
java edu.stanford.nlp.classify.ColumnDataClassifier -prop propFile
args - Command line arguments, as described in the class documentation
Throws: java.io.IOException
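The class documentation says that option names are property names with a preceding dash, that boolean options need no argument, and that int, String, and double options take one following argument. A minimal sketch of that mapping is shown below. The class `ArgsToPropertiesSketch` and its methods are hypothetical and are not the parser this class uses; in particular, treating a token such as -1 as a value rather than an option is an assumption of this sketch.

```java
import java.util.Properties;

/** Hypothetical sketch: turns "-name value" and bare "-flag" arguments into Properties. */
final class ArgsToPropertiesSketch {

    /** A token is an option if it starts with a dash and does not look like a negative number. */
    static boolean isOption(String token) {
        return token.length() > 1
                && token.charAt(0) == '-'
                && !Character.isDigit(token.charAt(1))
                && token.charAt(1) != '.';
    }

    static Properties toProperties(String[] args) {
        Properties props = new Properties();
        for (int i = 0; i < args.length; i++) {
            if (!isOption(args[i])) {
                continue; // stray value with no preceding option; ignored in this sketch
            }
            String name = args[i].substring(1);
            boolean hasValue = i + 1 < args.length && !isOption(args[i + 1]);
            if (hasValue) {
                props.setProperty(name, args[++i]); // consume the following argument
            } else {
                props.setProperty(name, "true");    // bare boolean option
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up argument list; train.tsv is a placeholder file name.
        String[] example = {"-trainFile", "train.tsv", "-useNGrams", "-sigma", "1.0"};
        Properties props = toProperties(example);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```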