edu.stanford.nlp.classify
Class RVFDataset<L,F>

java.lang.Object
  extended by edu.stanford.nlp.classify.GeneralDataset<L,F>
      extended by edu.stanford.nlp.classify.RVFDataset<L,F>
Type Parameters:
L - The type of the labels in the Dataset
F - The type of the features in the Dataset
All Implemented Interfaces:
java.io.Serializable, java.lang.Iterable<RVFDatum<L,F>>

public class RVFDataset<L,F>
extends GeneralDataset<L,F>
implements java.lang.Iterable<RVFDatum<L,F>>

An interface class for ClassifierFactory. It incrementally builds a more memory-efficient representation of a List of RVFDatum objects, for the purpose of training a Classifier with a ClassifierFactory.
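
The following is a minimal sketch of what such a compact representation can look like, not the RVFDataset implementation. It assumes String features and stores each datum as a pair of parallel arrays (feature ids and values), mirroring the int[][] data and double[][] values arguments of the fully specified constructor. The class SparseStoreSketch and all of its members are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: interns features as ints and stores each datum as parallel arrays. */
class SparseStoreSketch {
    private final Map<String, Integer> featureIndex = new HashMap<>();
    private final List<int[]> data = new ArrayList<>();      // feature ids per datum
    private final List<double[]> values = new ArrayList<>(); // matching feature values

    /** Adds one datum given as feature -> value and returns its row number. */
    int add(Map<String, Double> features) {
        int[] ids = new int[features.size()];
        double[] vals = new double[features.size()];
        int i = 0;
        for (Map.Entry<String, Double> e : features.entrySet()) {
            // Assign the next free integer id the first time a feature is seen.
            Integer id = featureIndex.get(e.getKey());
            if (id == null) {
                id = featureIndex.size();
                featureIndex.put(e.getKey(), id);
            }
            ids[i] = id;
            vals[i] = e.getValue();
            i++;
        }
        data.add(ids);
        values.add(vals);
        return data.size() - 1;
    }

    int numFeatures() { return featureIndex.size(); }
    int size() { return data.size(); }
    int[] row(int i) { return data.get(i); }
    double[] rowValues(int i) { return values.get(i); }
}
```

Each feature name is stored once in the index, and every datum only carries small int and double arrays, which is where the memory saving over a list of full datum objects would come from.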

Author:
Jenny Finkel (jrfinkel@stanford.edu), Rajat Raina (added methods to record data sources and ids), Anna Rafferty (various refactoring with GeneralDataset/Dataset), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)
See Also:
Serialized Form

Field Summary
 
Fields inherited from class edu.stanford.nlp.classify.GeneralDataset
data, featureIndex, labelIndex, labels, size
 
Constructor Summary
RVFDataset()
           
RVFDataset(Index<F> featureIndex, Index<L> labelIndex)
           
RVFDataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data, double[][] values)
          Constructor that fully specifies a Dataset.
RVFDataset(int numDatums)
           
RVFDataset(int numDatums, Index<F> featureIndex, Index<L> labelIndex)
           
 
Method Summary
 void add(Datum<L,F> d)
           
 void add(Datum<L,F> d, java.lang.String src, java.lang.String id)
           
 void applyFeatureCountThreshold(int k)
          Applies a feature count threshold to the RVFDataset.
 void applyFeatureMaxCountThreshold(int k)
          Applies a feature max count threshold to the RVFDataset.
 void clear()
          Resets the Dataset so that it is empty and ready to collect data.
 void clear(int numDatums)
          Resets the Dataset so that it is empty and ready to collect data.
 void ensureRealValues()
          Checks if the dataset has any unbounded values.
 RVFDatum<L,F> getDatum(int index)
           
 RVFDatum<L,F> getRVFDatum(int index)
           
 java.lang.String getRVFDatumId(int index)
           
 java.lang.String getRVFDatumSource(int index)
           
 double[][] getValuesArray()
           
protected  void initialize(int numDatums)
          This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums. It should be accessed only by appropriate methods within the class, such as clear(), which take care of other parts of the emptying of data.
 java.util.Iterator<RVFDatum<L,F>> iterator()
          
static void main(java.lang.String[] args)
           
 void printFullFeatureMatrix(java.io.PrintWriter pw)
          Prints the full feature matrix in tab-delimited form.
 void printFullFeatureMatrixWithValues(java.io.PrintWriter pw)
          Modification of printFullFeatureMatrix to correct bugs & print values (Rajat).
 void printSparseFeatureMatrix()
          Prints the sparse feature matrix using printSparseFeatureMatrix(PrintWriter) to System.out.
 void printSparseFeatureMatrix(java.io.PrintWriter pw)
          Prints a sparse feature matrix representation of the Dataset.
 void printSparseFeatureValues(int datumNo, java.io.PrintWriter pw)
          Prints a sparse feature-value output of the Dataset.
 void printSparseFeatureValues(java.io.PrintWriter pw)
          Prints a sparse feature-value output of the Dataset.
 void randomize(int randomSeed)
          Randomizes the data array in place. Needs to be redefined here because we need to randomize the values as well.
 void readSVMLightFormat(java.io.File file)
          Read SVM-light formatted data into this dataset.
static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename)
          Constructs a Dataset by reading in a file in SVM light format.
static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex)
          Constructs a Dataset by reading in a file in SVM light format.
static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename, java.util.List<java.lang.String> lines)
          Constructs a Dataset by reading in a file in SVM light format.
 RVFDataset<L,F> scaleDataset(RVFDataset<L,F> dataset)
          Scales the values of each feature in each datum linearly using the min and max values found in the training set.
 RVFDataset<L,F> scaleDatasetGaussian(RVFDataset<L,F> dataset)
           
 RVFDatum<L,F> scaleDatum(RVFDatum<L,F> datum)
          Scales the values of each feature linearly using the min and max values found in the training set.
 RVFDatum<L,F> scaleDatumGaussian(RVFDatum<L,F> datum)
           
 void scaleFeatures()
          Scales feature values linearly such that each feature value lies between 0 and 1.
 void scaleFeaturesGaussian()
           
 void selectFeaturesFromSet(java.util.Set<F> featureSet)
          Removes all features from the dataset that are not in featureSet.
 Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
           
 Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start, int end)
           
 void summaryStatistics()
          Prints some summary statistics to stderr for the Dataset.
static RVFDatum<java.lang.String,java.lang.String> svmLightLineToRVFDatum(java.lang.String l)
           
 java.lang.String toString()
           
 java.lang.String toSummaryString()
           
 void writeSVMLightFormat(java.io.File file)
          Write the dataset in SVM-light format to the file.
 
Methods inherited from class edu.stanford.nlp.classify.GeneralDataset
addAll, featureIndex, getDataArray, getFeatureCounts, getLabelsArray, labelIndex, labelIterator, makeSvmLabelMap, mapDataset, mapDataset, mapDatum, numClasses, numFeatures, numFeatureTokens, numFeatureTypes, printSVMLightFormat, printSVMLightFormat, sampleDataset, size, trimData, trimLabels, trimToSize, trimToSize, trimToSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RVFDataset

public RVFDataset()

RVFDataset

public RVFDataset(int numDatums,
                  Index<F> featureIndex,
                  Index<L> labelIndex)

RVFDataset

public RVFDataset(Index<F> featureIndex,
                  Index<L> labelIndex)

RVFDataset

public RVFDataset(int numDatums)

RVFDataset

public RVFDataset(Index<L> labelIndex,
                  int[] labels,
                  Index<F> featureIndex,
                  int[][] data,
                  double[][] values)
Constructor that fully specifies a Dataset. Needed this for MulticlassDataset.

Method Detail

split

public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
Specified by:
split in class GeneralDataset<L,F>

scaleFeaturesGaussian

public void scaleFeaturesGaussian()

scaleFeatures

public void scaleFeatures()
Scales feature values linearly such that each feature value lies between 0 and 1.
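
As an illustration of linear (min-max) scaling, here is a minimal sketch on a dense matrix. It is not the RVFDataset code: the class MinMaxScaleSketch is a hypothetical name, and mapping a constant column to 0 is an assumption of this sketch, since the documentation does not say how such features are handled.

```java
/** Hypothetical sketch of column-wise min-max scaling on a dense matrix. */
class MinMaxScaleSketch {
    /** Rescales each column in place to [0, 1]; constant columns become 0 (an assumption). */
    static void scaleColumns(double[][] m) {
        if (m.length == 0) {
            return;
        }
        int cols = m[0].length;
        for (int j = 0; j < cols; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : m) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (double[] row : m) {
                // (x - min) / (max - min) maps min to 0 and max to 1.
                row[j] = range == 0.0 ? 0.0 : (row[j] - min) / range;
            }
        }
    }
}
```

For a column holding 1, 3 and 5, the scaled values are 0, 0.5 and 1.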


ensureRealValues

public void ensureRealValues()
Checks whether the dataset has any unbounded values. It is always good to call this before training a model on the dataset; that way, one can avoid seeing the infamous 4's that get printed by the QuasiNewton Method when NaNs exist in the data! -Ramesh
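
A check of this kind can be sketched as a scan over the value arrays for NaN or infinite entries. This is only an illustration: the class RealValueCheckSketch is hypothetical, and what ensureRealValues() actually does when it finds such a value is not stated here.

```java
/** Hypothetical sketch: locate the first NaN or infinite entry in a values array. */
class RealValueCheckSketch {
    /** Returns {row, column} of the first non-finite entry, or null if every entry is finite. */
    static int[] firstNonFinite(double[][] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values[i].length; j++) {
                double v = values[i][j];
                if (Double.isNaN(v) || Double.isInfinite(v)) {
                    return new int[] {i, j};
                }
            }
        }
        return null;
    }
}
```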


scaleDataset

public RVFDataset<L,F> scaleDataset(RVFDataset<L,F> dataset)
Scales the values of each feature in each datum linearly using the min and max values found in the training set. NOTE1: Not guaranteed to be between 0 and 1 for a test datum. NOTE2: Also filters out features from each datum that are not seen at training time.

Parameters:
dataset - the dataset to scale
Returns:
a new dataset

scaleDatum

public RVFDatum<L,F> scaleDatum(RVFDatum<L,F> datum)
Scales the values of each feature linearly using the min and max values found in the training set. NOTE1: Not guaranteed to be between 0 and 1 for a test datum. NOTE2: Also filters out features from the datum that are not seen at training time.

Parameters:
datum - the datum to scale
Returns:
a new datum
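
The two notes above can be made concrete with a small sketch. It assumes String features held in a Map, and the class TrainRangeScalerSketch with its observe and scale methods is hypothetical, not part of this API. Ranges are learned from training data only, so a test value outside the training range scales to a number outside [0, 1], and a feature with no recorded range is dropped.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: scale new datums with min/max learned from training datums. */
class TrainRangeScalerSketch {
    private final Map<String, double[]> range = new HashMap<>(); // feature -> {min, max}

    /** Records the min and max of every feature in one training datum. */
    void observe(Map<String, Double> datum) {
        for (Map.Entry<String, Double> e : datum.entrySet()) {
            double v = e.getValue();
            double[] r = range.get(e.getKey());
            if (r == null) {
                range.put(e.getKey(), new double[] {v, v});
            } else {
                r[0] = Math.min(r[0], v);
                r[1] = Math.max(r[1], v);
            }
        }
    }

    /** Returns a new scaled datum; features never observed in training are left out. */
    Map<String, Double> scale(Map<String, Double> datum) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : datum.entrySet()) {
            double[] r = range.get(e.getKey());
            if (r == null) {
                continue; // not seen at training time
            }
            double span = r[1] - r[0];
            // Constant training features map to 0 here; that choice is an assumption.
            out.put(e.getKey(), span == 0.0 ? 0.0 : (e.getValue() - r[0]) / span);
        }
        return out;
    }
}
```

With a training range of 2 to 6 for some feature, a test value of 8 scales to 1.5, which is the situation NOTE1 warns about.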

scaleDatasetGaussian

public RVFDataset<L,F> scaleDatasetGaussian(RVFDataset<L,F> dataset)

scaleDatumGaussian

public RVFDatum<L,F> scaleDatumGaussian(RVFDatum<L,F> datum)

split

public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start,
                                                           int end)
Specified by:
split in class GeneralDataset<L,F>

add

public void add(Datum<L,F> d)
Specified by:
add in class GeneralDataset<L,F>

add

public void add(Datum<L,F> d,
                java.lang.String src,
                java.lang.String id)

getDatum

public RVFDatum<L,F> getDatum(int index)
Specified by:
getDatum in class GeneralDataset<L,F>

getRVFDatum

public RVFDatum<L,F> getRVFDatum(int index)
Specified by:
getRVFDatum in class GeneralDataset<L,F>
Returns:
the datum at the given index. Note that this returns a new RVFDatum object, not the original RVFDatum that was added to the dataset.

getRVFDatumSource

public java.lang.String getRVFDatumSource(int index)

getRVFDatumId

public java.lang.String getRVFDatumId(int index)

clear

public void clear()
Resets the Dataset so that it is empty and ready to collect data.

Overrides:
clear in class GeneralDataset<L,F>

clear

public void clear(int numDatums)
Resets the Dataset so that it is empty and ready to collect data.

Overrides:
clear in class GeneralDataset<L,F>
Parameters:
numDatums - initial capacity of dataset

initialize

protected void initialize(int numDatums)
Description copied from class: GeneralDataset
This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums. It should be accessed only by appropriate methods within the class, such as clear(), which take care of other parts of the emptying of data.

Specified by:
initialize in class GeneralDataset<L,F>
Parameters:
numDatums - initial capacity of dataset

summaryStatistics

public void summaryStatistics()
Prints some summary statistics to stderr for the Dataset.

Specified by:
summaryStatistics in class GeneralDataset<L,F>

printFullFeatureMatrix

public void printFullFeatureMatrix(java.io.PrintWriter pw)
Prints the full feature matrix in tab-delimited form. These can be BIG matrices, so be careful! [Can also use printFullFeatureMatrixWithValues]


printFullFeatureMatrixWithValues

public void printFullFeatureMatrixWithValues(java.io.PrintWriter pw)
Modification of printFullFeatureMatrix to correct bugs & print values (Rajat). Prints the full feature matrix in tab-delimited form. These can be BIG matrices, so be careful!
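
To show why a full matrix can get big, here is a sketch that expands sparse rows into dense tab-delimited lines: every datum yields one column per feature in the index, whether or not the feature is present. The class FullMatrixSketch is hypothetical, and the exact layout printed by the real method (headers, label columns) is not documented here, so the format below is an assumption.

```java
/** Hypothetical sketch: expand sparse rows into dense tab-delimited text. */
class FullMatrixSketch {
    /** Renders one line per datum with numFeatures tab-separated values. */
    static String render(int[][] data, double[][] values, int numFeatures) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            double[] dense = new double[numFeatures]; // absent features stay 0.0
            for (int k = 0; k < data[i].length; k++) {
                dense[data[i][k]] = values[i][k];
            }
            for (int j = 0; j < numFeatures; j++) {
                if (j > 0) {
                    sb.append('\t');
                }
                sb.append(dense[j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

The output has size() times numFeatures cells, so a dataset with many sparse features produces far more text than its in-memory representation suggests.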


readSVMLightFormat

public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename)
Constructs a Dataset by reading in a file in SVM light format.
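
For orientation, a line in SVM light format has the shape "label feature:value feature:value ... # comment". The sketch below parses one such line into a label and a feature-to-value map. It is an illustration only: SvmLightLineSketch is a hypothetical class, it keeps labels and feature names as strings, and it does not treat special tokens such as qid differently, which the real reader may or may not do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: parse one SVM light style line into a label and feature values. */
class SvmLightLineSketch {
    final String label;
    final Map<String, Double> features = new LinkedHashMap<>();

    SvmLightLineSketch(String line) {
        // Drop a trailing "# ..." comment, if any.
        int hash = line.indexOf('#');
        if (hash >= 0) {
            line = line.substring(0, hash);
        }
        String[] tokens = line.trim().split("\\s+");
        label = tokens[0];
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected feature:value but got " + tokens[i]);
            }
            String feature = tokens[i].substring(0, colon);
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            features.put(feature, value);
        }
    }
}
```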


readSVMLightFormat

public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename,
                                                                               java.util.List<java.lang.String> lines)
Constructs a Dataset by reading in a file in SVM light format. The lines parameter is filled with the lines of the file for further processing (if lines is null, it is assumed no line information is desired).


readSVMLightFormat

public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename,
                                                                               Index<java.lang.String> featureIndex,
                                                                               Index<java.lang.String> labelIndex)
Constructs a Dataset by reading in a file in SVM light format. The created dataset has the same feature and label index as given.


selectFeaturesFromSet

public void selectFeaturesFromSet(java.util.Set<F> featureSet)
Removes all features from the dataset that are not in featureSet.

Parameters:
featureSet - the set of features to keep
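
On a single datum, the effect can be sketched as keeping only the entries whose feature is in the set. The class FeatureSelectSketch is hypothetical, and unlike the real method, which modifies the dataset, this sketch returns a filtered copy of a Map-based datum.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: keep only the features of a datum that appear in featureSet. */
class FeatureSelectSketch {
    /** Returns a filtered copy; the input map is left unchanged. */
    static Map<String, Double> select(Map<String, Double> datum, Set<String> featureSet) {
        Map<String, Double> out = new LinkedHashMap<>(datum);
        // retainAll on the key view removes every entry whose key is not in featureSet.
        out.keySet().retainAll(featureSet);
        return out;
    }
}
```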

applyFeatureCountThreshold

public void applyFeatureCountThreshold(int k)
Applies a feature count threshold to the RVFDataset. All features that occur fewer than k times are expunged.

Overrides:
applyFeatureCountThreshold in class GeneralDataset<L,F>
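
A count threshold can be sketched as a two-step process: count how often each feature occurs, then keep the features that reach k. The class CountThresholdSketch is hypothetical, and counting a feature once per datum that contains it is an assumption of this sketch; the documentation above does not say whether occurrences are counted per datum or per value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: find the features that survive a count threshold of k. */
class CountThresholdSketch {
    /** Returns the features that appear in at least k datums. */
    static Set<String> frequentFeatures(List<Map<String, Double>> data, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Double> datum : data) {
            for (String feature : datum.keySet()) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> keep = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= k) { // fewer than k occurrences means expunged
                keep.add(e.getKey());
            }
        }
        return keep;
    }
}
```

A max count threshold would be the mirror image under the same assumption: keep a feature only when its count is at most k.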

applyFeatureMaxCountThreshold

public void applyFeatureMaxCountThreshold(int k)
Applies a feature max count threshold to the RVFDataset. All features that occur more than k times are expunged.

Overrides:
applyFeatureMaxCountThreshold in class GeneralDataset<L,F>

svmLightLineToRVFDatum

public static RVFDatum<java.lang.String,java.lang.String> svmLightLineToRVFDatum(java.lang.String l)

readSVMLightFormat

public void readSVMLightFormat(java.io.File file)
Read SVM-light formatted data into this dataset. A strict SVM-light format is expected, where labels and features are both encoded as integers. These integers are converted into the dataset label and feature types using the indexes stored in this dataset.

Parameters:
file - The file from which the data should be read.

writeSVMLightFormat

public void writeSVMLightFormat(java.io.File file)
                         throws java.io.FileNotFoundException
Write the dataset in SVM-light format to the file. A strict SVM-light format will be written, where labels and features are both encoded as integers, using the label and feature indexes of this dataset. Datasets written by this method can be read by readSVMLightFormat(File).

Parameters:
file - The location where the dataset should be written.
Throws:
java.io.FileNotFoundException
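
The integer encoding can be sketched as follows: each label and feature is replaced by its position in an index, and features are emitted in increasing id order. This is an illustration, not the writer used by this class. SvmLightWriteSketch is a hypothetical class, plain Lists stand in for the label and feature indexes, and the 0-based ids are an assumption; check what the consuming tool expects.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: encode one datum as an integer-only SVM-light style line. */
class SvmLightWriteSketch {
    /** Formats "labelId featId:value ...", adding unseen names to the given index lists. */
    static String formatLine(String label, Map<String, Double> features,
                             List<String> labelIndex, List<String> featureIndex) {
        // A TreeMap keyed by feature id yields the features in increasing id order.
        TreeMap<Integer, Double> byId = new TreeMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            byId.put(idOf(featureIndex, e.getKey()), e.getValue());
        }
        StringBuilder sb = new StringBuilder().append(idOf(labelIndex, label));
        for (Map.Entry<Integer, Double> e : byId.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    /** Returns the position of name in index, appending it first if it is new. */
    private static int idOf(List<String> index, String name) {
        int i = index.indexOf(name);
        if (i < 0) {
            index.add(name);
            i = index.size() - 1;
        }
        return i;
    }
}
```

Because the ids only make sense together with the indexes that produced them, a file written this way has to be read back with the same label and feature indexes.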

printSparseFeatureMatrix

public void printSparseFeatureMatrix()
Prints the sparse feature matrix using printSparseFeatureMatrix(PrintWriter) to System.out.


printSparseFeatureMatrix

public void printSparseFeatureMatrix(java.io.PrintWriter pw)
Prints a sparse feature matrix representation of the Dataset. Prints the actual Object.toString() representations of features.


printSparseFeatureValues

public void printSparseFeatureValues(java.io.PrintWriter pw)
Prints a sparse feature-value output of the Dataset. Prints the actual Object.toString() representations of features. This is probably what you want for RVFDataset since the above two methods seem useless and unused.


printSparseFeatureValues

public void printSparseFeatureValues(int datumNo,
                                     java.io.PrintWriter pw)
Prints a sparse feature-value output of the Dataset. Prints the actual Object.toString() representations of features. This is probably what you want for RVFDataset since the above two methods seem useless and unused.


main

public static void main(java.lang.String[] args)

getValuesArray

public double[][] getValuesArray()
Specified by:
getValuesArray in class GeneralDataset<L,F>

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

toSummaryString

public java.lang.String toSummaryString()

iterator

public java.util.Iterator<RVFDatum<L,F>> iterator()

Specified by:
iterator in interface java.lang.Iterable<RVFDatum<L,F>>
Overrides:
iterator in class GeneralDataset<L,F>

randomize

public void randomize(int randomSeed)
Randomizes the data array in place. Needs to be redefined here because we need to randomize the values as well.

Overrides:
randomize in class GeneralDataset<L,F>
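
The reason the values must be shuffled too is that labels, feature ids and feature values live in separate parallel arrays, so every swap has to be applied to all of them. Below is a minimal sketch of a seeded Fisher-Yates shuffle that does this. ParallelShuffleSketch is a hypothetical class, and the actual shuffling algorithm used by randomize(int) is not documented here.

```java
import java.util.Random;

/** Hypothetical sketch: shuffle parallel arrays with one seeded permutation. */
class ParallelShuffleSketch {
    /** Shuffles in place, applying each swap to all three arrays so rows stay aligned. */
    static void shuffle(int[] labels, int[][] data, double[][] values, int randomSeed) {
        Random rand = new Random(randomSeed);
        for (int j = labels.length - 1; j > 0; j--) {
            int r = rand.nextInt(j + 1); // position in [0, j] to swap with j

            int label = labels[j];
            labels[j] = labels[r];
            labels[r] = label;

            int[] row = data[j];
            data[j] = data[r];
            data[r] = row;

            double[] rowValues = values[j];
            values[j] = values[r];
            values[r] = rowValues;
        }
    }
}
```

Using a fixed seed makes the order reproducible: two runs with the same seed on the same input produce the same permutation.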


Stanford NLP Group