edu.stanford.nlp.classify
Class Dataset<L,F>

java.lang.Object
  extended by edu.stanford.nlp.classify.GeneralDataset<L,F>
      extended by edu.stanford.nlp.classify.Dataset<L,F>
Type Parameters:
L - Label type
F - Feature type
All Implemented Interfaces:
Serializable, Iterable<RVFDatum<L,F>>
Direct Known Subclasses:
WeightedDataset

public class Dataset<L,F>
extends GeneralDataset<L,F>

An interfacing class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of Datum objects, for the purpose of training a Classifier.
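
The memory saving described above comes from storing each datum as integer indices rather than as feature and label objects. The following is a minimal sketch of that idea using only java.util. The class IndexedDataSketch and its String-keyed maps are hypothetical stand-ins, not the library's implementation; only the field names data, labels and size echo the inherited fields listed in the Field Summary.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an indexed dataset; not the Stanford implementation.
class IndexedDataSketch {
    // Each distinct feature or label is mapped to a small integer id.
    private final Map<String, Integer> featureIndex = new HashMap<>();
    private final Map<String, Integer> labelIndex = new HashMap<>();
    int[][] data = new int[2][];   // data[i] holds the feature ids of datum i
    int[] labels = new int[2];     // labels[i] holds the label id of datum i
    int size = 0;

    // Returns the id of key, assigning the next free id the first time it is seen.
    private static int indexOf(Map<String, Integer> index, String key) {
        Integer id = index.get(key);
        if (id == null) {
            id = index.size();
            index.put(key, id);
        }
        return id;
    }

    void add(Collection<String> features, String label) {
        if (size == labels.length) {           // grow the backing arrays when full
            data = Arrays.copyOf(data, size * 2);
            labels = Arrays.copyOf(labels, size * 2);
        }
        int[] ids = new int[features.size()];
        int k = 0;
        for (String f : features) {
            ids[k++] = indexOf(featureIndex, f);
        }
        data[size] = ids;
        labels[size] = indexOf(labelIndex, label);
        size++;
    }

    int numFeatures() { return featureIndex.size(); }
    int numClasses() { return labelIndex.size(); }
}
```

Adding datums one at a time only appends small int arrays, which is what makes the incremental build cheap compared with keeping a List of Datum objects.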

Author:
Roger Levy (rog@stanford.edu), Anna Rafferty (various refactoring with GeneralDataset/RVFDataset), Sarah Spikes (sdspikes@cs.stanford.edu) (templatization), nmramesh@cs.stanford.edu (getL1NormalizedTFIDFDataset())
See Also:
Serialized Form

Field Summary
 
Fields inherited from class edu.stanford.nlp.classify.GeneralDataset
data, featureIndex, labelIndex, labels, size
 
Constructor Summary
Dataset()
           
Dataset(Index<F> featureIndex, Index<L> labelIndex)
           
Dataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data)
          Constructor that fully specifies a Dataset.
Dataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data, int size)
          Constructor that fully specifies a Dataset.
Dataset(int numDatums)
           
Dataset(int numDatums, Index<F> featureIndex, Index<L> labelIndex)
           
 
Method Summary
 void add(Collection<F> features, L label)
           
 void add(Collection<F> features, L label, boolean addNewFeatures)
           
 void add(Datum<L,F> d)
           
 void add(int[] features, int label)
          Adds a datum defined by feature indices and a label index. Careful with this one: make sure that all indices are valid!
protected  void addFeatureIndices(int[] features)
           
protected  void addFeatures(Collection<F> features)
           
protected  void addFeatures(Collection<F> features, boolean addNewFeatures)
           
protected  void addLabel(L label)
           
protected  void addLabelIndex(int label)
           
 void applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds)
          Applies feature count thresholds to the Dataset.
 void changeFeatureIndex(Index<F> newFeatureIndex)
           
 void changeLabelIndex(Index<L> newLabelIndex)
           
protected  void ensureSize()
           
 Datum<L,F> getDatum(int index)
           
 Counter<F> getFeatureCounter()
          Gets the number of datums in which each feature appears.
 double[] getInformationGains()
           
 RVFDataset<L,F> getL1NormalizedTFIDFDataset()
          Converts this dataset to an RVFDataset using L1-normalized TF-IDF features.
 RVFDatum<L,F> getL1NormalizedTFIDFDatum(Datum<L,F> datum, Counter<F> featureDocCounts)
          Converts features from counts to L1-normalized TF-IDF features.
 Dataset<L,F> getRandomSubDataset(double p, int seed)
           
 RVFDatum<L,F> getRVFDatum(int index)
           
 double[][] getValuesArray()
           
protected  void initialize(int numDatums)
          This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums. It should be accessed only by appropriate methods within the class, such as clear(), which take care of other parts of the emptying of data.
 void printFullFeatureMatrix(PrintWriter pw)
          Prints the full feature matrix in tab-delimited form.
 void printSparseFeatureMatrix()
          Prints the sparse feature matrix to System.out using printSparseFeatureMatrix(PrintWriter).
 void printSparseFeatureMatrix(PrintWriter pw)
          Prints a sparse feature matrix representation of the Dataset.
static void printSVMLightFormat(PrintWriter pw, ClassicCounter<Integer> c, int classNo)
          Writes the counter in SVM light format; the counter needs to be sorted by feature key before it is dumped.
static Dataset<String,String> readSVMLightFormat(String filename)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset<String,String> readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset<String,String> readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex, List<String> lines)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset<String,String> readSVMLightFormat(String filename, List<String> lines)
          Constructs a Dataset by reading in a file in SVM light format.
 void selectFeatures(int numFeatures, double[] scores)
          Generic method to select features based on the feature scores vector provided as an argument.
 void selectFeaturesBinaryInformationGain(int numFeatures)
           
 Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
           
 Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start, int end)
           
 void summaryStatistics()
          Prints some summary statistics to stderr for the Dataset.
static Datum<String,String> svmLightLineToDatum(String l)
           
 String toString()
           
 String toSummaryStatistics()
           
 String toSummaryString()
           
 void updateLabels(int[] labels)
           
 
Methods inherited from class edu.stanford.nlp.classify.GeneralDataset
addAll, applyFeatureCountThreshold, applyFeatureMaxCountThreshold, clear, clear, featureIndex, getDataArray, getFeatureCounts, getLabelsArray, iterator, labelIndex, labelIterator, makeSvmLabelMap, mapDataset, mapDataset, mapDatum, numClasses, numFeatures, numFeatureTokens, numFeatureTypes, printSVMLightFormat, printSVMLightFormat, randomize, sampleDataset, size, trimData, trimLabels, trimToSize, trimToSize, trimToSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Dataset

public Dataset()

Dataset

public Dataset(int numDatums,
               Index<F> featureIndex,
               Index<L> labelIndex)

Dataset

public Dataset(Index<F> featureIndex,
               Index<L> labelIndex)

Dataset

public Dataset(int numDatums)

Dataset

public Dataset(Index<L> labelIndex,
               int[] labels,
               Index<F> featureIndex,
               int[][] data)
Constructor that fully specifies a Dataset. Needed for MulticlassDataset.


Dataset

public Dataset(Index<L> labelIndex,
               int[] labels,
               Index<F> featureIndex,
               int[][] data,
               int size)
Constructor that fully specifies a Dataset. Needed for MulticlassDataset.

Method Detail

split

public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
Specified by:
split in class GeneralDataset<L,F>

split

public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start,
                                                           int end)
Specified by:
split in class GeneralDataset<L,F>

getRandomSubDataset

public Dataset<L,F> getRandomSubDataset(double p,
                                        int seed)

getValuesArray

public double[][] getValuesArray()
Specified by:
getValuesArray in class GeneralDataset<L,F>

readSVMLightFormat

public static Dataset<String,String> readSVMLightFormat(String filename)
Constructs a Dataset by reading in a file in SVM light format.


readSVMLightFormat

public static Dataset<String,String> readSVMLightFormat(String filename,
                                                        List<String> lines)
Constructs a Dataset by reading in a file in SVM light format. The lines parameter is filled with the lines of the file for further processing; if lines is null, it is assumed that no line information is desired.


readSVMLightFormat

public static Dataset<String,String> readSVMLightFormat(String filename,
                                                        Index<String> featureIndex,
                                                        Index<String> labelIndex)
Constructs a Dataset by reading in a file in SVM light format. The created dataset uses the given feature and label indices.


readSVMLightFormat

public static Dataset<String,String> readSVMLightFormat(String filename,
                                                        Index<String> featureIndex,
                                                        Index<String> labelIndex,
                                                        List<String> lines)
Constructs a Dataset by reading in a file in SVM light format. The created dataset uses the given feature and label indices.


svmLightLineToDatum

public static Datum<String,String> svmLightLineToDatum(String l)
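
For orientation, a line in SVM light format has the layout "<label> <feature>:<value> ... # optional comment". The sketch below parses one such line with plain Java. SvmLightLineSketch is a hypothetical helper written for illustration; it does not reproduce how svmLightLineToDatum builds its Datum, and in particular how the library treats feature values is not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the SVM light line layout; not the library's parser.
class SvmLightLineSketch {
    final String label;
    final Map<String, Double> features = new LinkedHashMap<>();

    private SvmLightLineSketch(String label) { this.label = label; }

    // Parses "<label> <feature>:<value> ... # optional comment".
    static SvmLightLineSketch parse(String line) {
        int hash = line.indexOf('#');
        // Drop the trailing comment, if any, before tokenizing.
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] tokens = body.split("\\s+");
        SvmLightLineSketch parsed = new SvmLightLineSketch(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Bad feature token: " + tokens[i]);
            }
            String name = tokens[i].substring(0, colon);
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            parsed.features.put(name, value);
        }
        return parsed;
    }
}
```

For example, parsing "+1 3:1 7:2.5 # a comment" yields the label "+1" and two features, "3" and "7".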

getFeatureCounter

public Counter<F> getFeatureCounter()
Gets the number of datums in which each feature appears.
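
The quantity described here is a per-feature document count: a feature repeated inside one datum still counts once for that datum. A minimal sketch of that counting, using a plain Map in place of Counter (FeatureDocCountSketch and docCounts are hypothetical names, not library API):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-feature datum counts; not the library's code.
class FeatureDocCountSketch {
    // Counts, for each feature, the number of datums it appears in at least once.
    static <F> Map<F, Integer> docCounts(List<? extends Collection<F>> datums) {
        Map<F, Integer> counts = new HashMap<>();
        for (Collection<F> features : datums) {
            // The set removes repeats so each datum contributes at most 1 per feature.
            for (F f : new HashSet<>(features)) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```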


getL1NormalizedTFIDFDatum

public RVFDatum<L,F> getL1NormalizedTFIDFDatum(Datum<L,F> datum,
                                               Counter<F> featureDocCounts)
Converts features from counts to L1-normalized TF-IDF features.

Parameters:
datum - a datum with a collection of features.
featureDocCounts - a counter of document counts for each feature.
Returns:
an RVFDatum with L1-normalized TF-IDF features.
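
As a rough illustration of the conversion, the sketch below weights each feature by tf * ln(numDocs / docCount) and then divides every weight by the sum of absolute weights, so that the result has L1 norm 1. The exact TF and IDF variants used by the library are not stated on this page, so this weighting is an assumption. TfIdfSketch, l1NormalizedTfIdf and the explicit numDocs parameter are hypothetical, and a Map stands in for Counter and RVFDatum.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of L1-normalized TF-IDF; not the library's code.
class TfIdfSketch {
    // Assumed weighting: tf(f) * ln(numDocs / docCount(f)), then divide by the L1 norm.
    static <F> Map<F, Double> l1NormalizedTfIdf(Collection<F> features,
                                                Map<F, Integer> featureDocCounts,
                                                int numDocs) {
        Map<F, Double> tf = new HashMap<>();
        for (F f : features) {
            tf.merge(f, 1.0, Double::sum);   // raw term frequency within this datum
        }
        Map<F, Double> weights = new HashMap<>();
        double l1 = 0.0;
        for (Map.Entry<F, Double> e : tf.entrySet()) {
            Integer df = featureDocCounts.get(e.getKey());
            if (df == null || df == 0) continue;   // feature with no document count: skip
            double w = e.getValue() * Math.log((double) numDocs / df);
            weights.put(e.getKey(), w);
            l1 += Math.abs(w);
        }
        if (l1 > 0.0) {
            for (Map.Entry<F, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / l1);   // weights now sum to 1 in absolute value
            }
        }
        return weights;
    }
}
```

With features [a, a, b], document counts a=1 and b=2, and numDocs=4, the unnormalized weights are 2 ln 4 and ln 2, which normalize to 0.8 and 0.2.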

getL1NormalizedTFIDFDataset

public RVFDataset<L,F> getL1NormalizedTFIDFDataset()
Converts this dataset to an RVFDataset using L1-normalized TF-IDF features.

Returns:
RVFDataset

add

public void add(Datum<L,F> d)
Specified by:
add in class GeneralDataset<L,F>

add

public void add(Collection<F> features,
                L label)

add

public void add(Collection<F> features,
                L label,
                boolean addNewFeatures)

add

public void add(int[] features,
                int label)
Adds a datum defined by feature indices and a label index. Careful with this one: make sure that all indices are valid!

Parameters:
features - the feature indices of the datum
label - the label index of the datum

ensureSize

protected void ensureSize()

addLabel

protected void addLabel(L label)

addLabelIndex

protected void addLabelIndex(int label)

addFeatures

protected void addFeatures(Collection<F> features)

addFeatures

protected void addFeatures(Collection<F> features,
                           boolean addNewFeatures)

addFeatureIndices

protected void addFeatureIndices(int[] features)

initialize

protected final void initialize(int numDatums)
Description copied from class: GeneralDataset
This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums. It should be accessed only by appropriate methods within the class, such as clear(), which take care of other parts of the emptying of data.

Specified by:
initialize in class GeneralDataset<L,F>
Parameters:
numDatums - initial capacity of dataset

getDatum

public Datum<L,F> getDatum(int index)
Specified by:
getDatum in class GeneralDataset<L,F>
Returns:
the datum at the given index

getRVFDatum

public RVFDatum<L,F> getRVFDatum(int index)
Specified by:
getRVFDatum in class GeneralDataset<L,F>
Returns:
the datum at the given index, as an RVFDatum

summaryStatistics

public void summaryStatistics()
Prints some summary statistics to stderr for the Dataset.

Specified by:
summaryStatistics in class GeneralDataset<L,F>

toSummaryStatistics

public String toSummaryStatistics()

applyFeatureCountThreshold

public void applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds)
Applies feature count thresholds to the Dataset. Only features that match pattern_i and occur at least threshold_i times (for some i) are kept.

Parameters:
thresholds - a list of (pattern, threshold) pairs
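
The keep rule stated above can be sketched on a plain map of feature counts. FeatureThresholdSketch and its nested Rule class are hypothetical: Rule stands in for Pair<Pattern,Integer>, and the use of a whole-string match (Matcher.matches()) is an assumption, since this page does not say which matching mode the library applies.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of pattern-based count thresholds; not the library's code.
class FeatureThresholdSketch {
    // Stand-in for a (pattern, threshold) pair.
    static final class Rule {
        final Pattern pattern;
        final int threshold;
        Rule(Pattern pattern, int threshold) {
            this.pattern = pattern;
            this.threshold = threshold;
        }
    }

    // Keeps a feature if, for some rule i, it matches pattern_i and count >= threshold_i.
    static Set<String> keptFeatures(Map<String, Integer> counts, List<Rule> rules) {
        Set<String> kept = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (Rule r : rules) {
                // Assumption: whole-string match; the library may match differently.
                if (r.pattern.matcher(e.getKey()).matches() && e.getValue() >= r.threshold) {
                    kept.add(e.getKey());
                    break;
                }
            }
        }
        return kept;
    }
}
```

Note that under this rule a feature matching none of the patterns is dropped regardless of how often it occurs.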

printFullFeatureMatrix

public void printFullFeatureMatrix(PrintWriter pw)
Prints the full feature matrix in tab-delimited form. These can be BIG matrices, so be careful!


printSparseFeatureMatrix

public void printSparseFeatureMatrix()
Prints the sparse feature matrix to System.out using printSparseFeatureMatrix(PrintWriter).


printSparseFeatureMatrix

public void printSparseFeatureMatrix(PrintWriter pw)
Prints a sparse feature matrix representation of the Dataset. Prints the actual Object.toString() representations of features.


changeLabelIndex

public void changeLabelIndex(Index<L> newLabelIndex)

changeFeatureIndex

public void changeFeatureIndex(Index<F> newFeatureIndex)

selectFeaturesBinaryInformationGain

public void selectFeaturesBinaryInformationGain(int numFeatures)

selectFeatures

public void selectFeatures(int numFeatures,
                           double[] scores)
Generic method to select features based on the feature scores vector provided as an argument.

Parameters:
numFeatures - number of features to be selected.
scores - a vector whose length equals the total number of features in the data.
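
One plausible reading of score-based selection is to keep the numFeatures features with the highest scores. The sketch below computes which feature indices would survive under that reading. TopFeaturesSketch and selectTop are hypothetical names, and both the "highest scores win" rule and the tie handling are assumptions rather than documented behavior.

```java
import java.util.Arrays;

// Hypothetical illustration of top-k feature selection; not the library's code.
class TopFeaturesSketch {
    // Returns the indices of the numFeatures highest scores, in ascending index order.
    static int[] selectTop(int numFeatures, double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort feature indices by descending score; the sort is stable, so ties keep index order.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int k = Math.min(numFeatures, scores.length);
        int[] top = new int[k];
        for (int i = 0; i < k; i++) {
            top[i] = order[i];
        }
        Arrays.sort(top);
        return top;
    }
}
```

In a real dataset the surviving indices would then have to be remapped to a new, smaller feature index, which this sketch leaves out.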

getInformationGains

public double[] getInformationGains()

updateLabels

public void updateLabels(int[] labels)

toString

public String toString()
Overrides:
toString in class Object

toSummaryString

public String toSummaryString()

printSVMLightFormat

public static void printSVMLightFormat(PrintWriter pw,
                                       ClassicCounter<Integer> c,
                                       int classNo)
Writes the counter in SVM light format; the counter needs to be sorted by feature key before it is dumped.



Stanford NLP Group