edu.stanford.nlp.classify
Class GeneralDataset<L,F>

java.lang.Object
  extended by edu.stanford.nlp.classify.GeneralDataset<L,F>
Type Parameters:
L - The type of the labels in the Dataset
F - The type of the features in the Dataset
All Implemented Interfaces:
java.io.Serializable, java.lang.Iterable<RVFDatum<L,F>>
Direct Known Subclasses:
Dataset, RVFDataset

public abstract class GeneralDataset<L,F>
extends java.lang.Object
implements java.io.Serializable, java.lang.Iterable<RVFDatum<L,F>>

The purpose of this abstract class is to unify Dataset and RVFDataset.

Author:
Kristina Toutanova (kristina@cs.stanford.edu), Anna Rafferty (various refactoring with subclasses), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization), Ramesh Nallapati (nmramesh@cs.stanford.edu) (added an abstract method getDatum, July 17th, 2008)
See Also:
Serialized Form

Field Summary
protected  int[][] data
           
 Index<F> featureIndex
           
 Index<L> labelIndex
           
protected  int[] labels
           
protected  int size
           
 
Constructor Summary
GeneralDataset()
           
 
Method Summary
abstract  void add(Datum<L,F> d)
           
 void addAll(java.lang.Iterable<? extends Datum<L,F>> data)
          Adds all Datums in the given collection of data to this dataset
 void applyFeatureCountThreshold(int k)
          Applies a feature count threshold to the Dataset.
 void applyFeatureMaxCountThreshold(int k)
          Applies a max feature count threshold to the Dataset.
 void clear()
          Resets the Dataset so that it is empty and ready to collect data.
 void clear(int numDatums)
          Resets the Dataset so that it is empty and ready to collect data.
 Index<F> featureIndex()
           
 int[][] getDataArray()
           
abstract  Datum<L,F> getDatum(int index)
           
 float[] getFeatureCounts()
          Get the total count (over all data instances) of each feature
 int[] getLabelsArray()
           
abstract  RVFDatum<L,F> getRVFDatum(int index)
           
abstract  double[][] getValuesArray()
           
protected abstract  void initialize(int numDatums)
          Resets the values of the dataset so that it is empty, with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of other parts of emptying the data.
 java.util.Iterator<RVFDatum<L,F>> iterator()
           
 Index<L> labelIndex()
           
 java.util.Iterator<L> labelIterator()
          Returns an iterator over the class labels of the Dataset
 java.lang.String[] makeSvmLabelMap()
          Maps our labels to labels that are compatible with svm_light
 GeneralDataset<L,F> mapDataset(GeneralDataset<L,F> dataset)
           
<L2> GeneralDataset<L2,F>
mapDataset(GeneralDataset<L,F> dataset, Index<L2> newLabelIndex, java.util.Map<L,L2> labelMapping, L2 defaultLabel)
           
static
<L,L2,F> Datum<L2,F>
mapDatum(Datum<L,F> d, java.util.Map<L,L2> labelMapping, L2 defaultLabel)
           
 int numClasses()
           
 int numFeatures()
           
 int numFeatureTokens()
          Returns the number of feature tokens in the Dataset.
 int numFeatureTypes()
          Returns the number of distinct feature types in the Dataset.
 void printSVMLightFormat()
          Dumps the Dataset as a training/test file for SVMLight.
 void printSVMLightFormat(java.io.PrintWriter pw)
          Print SVM Light Format file.
 void randomize(int randomSeed)
          Randomizes the data array in place. Note: this cannot change the values array or the datum weights, so redefine this for RVFDataset and WeightedDataset!
 GeneralDataset<L,F> sampleDataset(int randomSeed, double sampleFrac, boolean sampleWithReplacement)
           
 int size()
          Returns the number of examples (Datums) in the Dataset.
abstract  Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double p)
           
abstract  Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start, int end)
           
abstract  void summaryStatistics()
          Print some statistics summarizing the dataset
protected  void trimData()
           
protected  void trimLabels()
           
protected  double[][] trimToSize(double[][] i)
           
protected  int[] trimToSize(int[] i)
           
protected  int[][] trimToSize(int[][] i)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

labelIndex

public Index<L> labelIndex

featureIndex

public Index<F> featureIndex

labels

protected int[] labels

data

protected int[][] data

size

protected int size
Constructor Detail

GeneralDataset

public GeneralDataset()
Method Detail

labelIndex

public Index<L> labelIndex()

featureIndex

public Index<F> featureIndex()

numFeatures

public int numFeatures()

numClasses

public int numClasses()

getLabelsArray

public int[] getLabelsArray()

getDataArray

public int[][] getDataArray()

getValuesArray

public abstract double[][] getValuesArray()

clear

public void clear()
Resets the Dataset so that it is empty and ready to collect data.


clear

public void clear(int numDatums)
Resets the Dataset so that it is empty and ready to collect data.

Parameters:
numDatums - initial capacity of dataset

initialize

protected abstract void initialize(int numDatums)
Resets the values of the dataset so that it is empty, with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of other parts of emptying the data.

Parameters:
numDatums - initial capacity of dataset

getRVFDatum

public abstract RVFDatum<L,F> getRVFDatum(int index)

getDatum

public abstract Datum<L,F> getDatum(int index)

add

public abstract void add(Datum<L,F> d)

getFeatureCounts

public float[] getFeatureCounts()
Get the total count (over all data instances) of each feature

Returns:
an array containing the counts (indexed by index)
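
As an illustration of what "total count over all data instances" means, here is a minimal stand-alone sketch. It does not call the library; it assumes each row of an int[][] array like the data field holds the feature indices present in one datum, each counting once. The class name FeatureCountSketch and its parameters are hypothetical.

```java
import java.util.Arrays;

final class FeatureCountSketch {

  /**
   * Counts how often each feature index appears in the first `size` rows.
   * Assumption: every occurrence of an index contributes a count of 1.
   */
  static float[] featureCounts(int[][] data, int size, int numFeatures) {
    float[] counts = new float[numFeatures];
    for (int i = 0; i < size; i++) {
      for (int featureId : data[i]) {
        counts[featureId] += 1.0f;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Three datums over three features (indices 0, 1, 2).
    int[][] data = { {0, 1}, {1, 2}, {1} };
    // prints [1.0, 3.0, 1.0]
    System.out.println(Arrays.toString(featureCounts(data, data.length, 3)));
  }
}
```

The returned array is indexed by feature index, matching the "indexed by index" wording above.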

applyFeatureCountThreshold

public void applyFeatureCountThreshold(int k)
Applies a feature count threshold to the Dataset. All features that occur fewer than k times are expunged.
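
The thresholding described above can be sketched without the library as follows. This is an illustration under stated assumptions, not the actual implementation: a List<String> stands in for the feature Index, rows of int[][] hold feature indices, and surviving features are renumbered densely. The names CountThresholdSketch and applyThreshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CountThresholdSketch {

  /**
   * Returns a copy of data in which features occurring fewer than k times
   * are dropped and the surviving features are renumbered from 0.
   * keptFeatures is cleared and then filled with the surviving feature names.
   */
  static int[][] applyThreshold(int[][] data, List<String> features, int k,
                                List<String> keptFeatures) {
    // Count occurrences of each feature index over all rows.
    int[] counts = new int[features.size()];
    for (int[] row : data) {
      for (int f : row) {
        counts[f]++;
      }
    }
    // Assign new ids to features that meet the threshold; -1 marks removal.
    keptFeatures.clear();
    final int[] newId = new int[features.size()];
    for (int f = 0; f < counts.length; f++) {
      if (counts[f] >= k) {
        newId[f] = keptFeatures.size();
        keptFeatures.add(features.get(f));
      } else {
        newId[f] = -1;
      }
    }
    // Rewrite every row with the new ids, skipping removed features.
    int[][] out = new int[data.length][];
    for (int i = 0; i < data.length; i++) {
      out[i] = Arrays.stream(data[i]).map(f -> newId[f]).filter(id -> id >= 0).toArray();
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] data = { {0, 1}, {1, 2}, {1} };
    List<String> kept = new ArrayList<>();
    int[][] out = applyThreshold(data, Arrays.asList("a", "b", "c"), 2, kept);
    // Only "b" occurs at least twice, so it becomes feature 0.
    System.out.println(kept + " " + Arrays.deepToString(out));
  }
}
```

Under the same assumptions, a max-count variant would keep a feature when counts[f] <= k instead.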


applyFeatureMaxCountThreshold

public void applyFeatureMaxCountThreshold(int k)
Applies a max feature count threshold to the Dataset. All features that occur more than k times are expunged.


numFeatureTokens

public int numFeatureTokens()
Returns the number of feature tokens in the Dataset.


numFeatureTypes

public int numFeatureTypes()
Returns the number of distinct feature types in the Dataset.
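
The difference between feature tokens and feature types can be shown with a small stand-alone sketch. It assumes a token is one occurrence of a feature in one datum and a type is one distinct feature index; the library may well compute the type count from its feature index rather than by scanning the data. The class and method bodies below are hypothetical illustrations.

```java
import java.util.HashSet;
import java.util.Set;

final class TokenTypeSketch {

  /** Total number of feature occurrences in the first `size` rows. */
  static int numFeatureTokens(int[][] data, int size) {
    int tokens = 0;
    for (int i = 0; i < size; i++) {
      tokens += data[i].length;
    }
    return tokens;
  }

  /** Number of distinct feature indices seen in the first `size` rows. */
  static int numFeatureTypes(int[][] data, int size) {
    Set<Integer> seen = new HashSet<>();
    for (int i = 0; i < size; i++) {
      for (int f : data[i]) {
        seen.add(f);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    int[][] data = { {0, 1}, {1, 2}, {1} };
    // prints "5 tokens, 3 types"
    System.out.println(numFeatureTokens(data, data.length) + " tokens, "
        + numFeatureTypes(data, data.length) + " types");
  }
}
```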


addAll

public void addAll(java.lang.Iterable<? extends Datum<L,F>> data)
Adds all Datums in the given collection of data to this dataset

Parameters:
data - collection of datums you would like to add to the dataset

split

public abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start,
                                                                    int end)

split

public abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double p)

size

public int size()
Returns the number of examples (Datums) in the Dataset.


trimData

protected void trimData()

trimLabels

protected void trimLabels()

trimToSize

protected int[] trimToSize(int[] i)

trimToSize

protected int[][] trimToSize(int[][] i)

trimToSize

protected double[][] trimToSize(double[][] i)

randomize

public void randomize(int randomSeed)
Randomizes the data array in place. Note: this cannot change the values array or the datum weights, so redefine this for RVFDataset and WeightedDataset!

Parameters:
randomSeed - seed for the random number generator
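
A seeded in-place shuffle of this kind might look like the sketch below. It is an assumption-laden illustration, not the library's code: it uses a Fisher-Yates shuffle and swaps the labels array in step with the data array so that each row keeps its label. It also shows why the note above matters: any further parallel array (values, weights) that is not swapped in the same loop would fall out of alignment. RandomizeSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Random;

final class RandomizeSketch {

  /**
   * Shuffles the first `size` entries of data and labels in place.
   * The same seed always produces the same permutation.
   */
  static void randomize(int[][] data, int[] labels, int size, int randomSeed) {
    Random rand = new Random(randomSeed);
    for (int j = size - 1; j > 0; j--) {
      int i = rand.nextInt(j + 1); // pick a position in [0, j]
      int[] tmpRow = data[i];
      data[i] = data[j];
      data[j] = tmpRow;
      int tmpLabel = labels[i];
      labels[i] = labels[j];
      labels[j] = tmpLabel;
      // A subclass with a parallel values or weights array would need
      // to perform the same swap on that array here.
    }
  }

  public static void main(String[] args) {
    int[][] data = { {0}, {1}, {2}, {3} };
    int[] labels = { 0, 1, 2, 3 };
    randomize(data, labels, labels.length, 42);
    // Each row still sits next to its original label after shuffling.
    System.out.println(Arrays.deepToString(data) + " " + Arrays.toString(labels));
  }
}
```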

sampleDataset

public GeneralDataset<L,F> sampleDataset(int randomSeed,
                                         double sampleFrac,
                                         boolean sampleWithReplacement)

summaryStatistics

public abstract void summaryStatistics()
Print some statistics summarizing the dataset


labelIterator

public java.util.Iterator<L> labelIterator()
Returns an iterator over the class labels of the Dataset

Returns:
An iterator over the class labels of the Dataset

mapDataset

public GeneralDataset<L,F> mapDataset(GeneralDataset<L,F> dataset)
Parameters:
dataset - the dataset to map
Returns:
a new GeneralDataset whose features and ids map exactly to those of this GeneralDataset. Useful when two Datasets are created independently and one wants to train a model on one dataset and test on the other. -Ramesh.

mapDatum

public static <L,L2,F> Datum<L2,F> mapDatum(Datum<L,F> d,
                                            java.util.Map<L,L2> labelMapping,
                                            L2 defaultLabel)

mapDataset

public <L2> GeneralDataset<L2,F> mapDataset(GeneralDataset<L,F> dataset,
                                            Index<L2> newLabelIndex,
                                            java.util.Map<L,L2> labelMapping,
                                            L2 defaultLabel)
Parameters:
dataset - the dataset to map
Returns:
a new GeneralDataset whose features and ids map exactly to those of this GeneralDataset, but whose labels are converted to another set of labels
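
The label conversion can be pictured with the following stand-alone sketch. It assumes that a label absent from labelMapping falls back to defaultLabel, which the parameter names suggest but this page does not state. LabelMappingSketch and mapLabel are hypothetical names, and the example labels are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelMappingSketch {

  /** Looks up the new label, using defaultLabel when no mapping exists. */
  static <L, L2> L2 mapLabel(L label, Map<L, L2> labelMapping, L2 defaultLabel) {
    L2 mapped = labelMapping.get(label);
    return mapped != null ? mapped : defaultLabel;
  }

  public static void main(String[] args) {
    Map<String, String> mapping = new HashMap<>();
    mapping.put("PERSON", "ENTITY");
    mapping.put("LOCATION", "ENTITY");
    System.out.println(mapLabel("PERSON", mapping, "OTHER")); // prints ENTITY
    System.out.println(mapLabel("DATE", mapping, "OTHER"));   // prints OTHER
  }
}
```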

printSVMLightFormat

public void printSVMLightFormat()
Dumps the Dataset as a training/test file for SVMLight.
Each line has the format: class [fno:val]+
The features must occur in consecutive order.
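
To make the line format concrete, here is a minimal stand-alone sketch that builds one line of the form class [fno:val]+. It does not use the library. It reads "consecutive order" as ascending feature number and represents a datum as a map from feature number to value; both are assumptions, and SvmLightLineSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class SvmLightLineSketch {

  /** Formats one example as "label fno:val fno:val ..." with fno ascending. */
  static String formatLine(String svmLabel, Map<Integer, Double> features) {
    // A TreeMap iterates its keys in ascending order.
    TreeMap<Integer, Double> sorted = new TreeMap<>(features);
    StringBuilder sb = new StringBuilder(svmLabel);
    for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
      sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<Integer, Double> features = new HashMap<>();
    features.put(3, 0.5);
    features.put(1, 1.0);
    // prints "+1 1:1.0 3:0.5"
    System.out.println(formatLine("+1", features));
  }
}
```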


makeSvmLabelMap

public java.lang.String[] makeSvmLabelMap()
Maps our labels to labels that are compatible with svm_light

Returns:
array of strings

printSVMLightFormat

public void printSVMLightFormat(java.io.PrintWriter pw)
Prints the Dataset in SVM Light format. Note: the following description is no longer applicable, because the exact labelID is now printed for each example. -Ramesh (nmramesh@cs.stanford.edu) 12/17/2009. Previously: if the Dataset has more than 2 classes, it prints using the label index (+1) (for svm_struct); if it has 2 classes, labelIndex.get(0) is mapped to +1 and labelIndex.get(1) is mapped to -1 (for svm_light).


iterator

public java.util.Iterator<RVFDatum<L,F>> iterator()
Specified by:
iterator in interface java.lang.Iterable<RVFDatum<L,F>>


Stanford NLP Group