GeneralDataset (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.classify.GeneralDataset<L,F>

Type Parameters:

L - The type of the labels in the Dataset

F - The type of the features in the Dataset

All Implemented Interfaces:

java.io.Serializable, java.lang.Iterable<RVFDatum<L,F>>

Direct Known Subclasses:

Dataset, RVFDataset
```
public abstract class GeneralDataset<L,F>
extends java.lang.Object
implements java.io.Serializable, java.lang.Iterable<RVFDatum<L,F>>
```
The purpose of this abstract class is to unify Dataset and RVFDataset.
Note: Although these are value classes, no equals() or hashCode() methods are currently defined, so you get the default implementations from Object. Two distinct objects are therefore never equal, even when their contents match.
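
The consequence of the note above can be sketched with a small stand-in class. TinyDataset and sameContent are hypothetical names invented for this illustration and are not part of the API; the sketch only assumes that a class without its own equals() falls back to identity comparison.

```java
import java.util.Arrays;

// Hypothetical stand-in for a dataset-like class that, as described above,
// does not override equals() or hashCode().
final class TinyDataset {
    final int[] labels;

    TinyDataset(int... labels) {
        this.labels = labels.clone();
    }

    // Assumed helper: a content comparison that callers would write themselves.
    static boolean sameContent(TinyDataset a, TinyDataset b) {
        return Arrays.equals(a.labels, b.labels);
    }
}
```

With two instances built from the same labels, equals() returns false while the hand-written sameContent returns true, so such objects should not be relied on as keys in a HashMap or members of a HashSet.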

Author:

Kristina Toutanova (kristina@cs.stanford.edu), Anna Rafferty (various refactoring with subclasses), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization), Ramesh Nallapati (nmramesh@cs.stanford.edu) (added an abstract method getDatum, July 17th, 2008)

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected int[][]`	`data`
`Index<F>`	`featureIndex`
`Index<L>`	`labelIndex`
`protected int[]`	`labels`
`protected int`	`size`

Constructor Summary

Constructors
Constructor and Description
`GeneralDataset()`

Method Summary

Modifier and Type	Method and Description
`abstract void`	`add(Datum<L,F> d)`
`void`	`addAll(java.lang.Iterable<? extends Datum<L,F>> data)` Adds all Datums in the given collection of data to this dataset
`void`	`applyFeatureCountThreshold(int k)` Applies a feature count threshold to the Dataset.
`void`	`applyFeatureMaxCountThreshold(int k)` Applies a max feature count threshold to the Dataset.
`void`	`clear()` Resets the Dataset so that it is empty and ready to collect data.
`void`	`clear(int numDatums)` Resets the Dataset so that it is empty and ready to collect data.
`Index<F>`	`featureIndex()`
`int[][]`	`getDataArray()`
`abstract Datum<L,F>`	`getDatum(int index)`
`float[]`	`getFeatureCounts()` Get the total count (over all data instances) of each feature
`int[]`	`getLabelsArray()`
`abstract RVFDatum<L,F>`	`getRVFDatum(int index)`
`abstract double[][]`	`getValuesArray()`
`protected abstract void`	`initialize(int numDatums)` This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums.
`java.util.Iterator<RVFDatum<L,F>>`	`iterator()`
`Index<L>`	`labelIndex()`
`java.util.Iterator<L>`	`labelIterator()` Returns an iterator over the class labels of the Dataset
`java.lang.String[]`	`makeSvmLabelMap()` Maps our labels to labels that are compatible with svm_light
`GeneralDataset<L,F>`	`mapDataset(GeneralDataset<L,F> dataset)`
`<L2> GeneralDataset<L2,F>`	`mapDataset(GeneralDataset<L,F> dataset, Index<L2> newLabelIndex, java.util.Map<L,L2> labelMapping, L2 defaultLabel)`
`static <L,L2,F> Datum<L2,F>`	`mapDatum(Datum<L,F> d, java.util.Map<L,L2> labelMapping, L2 defaultLabel)`
`int`	`numClasses()`
`ClassicCounter<L>`	`numDatumsPerLabel()`
`int`	`numFeatures()`
`int`	`numFeatureTokens()` Returns the number of feature tokens in the Dataset.
`int`	`numFeatureTypes()` Returns the number of distinct feature types in the Dataset.
`abstract void`	`printSparseFeatureMatrix()` Prints the sparse feature matrix using `printSparseFeatureMatrix(PrintWriter)` to `System.out`.
`abstract void`	`printSparseFeatureMatrix(java.io.PrintWriter pw)` Prints a sparse feature matrix representation of the Dataset.
`void`	`printSVMLightFormat()` Dumps the Dataset as a training/test file for SVMLight.
`void`	`printSVMLightFormat(java.io.PrintWriter pw)` Print SVM Light Format file.
`void`	`randomize(long randomSeed)` Randomizes the data array in place.
`void`	`retainFeatures(java.util.Set<F> features)` Retains the given features in the Dataset.
`GeneralDataset<L,F>`	`sampleDataset(long randomSeed, double sampleFrac, boolean sampleWithReplacement)`
`<E> void`	`shuffleWithSideInformation(long randomSeed, java.util.List<E> sideInformation)` Randomizes the data array in place.
`int`	`size()` Returns the number of examples (`Datum`s) in the Dataset.
`abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>>`	`split(double fractionSplit)` Divide out a (devtest) split from the start of the dataset and the rest of it (as a training set).
`abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>>`	`split(int start, int end)` Divide out a (devtest) split of the dataset versus the rest of it (as a training set).
`Pair<GeneralDataset<L,F>,GeneralDataset<L,F>>`	`splitOutFold(int fold, int numFolds)` Divide out a (devtest) split of the dataset versus the rest of it (as a training set).
`abstract void`	`summaryStatistics()` Print some statistics summarizing the dataset
`protected void`	`trimData()`
`protected void`	`trimLabels()`
`protected double[][]`	`trimToSize(double[][] i)`
`protected int[]`	`trimToSize(int[] i)`
`protected int[][]`	`trimToSize(int[][] i)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

- Field Detail
  - labelIndex
```
public Index<L> labelIndex
```
  - featureIndex
```
public Index<F> featureIndex
```
  - labels
```
protected int[] labels
```
  - data
```
protected int[][] data
```
  - size
```
protected int size
```
- Constructor Detail
  - GeneralDataset
```
public GeneralDataset()
```
- Method Detail
  - labelIndex
```
public Index<L> labelIndex()
```
  - featureIndex
```
public Index<F> featureIndex()
```
  - numFeatures
```
public int numFeatures()
```
  - numClasses
```
public int numClasses()
```
  - getLabelsArray
```
public int[] getLabelsArray()
```
  - getDataArray
```
public int[][] getDataArray()
```
  - getValuesArray
```
public abstract double[][] getValuesArray()
```
  - clear
```
public void clear()
```
    Resets the Dataset so that it is empty and ready to collect data.
  - clear
```
public void clear(int numDatums)
```
    Resets the Dataset so that it is empty and ready to collect data.
    
    Parameters:
    
    numDatums - initial capacity of dataset
  - initialize
```
protected abstract void initialize(int numDatums)
```
    This method takes care of resetting values of the dataset such that it is empty with an initial capacity of numDatums. Should be accessed only by appropriate methods within the class, such as clear(), which take care of other parts of the emptying of data.
    
    Parameters:
    
    numDatums - initial capacity of dataset
  - getRVFDatum
```
public abstract RVFDatum<L,F> getRVFDatum(int index)
```
  - getDatum
```
public abstract Datum<L,F> getDatum(int index)
```
  - add
```
public abstract void add(Datum<L,F> d)
```
  - getFeatureCounts
```
public float[] getFeatureCounts()
```
    Get the total count (over all data instances) of each feature
    
    Returns:
    
    an array containing the counts (indexed by index)
  - applyFeatureCountThreshold
```
public void applyFeatureCountThreshold(int k)
```
    Applies a feature count threshold to the Dataset. All features that occur fewer than k times are expunged.
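
    The thresholding described above can be sketched on a plain int[][] array in which each row holds the feature ids of one datum. This is an illustration under stated assumptions, not the library's implementation: FeatureThresholdSketch, counts and applyMinCount are hypothetical names, and the sketch leaves surviving ids unchanged instead of rebuilding a feature index.

```java
import java.util.Arrays;

final class FeatureThresholdSketch {
    // Counts how often each feature id in [0, numFeatures) occurs over all rows.
    static int[] counts(int[][] data, int numFeatures) {
        int[] c = new int[numFeatures];
        for (int[] row : data) {
            for (int f : row) {
                c[f]++;
            }
        }
        return c;
    }

    // Returns a copy of data without the feature ids that occur fewer than k times.
    // Assumption: surviving ids keep their old values (no re-indexing).
    static int[][] applyMinCount(int[][] data, int numFeatures, int k) {
        int[] c = counts(data, numFeatures);
        int[][] out = new int[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = Arrays.stream(data[i]).filter(f -> c[f] >= k).toArray();
        }
        return out;
    }
}
```

    For rows {0,1}, {0,2}, {0,1} and k = 2, feature 2 occurs only once and is removed from the second row; features 0 and 1 survive.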
  - retainFeatures
```
public void retainFeatures(java.util.Set<F> features)
```
    Retains the given features in the Dataset. All features that do not occur in features are expunged.
  - applyFeatureMaxCountThreshold
```
public void applyFeatureMaxCountThreshold(int k)
```
    Applies a max feature count threshold to the Dataset. All features that occur more than k times are expunged.
  - numFeatureTokens
```
public int numFeatureTokens()
```
    Returns the number of feature tokens in the Dataset.
  - numFeatureTypes
```
public int numFeatureTypes()
```
    Returns the number of distinct feature types in the Dataset.
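
    The difference between feature tokens and feature types can be sketched as follows, assuming each row of an int[][] array lists the feature ids of one datum. FeatureTokenTypeSketch and its methods are hypothetical names for illustration; the class itself may compute these numbers differently, for example from its feature index.

```java
import java.util.HashSet;
import java.util.Set;

final class FeatureTokenTypeSketch {
    // Tokens: every feature occurrence counts, so this is the sum of the row lengths.
    static int tokens(int[][] data) {
        int n = 0;
        for (int[] row : data) {
            n += row.length;
        }
        return n;
    }

    // Types: each distinct feature id counts once, however often it occurs.
    static int types(int[][] data) {
        Set<Integer> seen = new HashSet<>();
        for (int[] row : data) {
            for (int f : row) {
                seen.add(f);
            }
        }
        return seen.size();
    }
}
```

    For rows {0,1}, {0,2}, {0,1} there are six tokens but only three types.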
  - addAll
```
public void addAll(java.lang.Iterable<? extends Datum<L,F>> data)
```
    Adds all Datums in the given collection of data to this dataset
    
    Parameters:
    
    data - collection of datums you would like to add to the dataset
  - split
```
public abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start,
                                                                    int end)
```
    Divide out a (devtest) split of the dataset versus the rest of it (as a training set).
    
    Parameters:
    
    start - Begin devtest with this index (inclusive)
    
    end - End devtest before this index (exclusive)
    
    Returns:
    
    A Pair of data sets, the first being the remainder of size this.size() - (end-start) and the second being of size (end-start)
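
    The index arithmetic of this split can be sketched on an ordinary list. RangeSplitSketch is a hypothetical helper, and a two-element list stands in for the Pair return type; the real method is abstract, so each subclass supplies its own implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSplitSketch {
    // Returns [train, devTest]: devTest holds items[start, end),
    // train holds all remaining items in their original order.
    static <T> List<List<T>> split(List<T> items, int start, int end) {
        List<T> devTest = new ArrayList<>(items.subList(start, end));
        List<T> train = new ArrayList<>(items.subList(0, start));
        train.addAll(items.subList(end, items.size()));
        List<List<T>> result = new ArrayList<>();
        result.add(train);   // first element: remainder, size() - (end - start) items
        result.add(devTest); // second element: (end - start) items
        return result;
    }
}
```

    Splitting ten items with start = 2 and end = 5 yields a devtest part of three items (indices 2, 3, 4) and a training part of the other seven.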
  - split
```
public abstract Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double fractionSplit)
```
    Divide out a (devtest) split from the start of the dataset and the rest of it (as a training set).
    
    Parameters:
    
    fractionSplit - The first fractionSplit of datums (rounded down) will be the second split
    
    Returns:
    
    A Pair of data sets, the first being the remainder of size ceiling(this.size() * (1-fractionSplit)) drawn from the end of the dataset and the second of size floor(this.size() * fractionSplit) drawn from the start of the dataset.
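
    The two sizes stated above can be checked with a few lines of arithmetic. FractionSplitSketch is a hypothetical helper written for this illustration; it computes the training size as whatever remains after the devtest part, which equals the ceiling expression in the Returns text for fractions whose product with the size is represented exactly.

```java
final class FractionSplitSketch {
    // Size of the second (devtest) part: the first fraction of the data, rounded down.
    static int devTestSize(int size, double fraction) {
        return (int) Math.floor(size * fraction);
    }

    // Size of the first (training) part: everything that is not in the devtest part.
    static int trainSize(int size, double fraction) {
        return size - devTestSize(size, fraction);
    }
}
```

    For 10 datums and a fraction of 0.25, the devtest part has floor(2.5) = 2 datums and the training part has 8, which matches ceiling(7.5) = 8.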
  - splitOutFold
```
public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> splitOutFold(int fold,
                                                                  int numFolds)
```
    Divide out a (devtest) split of the dataset versus the rest of it (as a training set).
    
    Parameters:
    
    fold - The number of this fold (must be between 0 and (numFolds - 1))
    
    numFolds - The number of folds to divide the data into (must be less than or equal to the size of the data set)
    
    Returns:
    
    A Pair of data sets, the first being roughly (numFolds-1)/numFolds of the data items (for use as training data), and the second being 1/numFolds of the data, taken from the fold^th part of the data (for use as devTest data)
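
    One way to turn a fold number into a start and end index is sketched below. The details are assumptions made for the illustration, not statements about the library: FoldBoundsSketch is a hypothetical name, numFolds is assumed not to exceed the number of items, folds are given size / numFolds items, and the last fold absorbs any remainder.

```java
final class FoldBoundsSketch {
    // Returns {start, end} (end exclusive) of the devtest range for the given fold.
    static int[] bounds(int fold, int numFolds, int size) {
        if (numFolds < 1 || numFolds > size || fold < 0 || fold >= numFolds) {
            throw new IllegalArgumentException(
                "Illegal fold " + fold + " of " + numFolds + " for size " + size);
        }
        int foldSize = size / numFolds; // integer division, rounds down
        int start = foldSize * fold;
        // Assumption: the final fold takes the leftover items.
        int end = (fold == numFolds - 1) ? size : start + foldSize;
        return new int[] {start, end};
    }
}
```

    With 10 items and 3 folds this gives the ranges [0, 3), [3, 6) and [6, 10); such a range could then be handed to a start/end split like split(int, int).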
  - size
```
public int size()
```
    Returns the number of examples (Datums) in the Dataset.
  - trimData
```
protected void trimData()
```
  - trimLabels
```
protected void trimLabels()
```
  - trimToSize
```
protected int[] trimToSize(int[] i)
```
  - trimToSize
```
protected int[][] trimToSize(int[][] i)
```
  - trimToSize
```
protected double[][] trimToSize(double[][] i)
```
  - randomize
```
public void randomize(long randomSeed)
```
    Randomizes the data array in place. Note: this cannot change the values array or the datum weights, so redefine this for RVFDataset and WeightedDataset! This uses the Fisher-Yates (or Durstenfeld-Knuth) shuffle, which is unbiased. The same algorithm is used by shuffle() in java.util.Collections, so you should get compatible results if using it on a Collection with the same seed (as of JDK 1.7, at least).
    
    Parameters:
    
    randomSeed - A seed for the Random object (allows you to reproduce the same ordering)
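
    A seeded Fisher-Yates (Durstenfeld) shuffle of the kind described above can be sketched on a plain int array. SeededShuffleSketch is a hypothetical name and this is not the library's code; the loop order mirrors the one OpenJDK uses in Collections.shuffle for random-access lists, which is what makes the two orderings agree for the same seed.

```java
import java.util.Random;

final class SeededShuffleSketch {
    // In-place Fisher-Yates shuffle, walking from the last slot down to the second.
    static void shuffle(int[] a, long seed) {
        Random rnd = new Random(seed);
        for (int i = a.length; i > 1; i--) {
            int j = rnd.nextInt(i); // uniform index in [0, i)
            int tmp = a[i - 1];
            a[i - 1] = a[j];
            a[j] = tmp;
        }
    }
}
```

    Shuffling {0, ..., 7} with seed 42 and shuffling an ArrayList of the same values with Collections.shuffle(list, new Random(42)) should leave both in the same order.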
  - shuffleWithSideInformation
```
public <E> void shuffleWithSideInformation(long randomSeed,
                                           java.util.List<E> sideInformation)
```
    Randomizes the data array in place. Note: this cannot change the values array or the datum weights, so redefine this for RVFDataset and WeightedDataset! This uses the Fisher-Yates (or Durstenfeld-Knuth) shuffle, which is unbiased. The same algorithm is used by shuffle() in java.util.Collections, so you should get compatible results if using it on a Collection with the same seed (as of JDK 1.7, at least).
    
    Parameters:
    
    randomSeed - A seed for the Random object (allows you to reproduce the same ordering)
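
    The description above does not say how sideInformation is treated. The sketch below assumes the natural reading of the method name: the list runs parallel to the data and every swap is applied to both, so entry i still describes datum i afterwards. SideInfoShuffleSketch is a hypothetical name and the int array stands in for the dataset's own arrays.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SideInfoShuffleSketch {
    // Applies one seeded Fisher-Yates permutation to both the array and the list.
    static <E> void shuffle(int[] labels, List<E> sideInformation, long seed) {
        if (labels.length != sideInformation.size()) {
            throw new IllegalArgumentException("array and side information differ in size");
        }
        Random rnd = new Random(seed);
        for (int i = labels.length; i > 1; i--) {
            int j = rnd.nextInt(i);
            int tmp = labels[i - 1];
            labels[i - 1] = labels[j];
            labels[j] = tmp;
            Collections.swap(sideInformation, i - 1, j); // keep the list aligned
        }
    }
}
```

    The list passed in must be modifiable, for example an ArrayList, because it is reordered in place.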
  - sampleDataset
```
public GeneralDataset<L,F> sampleDataset(long randomSeed,
                                         double sampleFrac,
                                         boolean sampleWithReplacement)
```
  - summaryStatistics
```
public abstract void summaryStatistics()
```
    Print some statistics summarizing the dataset
  - labelIterator
```
public java.util.Iterator<L> labelIterator()
```
    Returns an iterator over the class labels of the Dataset
    
    Returns:
    
    An iterator over the class labels of the Dataset
  - mapDataset
```
public GeneralDataset<L,F> mapDataset(GeneralDataset<L,F> dataset)
```
    Parameters:
    
    dataset -
    
    Returns:
    
    a new GeneralDataset whose features and ids map exactly to those of this GeneralDataset. Useful when two Datasets are created independently and one wants to train a model on one dataset and test on the other. -Ramesh.
  - mapDatum
```
public static <L,L2,F> Datum<L2,F> mapDatum(Datum<L,F> d,
                                            java.util.Map<L,L2> labelMapping,
                                            L2 defaultLabel)
```
  - mapDataset
```
public <L2> GeneralDataset<L2,F> mapDataset(GeneralDataset<L,F> dataset,
                                            Index<L2> newLabelIndex,
                                            java.util.Map<L,L2> labelMapping,
                                            L2 defaultLabel)
```
    Parameters:
    
    dataset -
    
    Returns:
    
    a new GeneralDataset whose features and ids map exactly to those of this GeneralDataset, but whose labels are converted to another set of labels
  - printSVMLightFormat
```
public void printSVMLightFormat()
```
    Dumps the Dataset as a training/test file for SVMLight. Each line has the form:
    
    class [fno:val]+
    
    The features must occur in consecutive order.
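
    A single line in the class [fno:val]+ layout could be assembled as sketched below. SvmLightLineSketch and formatLine are hypothetical names, not part of the API, and the sketch reads "consecutive order" as ascending feature number, which a TreeMap provides.

```java
import java.util.Map;
import java.util.TreeMap;

final class SvmLightLineSketch {
    // Builds "class fno:val fno:val ..." with feature numbers in ascending order.
    static String formatLine(String label, Map<Integer, Double> features) {
        StringBuilder sb = new StringBuilder(label);
        // Copying into a TreeMap sorts the entries by feature number.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

    Passing the label "+1" with features 3 -> 1.0 and 1 -> 0.5 produces the line "+1 1:0.5 3:1.0", regardless of the order in which the features were inserted.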
  - makeSvmLabelMap
```
public java.lang.String[] makeSvmLabelMap()
```
    Maps our labels to labels that are compatible with svm_light
    
    Returns:
    
    array of strings
  - printSVMLightFormat
```
public void printSVMLightFormat(java.io.PrintWriter pw)
```
    Print SVM Light Format file.
    
    The following comments are no longer applicable because I am now printing out the exact labelID for each example. -Ramesh (nmramesh@cs.stanford.edu) 12/17/2009.
    
    If the Dataset has more than 2 classes, then it prints using the label index (+1) (for svm_struct). If it is 2 classes, then the labelIndex.get(0) is mapped to +1 and labelIndex.get(1) is mapped to -1 (for svm_light).
  - iterator
```
public java.util.Iterator<RVFDatum<L,F>> iterator()
```
    Specified by:
    
    iterator in interface java.lang.Iterable<RVFDatum<L,F>>
  - numDatumsPerLabel
```
public ClassicCounter<L> numDatumsPerLabel()
```
  - printSparseFeatureMatrix
```
public abstract void printSparseFeatureMatrix()
```
    Prints the sparse feature matrix using printSparseFeatureMatrix(PrintWriter) to System.out.
  - printSparseFeatureMatrix
```
public abstract void printSparseFeatureMatrix(java.io.PrintWriter pw)
```
    Prints a sparse feature matrix representation of the Dataset. Prints the actual Object.toString() representations of features.
