L - Label type
F - Feature type
public class Dataset<L,F> extends GeneralDataset<L,F>
An interfacing class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of Datum objects for the purposes of training a Classifier with a ClassifierFactory.
Fields inherited from class GeneralDataset: data, featureIndex, labelIndex, labels, size
Constructor and Description |
---|
Dataset() |
Dataset(Index<F> featureIndex, Index<L> labelIndex) |
Dataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data): Constructor that fully specifies a Dataset. |
Dataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data, int size): Constructor that fully specifies a Dataset. |
Dataset(int numDatums) |
Dataset(int numDatums, Index<F> featureIndex, Index<L> labelIndex) |
Modifier and Type | Method and Description |
---|---|
void | add(Collection<F> features, L label) |
void | add(Collection<F> features, L label, boolean addNewFeatures) |
void | add(Datum<L,F> d) |
void | add(int[] features, int label): Adds a datum defined by feature indices and a label index. Careful with this one! Make sure that all indices are valid! |
protected void | addFeatureIndices(int[] features) |
protected void | addFeatures(Collection<F> features) |
protected void | addFeatures(Collection<F> features, boolean addNewFeatures) |
protected void | addLabel(L label) |
protected void | addLabelIndex(int label) |
void | applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds): Applies feature count thresholds to the Dataset. |
void | changeFeatureIndex(Index<F> newFeatureIndex) |
void | changeLabelIndex(Index<L> newLabelIndex) |
protected void | ensureSize() |
Datum<L,F> | getDatum(int index) |
Counter<F> | getFeatureCounter(): Gets the number of datums in which each feature appears. |
double[] | getInformationGains() |
RVFDataset<L,F> | getL1NormalizedTFIDFDataset(): Converts this dataset to an RVFDataset using L1-normalized TF-IDF features. |
RVFDatum<L,F> | getL1NormalizedTFIDFDatum(Datum<L,F> datum, Counter<F> featureDocCounts): Converts features from counts to L1-normalized TF-IDF based features. |
Dataset<L,F> | getRandomSubDataset(double p, int seed) |
RVFDatum<L,F> | getRVFDatum(int index) |
double[][] | getValuesArray() |
protected void | initialize(int numDatums): Resets the values of the dataset so that it is empty with an initial capacity of numDatums. |
void | printFullFeatureMatrix(PrintWriter pw): Prints the full feature matrix in tab-delimited form. |
void | printSparseFeatureMatrix(): Prints the sparse feature matrix using GeneralDataset.printSparseFeatureMatrix(PrintWriter) to System.out. |
void | printSparseFeatureMatrix(PrintWriter pw): Prints a sparse feature matrix representation of the Dataset. |
static void | printSVMLightFormat(PrintWriter pw, ClassicCounter<Integer> c, int classNo): Needs to sort the counter by feature keys and dump it. |
static Dataset<String,String> | readSVMLightFormat(String filename): Constructs a Dataset by reading in a file in SVM light format. |
static Dataset<String,String> | readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex): Constructs a Dataset by reading in a file in SVM light format. |
static Dataset<String,String> | readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex, List<String> lines): Constructs a Dataset by reading in a file in SVM light format. |
static Dataset<String,String> | readSVMLightFormat(String filename, List<String> lines): Constructs a Dataset by reading in a file in SVM light format. |
void | selectFeatures(int numFeatures, double[] scores): Generic method to select features based on the feature scores vector provided as an argument. |
void | selectFeaturesBinaryInformationGain(int numFeatures) |
Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> | split(double percentDev): Divides out a (devtest) split from the start of the dataset and the rest of it (as a training set). |
Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> | split(int start, int end): Divides out a (devtest) split of the dataset versus the rest of it (as a training set). |
void | summaryStatistics(): Prints some summary statistics to stderr for the Dataset. |
static Datum<String,String> | svmLightLineToDatum(String l) |
String | toString() |
String | toSummaryStatistics(): A String that is multiple lines of text giving summary statistics. |
String | toSummaryString() |
void | updateLabels(int[] labels) |
Methods inherited from class GeneralDataset: addAll, applyFeatureCountThreshold, applyFeatureMaxCountThreshold, clear, clear, featureIndex, getDataArray, getFeatureCounts, getLabelsArray, iterator, labelIndex, labelIterator, makeSvmLabelMap, mapDataset, mapDataset, mapDatum, numClasses, numDatumsPerLabel, numFeatures, numFeatureTokens, numFeatureTypes, printSVMLightFormat, printSVMLightFormat, randomize, retainFeatures, sampleDataset, shuffleWithSideInformation, size, splitOutFold, trimData, trimLabels, trimToSize, trimToSize, trimToSize
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable: forEach, spliterator
public Dataset()
public Dataset(int numDatums)
public Dataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data)
public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
split in class GeneralDataset<L,F>
percentDev - The fraction of datums, taken from the start of the dataset and rounded down, that forms the second (devtest) split
public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start, int end)
split in class GeneralDataset<L,F>
start - Begin devtest with this index (inclusive)
end - End devtest before this index (exclusive)
public double[][] getValuesArray()
getValuesArray in class GeneralDataset<L,F>
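The index arithmetic behind the two split methods can be sketched without the library. The block below is a minimal standalone illustration, not the CoreNLP implementation: SplitSketch and Split are hypothetical names, plain lists stand in for datasets, and the choice to return the training part alongside a separate devtest part is an assumption based on the parameter descriptions above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration of the documented split semantics.
// SplitSketch and Split are hypothetical names, not CoreNLP classes.
final class SplitSketch {

  /** The two halves of a split. */
  static final class Split<T> {
    final List<T> train;
    final List<T> devtest;

    Split(List<T> train, List<T> devtest) {
      this.train = train;
      this.devtest = devtest;
    }
  }

  /** Devtest is items[start, end); training is every other item, in order. */
  static <T> Split<T> split(List<T> items, int start, int end) {
    if (start < 0 || end > items.size() || start > end) {
      throw new IllegalArgumentException("invalid range [" + start + ", " + end + ")");
    }
    List<T> devtest = new ArrayList<>(items.subList(start, end));
    List<T> train = new ArrayList<>(items.subList(0, start));
    train.addAll(items.subList(end, items.size()));
    return new Split<>(train, devtest);
  }

  /** Devtest is the first floor(size * fraction) items. */
  static <T> Split<T> split(List<T> items, double fraction) {
    int devSize = (int) (items.size() * fraction); // the cast rounds down
    return split(items, 0, devSize);
  }

  public static void main(String[] args) {
    List<String> data = Arrays.asList("a", "b", "c", "d", "e");
    Split<String> byRange = split(data, 1, 3);
    System.out.println(byRange.devtest + " / " + byRange.train);
    Split<String> byFraction = split(data, 0.5);
    System.out.println(byFraction.devtest + " / " + byFraction.train);
  }
}
```

With five items, a fraction of 0.5 yields a devtest part of two items because 2.5 is rounded down.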
public static Dataset<String,String> readSVMLightFormat(String filename)
public static Dataset<String,String> readSVMLightFormat(String filename, List<String> lines)
public static Dataset<String,String> readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex)
public static Dataset<String,String> readSVMLightFormat(String filename, Index<String> featureIndex, Index<String> labelIndex, List<String> lines)
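For orientation, a line in SVM light format consists of a target followed by whitespace-separated feature:value pairs, optionally ending with a # comment. The following standalone sketch parses one such line. It is not the library's parser: SvmLightLineSketch and ParsedLine are hypothetical names, and how Dataset itself maps the values onto its features is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone illustration of the SVM light line layout.
// SvmLightLineSketch and ParsedLine are hypothetical names, not CoreNLP classes.
final class SvmLightLineSketch {

  /** A parsed line: the target token plus feature -> value pairs in file order. */
  static final class ParsedLine {
    final String label;
    final Map<String, Double> features;

    ParsedLine(String label, Map<String, Double> features) {
      this.label = label;
      this.features = features;
    }
  }

  static ParsedLine parse(String line) {
    // Drop a trailing "# ..." comment, if present.
    int hash = line.indexOf('#');
    String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
    if (body.isEmpty()) {
      throw new IllegalArgumentException("line has no target");
    }
    String[] tokens = body.split("\\s+");
    Map<String, Double> features = new LinkedHashMap<>();
    for (int i = 1; i < tokens.length; i++) {
      int colon = tokens[i].indexOf(':');
      if (colon < 0) {
        throw new IllegalArgumentException("expected feature:value, got " + tokens[i]);
      }
      String name = tokens[i].substring(0, colon);
      double value = Double.parseDouble(tokens[i].substring(colon + 1));
      features.put(name, value);
    }
    return new ParsedLine(tokens[0], features);
  }

  public static void main(String[] args) {
    ParsedLine p = parse("+1 3:1 7:2.5 # an example line");
    System.out.println(p.label + " " + p.features);
  }
}
```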
public Counter<F> getFeatureCounter()
public RVFDatum<L,F> getL1NormalizedTFIDFDatum(Datum<L,F> datum, Counter<F> featureDocCounts)
datum - with a collection of features.
featureDocCounts - a counter of doc-count for each feature.
public RVFDataset<L,F> getL1NormalizedTFIDFDataset()
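As a rough illustration of what L1-normalized TF-IDF means, the sketch below weights each feature by its count in the datum times log(numDocs / docCount) and then divides by the sum of absolute weights so the result sums to 1. The weighting formula is the common textbook variant and is an assumption; the exact formula used by getL1NormalizedTFIDFDatum is not given in this documentation. TfIdfSketch and its method name are hypothetical, and plain maps stand in for Counter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration of L1-normalized TF-IDF weighting.
// TfIdfSketch is a hypothetical name; the tf * log(N / df) formula is an assumption.
final class TfIdfSketch {

  static Map<String, Double> l1NormalizedTfIdf(
      List<String> features, Map<String, Integer> docCounts, int numDocs) {
    // Term frequency: how often each feature occurs in this datum.
    Map<String, Double> tf = new HashMap<>();
    for (String f : features) {
      tf.merge(f, 1.0, Double::sum);
    }
    // Raw weight = tf * idf, where idf = log(numDocs / docCount).
    Map<String, Double> weights = new HashMap<>();
    double l1 = 0.0;
    for (Map.Entry<String, Double> e : tf.entrySet()) {
      int df = docCounts.getOrDefault(e.getKey(), 0);
      if (df == 0) {
        continue; // no document count available, so no idf can be computed
      }
      double w = e.getValue() * Math.log((double) numDocs / df);
      weights.put(e.getKey(), w);
      l1 += Math.abs(w);
    }
    // L1 normalization: divide by the sum of absolute weights.
    if (l1 > 0.0) {
      final double norm = l1;
      weights.replaceAll((k, v) -> v / norm);
    }
    return weights;
  }

  public static void main(String[] args) {
    Map<String, Integer> docCounts = new HashMap<>();
    docCounts.put("a", 1);
    docCounts.put("b", 2);
    System.out.println(l1NormalizedTfIdf(List.of("a", "a", "b"), docCounts, 4));
  }
}
```

Features that occur in every document receive an idf of log(1) = 0 under this formula, so they contribute nothing to the normalized vector.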
public void add(Collection<F> features, L label)
public void add(Collection<F> features, L label, boolean addNewFeatures)
public void add(int[] features, int label)
features - indices of the datum's features
label - index of the datum's label
protected void ensureSize()
protected void addLabel(L label)
protected void addLabelIndex(int label)
protected void addFeatures(Collection<F> features)
protected void addFeatures(Collection<F> features, boolean addNewFeatures)
protected void addFeatureIndices(int[] features)
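The add and ensureSize methods above suggest an array-backed store that grows on demand. The sketch below shows one way such a representation could work: each distinct feature and label is mapped to an integer once, and every datum is then stored as an int[] of feature indices plus one int label. This is an assumption-laden illustration, not the Dataset source; CompactDatasetSketch, its fields, the String element types, and the doubling growth policy are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of an index-backed, array-based datum store.
// CompactDatasetSketch is a hypothetical name, not a CoreNLP class.
final class CompactDatasetSketch {
  private final Map<String, Integer> featureIds = new HashMap<>();
  private final Map<String, Integer> labelIds = new HashMap<>();
  private int[] labels;
  private int[][] data;
  private int size;

  CompactDatasetSketch(int capacity) {
    labels = new int[capacity];
    data = new int[capacity][];
  }

  /** Stores a datum as feature indices plus a label index. */
  void add(Collection<String> features, String label) {
    ensureSize();
    labels[size] = idFor(labelIds, label);
    int[] row = new int[features.size()];
    int i = 0;
    for (String f : features) {
      row[i++] = idFor(featureIds, f);
    }
    data[size++] = row;
  }

  /** Grows the backing arrays when full (doubling is an assumed policy). */
  private void ensureSize() {
    if (size == labels.length) {
      int newCapacity = Math.max(1, size * 2);
      labels = Arrays.copyOf(labels, newCapacity);
      data = Arrays.copyOf(data, newCapacity);
    }
  }

  /** Returns the existing index for key, or assigns the next free one. */
  private static int idFor(Map<String, Integer> index, String key) {
    Integer id = index.get(key);
    if (id == null) {
      id = index.size();
      index.put(key, id);
    }
    return id;
  }

  int size() { return size; }

  int numFeatures() { return featureIds.size(); }

  int label(int i) {
    if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
    return labels[i];
  }

  int[] row(int i) {
    if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
    return data[i].clone();
  }

  public static void main(String[] args) {
    CompactDatasetSketch ds = new CompactDatasetSketch(1);
    ds.add(Arrays.asList("x", "y"), "pos");
    ds.add(Arrays.asList("y", "z"), "neg");
    System.out.println(ds.size() + " datums, " + ds.numFeatures() + " features");
  }
}
```

A variant that accepts raw int indices, like add(int[] features, int label), would bypass the index lookup entirely, which is why the caller has to guarantee that every index is valid.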
protected final void initialize(int numDatums)
Resets the values of the dataset so that it is empty with an initial capacity of numDatums.
initialize in class GeneralDataset<L,F>
numDatums - initial capacity of dataset
public Datum<L,F> getDatum(int index)
getDatum in class GeneralDataset<L,F>
public RVFDatum<L,F> getRVFDatum(int index)
getRVFDatum in class GeneralDataset<L,F>
public void summaryStatistics()
summaryStatistics in class GeneralDataset<L,F>
public String toSummaryStatistics()
public void applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds)
thresholds - a list of pattern, threshold pairs
public void printFullFeatureMatrix(PrintWriter pw)
public void printSparseFeatureMatrix()
Prints the sparse feature matrix using GeneralDataset.printSparseFeatureMatrix(PrintWriter) to System.out.
printSparseFeatureMatrix in class GeneralDataset<L,F>
public void printSparseFeatureMatrix(PrintWriter pw)
Prints a sparse feature matrix representation of the Dataset, using the Object.toString() representations of features.
printSparseFeatureMatrix in class GeneralDataset<L,F>
public void selectFeaturesBinaryInformationGain(int numFeatures)
public void selectFeatures(int numFeatures, double[] scores)
numFeatures - number of features to be selected.
scores - a vector of scores whose length equals the total number of features in the data.
public double[] getInformationGains()
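Score-based selection of this kind amounts to keeping the numFeatures highest-scoring feature indices. The sketch below shows that step in isolation, under stated assumptions: SelectFeaturesSketch and topFeatures are hypothetical names, ties go to the lower index, and the result is returned in ascending index order. How the library breaks ties or re-indexes the retained features is not described here.

```java
import java.util.Arrays;

// Standalone illustration of top-k feature selection by score.
// SelectFeaturesSketch is a hypothetical name, not a CoreNLP class.
final class SelectFeaturesSketch {

  /** Returns the indices of the numFeatures highest scores, in ascending index order. */
  static int[] topFeatures(int numFeatures, double[] scores) {
    if (numFeatures < 0) {
      throw new IllegalArgumentException("numFeatures must be non-negative");
    }
    Integer[] order = new Integer[scores.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort indices by descending score; the sort is stable, so ties keep the lower index first.
    Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
    int k = Math.min(numFeatures, scores.length);
    int[] kept = new int[k];
    for (int i = 0; i < k; i++) {
      kept[i] = order[i];
    }
    Arrays.sort(kept);
    return kept;
  }

  public static void main(String[] args) {
    double[] scores = {0.1, 0.9, 0.5, 0.7};
    System.out.println(Arrays.toString(topFeatures(2, scores)));
  }
}
```

Any per-feature score vector works as input, for example the values returned by getInformationGains().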
public void updateLabels(int[] labels)
public String toSummaryString()
public static void printSVMLightFormat(PrintWriter pw, ClassicCounter<Integer> c, int classNo)