mark.nlp.features
Class CorpusCounter

java.lang.Object
  |
  +--mark.nlp.features.CorpusCounter
Direct Known Subclasses:
BagCorpusCounter

public abstract class CorpusCounter
extends java.lang.Object

Assume a sampling process in which each sample determines two discrete random variables, C and W. The range of C is the integers in [0, l(C) - 1], where l(C) is the number of values C can assume. The range of W is the set of distinct objects {W_0, ..., W_(l(W)-1)}, where l(W) is the number of values W can assume. A CorpusCounter maintains joint counts of C and W. Counts are doubles, so fractional counts such as 0.5 are legal.
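
As a rough illustration of the joint counting described above, the following sketch keeps fractional counts in a hash map. It is a hypothetical stand-in, not the library's implementation: SimpleCorpusCounter and its add method are assumed names, and the class mirrors the documented method names without extending CorpusCounter so that it compiles on its own. Treating WLength() as the number of distinct W values observed so far is also an assumption.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in: mirrors CorpusCounter's method names but does not
// extend the real class, so the sketch compiles without the library.
class SimpleCorpusCounter {
    private final int cLength;       // l(C)
    private final double[] countC;   // countC[c] = #(C = c)
    // Maps each w to an array whose entry c holds #(C = c, W = w).
    private final Map<Object, double[]> countCW = new HashMap<>();
    private double total;            // total (possibly fractional) sample count

    SimpleCorpusCounter(int cLength) {
        this.cLength = cLength;
        this.countC = new double[cLength];
    }

    // Assumed update method; the documented API does not say how counts are added.
    void add(int c, Object w, double weight) {
        countCW.computeIfAbsent(w, k -> new double[cLength])[c] += weight;
        countC[c] += weight;
        total += weight;
    }

    int CLength() { return cLength; }

    // Assumption: l(W) is taken to be the number of distinct w seen so far.
    int WLength() { return countCW.size(); }

    double num() { return total; }

    double numC(int c) { return countC[c]; }

    double numCW(int c, Object w) {
        double[] perClass = countCW.get(w);
        return perClass == null ? 0.0 : perClass[c];
    }

    double numW(Object w) {
        double[] perClass = countCW.get(w);
        if (perClass == null) {
            return 0.0;
        }
        double sum = 0.0;
        for (double v : perClass) {
            sum += v;
        }
        return sum;
    }

    Iterator<Object> WIterator() { return countCW.keySet().iterator(); }

    public static void main(String[] args) {
        SimpleCorpusCounter counter = new SimpleCorpusCounter(2);
        counter.add(0, "spam", 1.0);
        counter.add(1, "spam", 0.5);   // a fractional count is legal
        counter.add(1, "ham", 2.0);
        System.out.println("#(C=1, W=spam) = " + counter.numCW(1, "spam"));
        System.out.println("#(W=spam)      = " + counter.numW("spam"));
        System.out.println("total          = " + counter.num());
    }
}
```

In this sketch the marginal counts stay consistent with the joint counts because add updates all three together.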


Constructor Summary
CorpusCounter()
           
 
Method Summary
abstract  int CLength()
          Returns the number of values C can assume.
 Table countTable(int c, java.lang.Object w)
          Returns (smoothed) counts in a 2-by-2 contingency table.
 Table countTable(java.lang.Object w)
          Returns (smoothed) counts in an l(C)-by-2 contingency table.
abstract  double num()
          Returns the total number of samples.
abstract  double numC(int c)
          Returns #(C = c).
abstract  double numCW(int c, java.lang.Object w)
          Returns #(C = c, W = w).
abstract  double numW(java.lang.Object w)
          Returns #(W = w).
abstract  java.util.Iterator WIterator()
          Returns an iterator over the values W can assume.
abstract  java.util.Iterator WIterator(int c)
          Returns an iterator over the values W assumes in conjunction with a given value of C.
abstract  int WLength()
          Returns the number of values W can assume.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CorpusCounter

public CorpusCounter()
Method Detail

WIterator

public abstract java.util.Iterator WIterator()
Returns an iterator over the values W can assume.

Returns:
the iterator.

WIterator

public abstract java.util.Iterator WIterator(int c)
Returns an iterator over the values W assumes in conjunction with a given value of C.

Parameters:
c - the C value.
Returns:
the iterator.

CLength

public abstract int CLength()
Returns the number of values C can assume.

Returns:
l(C).

WLength

public abstract int WLength()
Returns the number of values W can assume.

Returns:
l(W).

numCW

public abstract double numCW(int c,
                             java.lang.Object w)
Returns #(C = c, W = w).

Parameters:
c - the C value.
w - the W value.
Returns:
the count.

numC

public abstract double numC(int c)
Returns #(C = c).

Parameters:
c - the C value.
Returns:
the count.

numW

public abstract double numW(java.lang.Object w)
Returns #(W = w).

Parameters:
w - the W value.
Returns:
the count.

num

public abstract double num()
Returns the total number of samples.

Returns:
the count.

countTable

public Table countTable(int c,
                        java.lang.Object w)
Returns (smoothed) counts in a 2-by-2 contingency table. For a given value of W and a given value of C, generates a table as follows:
       | ^w | w
   ----+----+----
   ^c  |    |
   ----+----+----
   c   |    |
 
Here ^c stands for "not class c" and ^w for "not w". For example, the cell in row ^c and column ^w holds the number of samples that are not in class c and are not w. Counts are smoothed with Lidstone smoothing.

Parameters:
c - the C value.
w - the W value.
Returns:
the count table.
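
The four cells can be derived from the counts this class exposes. The following is a minimal sketch under stated assumptions, not the library's code: the name ContingencySketch, the double[][] return type (standing in for Table, whose API is not documented here), and the smoothing constant lambda are all hypothetical. It assumes Lidstone smoothing adds lambda to every cell, which is the usual definition; the value the library actually uses is not stated.

```java
// Hypothetical helper; the real countTable returns a Table, not a double[][].
class ContingencySketch {
    // Row index 0 is ^c and row index 1 is c; column index 0 is ^w and
    // column index 1 is w, matching the layout drawn above.
    static double[][] table2x2(double num, double numC, double numW,
                               double numCW, double lambda) {
        double[][] t = new double[2][2];
        t[1][1] = numCW + lambda;                        // in c, is w
        t[1][0] = (numC - numCW) + lambda;               // in c, not w
        t[0][1] = (numW - numCW) + lambda;               // not in c, is w
        t[0][0] = (num - numC - numW + numCW) + lambda;  // not in c, not w
        return t;
    }

    public static void main(String[] args) {
        // Illustrative counts: 10 samples, 4 in class c, 3 with W = w, 2 with both.
        double[][] t = table2x2(10.0, 4.0, 3.0, 2.0, 0.5);
        System.out.println("^c,^w = " + t[0][0] + "   ^c,w = " + t[0][1]);
        System.out.println(" c,^w = " + t[1][0] + "    c,w = " + t[1][1]);
    }
}
```

With lambda set to 0 the cells sum to num(); with a positive lambda they sum to num() plus four times lambda.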

countTable

public Table countTable(java.lang.Object w)
Returns (smoothed) counts in an l(C)-by-2 contingency table. For a given value of W, generates a count table as follows:
        | ^w | w
   -----+----+----
   c_0  |    |
   -----+----+----
    .   |    .
    .   |    .
    .   |    .
   -----+----+----
   c_n  |    |
 
Here n = l(C) - 1. A cell in row c_i and column ^w holds the number of samples in class c_i that are not w. A cell in row c_i and column w holds the number of samples in class c_i that are w. Counts are smoothed with Lidstone smoothing.

Parameters:
w - the W value.
Returns:
the count table.
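
The per-class table follows the same pattern, one row per value of C. The sketch below is an illustration under assumptions rather than the library's implementation: ClassByWordTableSketch, the array-based inputs, and the smoothing constant lambda are hypothetical names and choices, and the result is a plain double[][] instead of a Table.

```java
// Hypothetical helper; inputs are the per-class counts for one fixed w.
class ClassByWordTableSketch {
    // numC[i] = #(C = i) and numCW[i] = #(C = i, W = w).
    // Column index 0 is ^w and column index 1 is w.
    static double[][] tableForW(double[] numC, double[] numCW, double lambda) {
        if (numC.length != numCW.length) {
            throw new IllegalArgumentException("numC and numCW must both have length l(C)");
        }
        double[][] t = new double[numC.length][2];
        for (int i = 0; i < numC.length; i++) {
            t[i][0] = (numC[i] - numCW[i]) + lambda;  // class i, not w
            t[i][1] = numCW[i] + lambda;              // class i, is w
        }
        return t;
    }

    public static void main(String[] args) {
        // Illustrative counts for two classes.
        double[][] t = tableForW(new double[] {4.0, 6.0},
                                 new double[] {2.0, 1.0}, 0.5);
        for (int i = 0; i < t.length; i++) {
            System.out.println("c_" + i + ": ^w = " + t[i][0] + ", w = " + t[i][1]);
        }
    }
}
```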