edu.stanford.nlp.stats
Class SimpleGoodTuring

java.lang.Object
  extended by edu.stanford.nlp.stats.SimpleGoodTuring

public class SimpleGoodTuring
extends Object

Simple Good-Turing smoothing, based on code from Sampson, available at: ftp://ftp.informatics.susx.ac.uk/pub/users/grs2/SGT.c

See also http://www.grsampson.net/RGoodTur.html

Author:
Bill MacCartney (wcmac@cs.stanford.edu)

Constructor Summary
SimpleGoodTuring(int[] r, int[] n)
          Each instance of this class encapsulates the computation of the smoothing for one probability distribution.
 
Method Summary
 double[] getProbabilities()
          Returns the probabilities allocated to each type, according to their count in the underlying collection.
 double getProbabilityForUnseen()
          Returns the probability allocated to types not seen in the underlying collection.
static void main(String[] args)
          Like Sampson's SGT program, reads data from STDIN and writes results to STDOUT.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleGoodTuring

public SimpleGoodTuring(int[] r,
                        int[] n)
Each instance of this class encapsulates the computation of the smoothing for one probability distribution. The constructor takes two parallel arrays. The first, r, is an array of counts, which must be positive and in ascending order. The second, n, holds the corresponding counts of counts: for each i, n[i] is the number of types that occurred with count r[i] in the underlying collection. See the documentation for main() for a concrete example.
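
The parallel arrays are usually derived from raw per-type counts. The following is a minimal sketch of one way to build them; CountsOfCounts and fromTypeCounts are hypothetical names used for illustration and are not part of edu.stanford.nlp.stats.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of the Stanford NLP API). */
final class CountsOfCounts {

  /**
   * Returns {r, n}: r holds the distinct positive counts in ascending order,
   * n[i] holds the number of types whose count is r[i].
   */
  static int[][] fromTypeCounts(int[] typeCounts) {
    // A TreeMap iterates its keys in ascending order, which matches
    // the ordering the constructor requires for r.
    TreeMap<Integer, Integer> tally = new TreeMap<>();
    for (int c : typeCounts) {
      if (c <= 0) {
        throw new IllegalArgumentException("counts must be positive: " + c);
      }
      tally.merge(c, 1, Integer::sum);
    }
    int[] r = new int[tally.size()];
    int[] n = new int[tally.size()];
    int i = 0;
    for (Map.Entry<Integer, Integer> e : tally.entrySet()) {
      r[i] = e.getKey();
      n[i] = e.getValue();
      i++;
    }
    return new int[][] {r, n};
  }

  public static void main(String[] args) {
    // Six types: three seen once, two seen twice, one seen five times.
    int[][] rn = fromTypeCounts(new int[] {1, 1, 1, 2, 2, 5});
    System.out.println(Arrays.toString(rn[0])); // [1, 2, 5]
    System.out.println(Arrays.toString(rn[1])); // [3, 2, 1]
    // With the Stanford NLP classes on the classpath, rn[0] and rn[1]
    // could then be passed as r and n to new SimpleGoodTuring(r, n).
  }
}
```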

Method Detail

getProbabilityForUnseen

public double getProbabilityForUnseen()
Returns the probability allocated to types not seen in the underlying collection.
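
As a rough illustration, the standard Good-Turing estimate of the unseen mass is N1/N, where N1 is the number of types seen exactly once and N is the total count. The sketch below assumes this class follows that estimate, which is consistent with the value 0.1923 shown in the example under main(); UnseenMass is a hypothetical name, not part of this API.

```java
import java.util.Locale;

/** Hypothetical illustration (not part of the Stanford NLP API). */
final class UnseenMass {

  /**
   * Computes N1 / N, where N is the sum of r[i] * n[i] and N1 is the
   * number of types with count 1. Returns 0 if no type has count 1.
   */
  static double unseenMass(int[] r, int[] n) {
    long total = 0;
    long singletons = 0;
    for (int i = 0; i < r.length; i++) {
      total += (long) r[i] * n[i];
      if (r[i] == 1) {
        singletons = n[i];
      }
    }
    return total == 0 ? 0.0 : (double) singletons / total;
  }

  public static void main(String[] args) {
    int[] r = {1, 2, 3, 5, 8};
    int[] n = {10, 6, 4, 2, 1};
    // 10 singletons out of a total count of 52.
    System.out.println(String.format(Locale.ROOT, "%.4f", unseenMass(r, n))); // 0.1923
  }
}
```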


getProbabilities

public double[] getProbabilities()
Returns the probabilities allocated to each type, according to its count in the underlying collection. The returned array parallels the arrays passed to the constructor. If the returned array is designated p, then for all i, p[i] is the smoothed probability assigned to each type that occurred r[i] times in the underlying collection (where r is the first argument to the constructor).
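
Because the returned array is indexed by position rather than by count, callers typically need to map a count back to its index in r. A minimal sketch of such a lookup follows; SmoothedLookup is a hypothetical helper, and the p values are copied from the example output documented under main().

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the Stanford NLP API). */
final class SmoothedLookup {

  /**
   * Returns p[i] for the index i with r[i] == count, or NaN if the
   * count does not appear in r. Relies on r being in ascending order.
   */
  static double probabilityForCount(int[] r, double[] p, int count) {
    int i = Arrays.binarySearch(r, count);
    return i >= 0 ? p[i] : Double.NaN;
  }

  public static void main(String[] args) {
    int[] r = {1, 2, 3, 5, 8};
    // Smoothed values taken from the p* column of the documented example.
    double[] p = {0.01203, 0.02951, 0.04814, 0.08647, 0.1448};
    System.out.println(probabilityForCount(r, p, 3)); // 0.04814
    System.out.println(probabilityForCount(r, p, 4)); // NaN (count 4 was never observed)
  }
}
```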


main

public static void main(String[] args)
                 throws Exception
Like Sampson's SGT program, reads data from STDIN and writes results to STDOUT. The input should contain two integers on each line, separated by whitespace. The first integer is a count; the second is the count of that count, that is, the number of types that occurred with that count. The input must be sorted in ascending order of the first integer, and should not contain 0s. For example, valid input is:

   1 10
   2 6
   3 4
   5 2
   8 1
 
This represents a collection in which 10 types occur once each, 6 types occur twice each, 4 types occur 3 times each, 2 types occur 5 times each, and one type occurs 8 times, for a total count of 52. This input will produce the following output:

     r      n        p       p*
  ----   ----     ----     ----
     0      0    0.000   0.1923
     1     10  0.01923  0.01203
     2      6  0.03846  0.02951
     3      4  0.05769  0.04814
     5      2  0.09615  0.08647
     8      1   0.1538   0.1448
 
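The first two columns echo the input counts (r) and counts of counts (n). The p column gives the unsmoothed relative frequency r/N, where N is the total count (52 here). The last column, p*, gives the smoothed probabilities, and the first item in this column is the probability assigned to unseen items.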
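
The arithmetic behind this example can be checked independently of the class. The sketch below recomputes the total count and the unsmoothed p column from r and n, then confirms that the documented p* values, weighted by n and added to the unseen mass, sum to approximately 1. SgtExampleCheck is a hypothetical name, and the p* values are copied from the table above rather than computed.

```java
import java.util.Locale;

/** Hypothetical sanity check (not part of the Stanford NLP API). */
final class SgtExampleCheck {

  /** Total count N: the sum of r[i] * n[i]. */
  static long totalCount(int[] r, int[] n) {
    long total = 0;
    for (int i = 0; i < r.length; i++) {
      total += (long) r[i] * n[i];
    }
    return total;
  }

  /** Unsmoothed per-type relative frequencies r[i] / N. */
  static double[] unsmoothed(int[] r, int[] n) {
    long total = totalCount(r, n);
    double[] p = new double[r.length];
    for (int i = 0; i < r.length; i++) {
      p[i] = (double) r[i] / total;
    }
    return p;
  }

  /** Unseen mass plus the sum of n[i] * pStar[i]; close to 1 for a proper distribution. */
  static double totalMass(double pUnseen, int[] n, double[] pStar) {
    double mass = pUnseen;
    for (int i = 0; i < n.length; i++) {
      mass += n[i] * pStar[i];
    }
    return mass;
  }

  public static void main(String[] args) {
    int[] r = {1, 2, 3, 5, 8};
    int[] n = {10, 6, 4, 2, 1};
    // Values copied from the p* column of the example output.
    double[] pStar = {0.01203, 0.02951, 0.04814, 0.08647, 0.1448};

    System.out.println(totalCount(r, n)); // 52
    for (double p : unsmoothed(r, n)) {
      System.out.println(String.format(Locale.ROOT, "%.5f", p));
    }
    // The printed table is rounded, so the total is only approximately 1.
    System.out.println(totalMass(0.1923, n, pStar));
  }
}
```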

Throws:
Exception


Stanford NLP Group