nltk.SimpleGoodTuringProbDist

class nltk.SimpleGoodTuringProbDist(freqdist, bins=None)

SimpleGoodTuringProbDist approximates the relationship between a frequency and its frequency of frequency as a straight line in log space, fitted by linear regression. Details of the Simple Good-Turing algorithm can be found in:

  • “Good Turing smoothing without tears” (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2, pp. 217-237.
  • “Speech and Language Processing” (Jurafsky & Martin), 2nd Edition, Chapter 4.5, p. 103 (log(Nc) = a + b*log(c))
  • http://www.grsampson.net/RGoodTur.html
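
The relation log(Nc) = a + b*log(c) cited above links a count c to Nc, the number of distinct samples observed exactly c times. As a minimal sketch of how such (c, Nc) pairs can be tabulated, the following uses only the standard library; the helper name freq_of_freqs and the toy token list are assumptions made for illustration and are not part of NLTK.

```python
from collections import Counter

def freq_of_freqs(tokens):
    """Return {c: Nc}: for each count c, how many distinct samples occur c times."""
    counts = Counter(tokens)         # sample -> count c
    return Counter(counts.values())  # count c -> Nc

# Toy data (made up): "a" occurs 3 times, "b" and "c" twice, "d", "e", "f" once.
tokens = "a a a b b c c d e f".split()
pairs = sorted(freq_of_freqs(tokens).items())
print(pairs)  # [(1, 3), (2, 2), (3, 1)]
```

These pairs are the kind of input a log-log regression would be fitted to.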

Given a set of pairs (xi, yi), where xi denotes the frequency and yi denotes the frequency of frequency, we want to fit the line that minimizes the squared deviation from the points. E(x) and E(y) represent the means of xi and yi.

  • slope: b = sigma ((xi-E(x))(yi-E(y))) / sigma ((xi-E(x))(xi-E(x)))
  • intercept: a = E(y) - b*E(x)
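
The slope and intercept formulas above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not NLTK's implementation: the names fit_line and smoothed_nr are hypothetical, the counts are made up, and the sketch fits the raw frequencies of frequencies directly in log space without any further preprocessing.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b). Hypothetical helper."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)  # E(x)
    mean_y = sum(ys) / len(ys)  # E(y)
    # b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) ** 2)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # a = E(y) - b*E(x)
    return intercept, slope

# Made-up example data: counts r and frequencies of frequencies Nr.
r = [1, 2, 3, 4, 5]
nr = [120, 40, 24, 13, 15]

# Fit in log space: log(Nr) = a + b*log(r).
a, b = fit_line([math.log(v) for v in r], [math.log(v) for v in nr])

def smoothed_nr(count):
    """Evaluate the fitted line at `count` and map back out of log space."""
    return math.exp(a + b * math.log(count))
```

Because the fit is done on logarithms, the fitted line has to be exponentiated to recover a smoothed frequency of frequency, which is what the hypothetical smoothed_nr does.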

Methods

__init__(freqdist[, bins]) param freqdist: The frequency counts upon which to base the
check()
discount() Return the total probability mass transferred from the seen samples to the unseen samples.
find_best_fit(r, nr) Use simple linear regression to tune parameters self._slope and
freqdist()
generate() Return a randomly selected sample from this probability distribution.
logprob(sample) Return the base 2 logarithm of the probability for a given sample.
max()
prob(sample) Return the sample’s probability.
samples()
smoothedNr(r) Return the number of samples with count r.
unicode_repr() Return a string representation of this ProbDist.

Attributes

SUM_TO_ONE