nltk.HeldoutProbDist

class nltk.HeldoutProbDist(base_fdist, heldout_fdist, bins=None)[source]

The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the “heldout frequency distribution” and the “base frequency distribution”. The heldout estimate uses the heldout frequency distribution to predict the probability of each sample, given its frequency in the base frequency distribution.

In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.

This average frequency is Tr[r]/(Nr[r].N), that is, Tr[r] divided by the product of Nr[r] and N, where:

  • Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
  • Nr[r] is the number of samples that occur r times in the base distribution.
  • N is the number of outcomes recorded by the heldout frequency distribution.

In order to increase the efficiency of the prob member function, Tr[r]/(Nr[r].N) is precomputed for each value of r when the HeldoutProbDist is created.
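The precomputation described above can be sketched in plain Python. This is an illustrative sketch only, not NLTK's implementation: the function name heldout_estimates and the use of collections.Counter in place of frequency distributions are assumptions made for the example. It also assumes that the estimate for r = 0 (samples absent from the base distribution) is left at 0.0, since this section does not describe how the bins parameter affects that case.

```python
from collections import Counter


def heldout_estimates(base_counts, heldout_counts):
    """Return a list est where est[r] = Tr[r] / (Nr[r] * N).

    Hypothetical helper for illustration; both arguments map
    samples to integer counts.
    """
    # N: total number of outcomes recorded in the heldout data.
    n = sum(heldout_counts.values())
    # Largest count of any sample in the base data (0 if empty).
    max_r = max(base_counts.values(), default=0)

    tr = [0] * (max_r + 1)  # Tr[r]: heldout count mass for samples with base count r
    nr = [0] * (max_r + 1)  # Nr[r]: number of samples with base count r
    for sample, r in base_counts.items():
        nr[r] += 1
        tr[r] += heldout_counts.get(sample, 0)

    # Precompute one estimate per r; leave 0.0 where Nr[r] or N is zero.
    return [
        tr[r] / (nr[r] * n) if nr[r] and n else 0.0
        for r in range(max_r + 1)
    ]


base = Counter("a a a b b c d".split())     # a:3, b:2, c:1, d:1
heldout = Counter("a a b c c d e".split())  # N = 7

est = heldout_estimates(base, heldout)

# Looking up a probability is then a single list access on the base count.
for sample in ("a", "b", "c"):
    print(sample, est[base.get(sample, 0)])
```

In this example, "c" and "d" both occur once in the base counts and together account for 3 of the 7 heldout outcomes, so each receives the estimate 3/(2·7). Because the list is built once, every later lookup costs only one count lookup and one list index.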

Variables:
  • _estimate – A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. _estimate[r] is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular, _estimate[r] = Tr[r]/(Nr[r].N).
  • _max_r – The maximum number of times that any sample occurs in the base distribution. _max_r is used to decide how large _estimate must be.

Methods

__init__(base_fdist, heldout_fdist[, bins]) Use the heldout estimate to create a probability distribution for the experiment used to generate base_fdist and heldout_fdist.
base_fdist() Return the base frequency distribution that this probability distribution is based on.
discount()
generate() Return a randomly selected sample from this probability distribution.
heldout_fdist() Return the heldout frequency distribution that this probability distribution is based on.
logprob(sample) Return the base 2 logarithm of the probability for a given sample.
max()
prob(sample)
samples()
unicode_repr() Return a string representation of this probability distribution (return type: str).

Attributes

SUM_TO_ONE