nltk.BigramAssocMeasures

class nltk.BigramAssocMeasures

A collection of bigram association measures. Each association measure is provided as a function with three arguments:

bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

n_ii counts (w1, w2), i.e. the bigram being scored
n_ix counts (w1, *)
n_xi counts (*, w2)
n_xx counts (*, *), i.e. any bigram

This may be shown with respect to a contingency table:

        w1    ~w1
     ------ ------
 w2 | n_ii | n_oi | = n_xi
     ------ ------
~w2 | n_io | n_oo |
     ------ ------
     = n_ix        TOTAL = n_xx
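To make the table concrete, the following is a minimal sketch that counts the marginals for one bigram in a toy token list and then derives the four cells. It follows the definitions given above literally. The helper names `bigram_marginals` and `contingency` and the sample sentence are assumptions made for illustration; they are not part of NLTK, and this is not necessarily how NLTK gathers its counts.

```python
from collections import Counter


def bigram_marginals(tokens, w1, w2):
    """Return (n_ii, (n_ix, n_xi), n_xx) for the bigram (w1, w2).

    Hypothetical helper: counts are taken over adjacent token pairs.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n_ii = Counter(bigrams)[(w1, w2)]               # (w1, w2)
    n_ix = sum(1 for a, _ in bigrams if a == w1)    # (w1, *)
    n_xi = sum(1 for _, b in bigrams if b == w2)    # (*, w2)
    n_xx = len(bigrams)                             # (*, *)
    return n_ii, (n_ix, n_xi), n_xx


def contingency(n_ii, n_ix_xi, n_xx):
    """Recover the four cells (n_ii, n_oi, n_io, n_oo) from the marginals."""
    n_ix, n_xi = n_ix_xi
    n_oi = n_xi - n_ii                   # (~w1, w2): row w2 minus n_ii
    n_io = n_ix - n_ii                   # (w1, ~w2): column w1 minus n_ii
    n_oo = n_xx - n_ii - n_oi - n_io     # (~w1, ~w2): everything else
    return n_ii, n_oi, n_io, n_oo


tokens = "the cat sat on the mat and the cat slept".split()
marginals = bigram_marginals(tokens, "the", "cat")
print(marginals)                 # (2, (3, 2), 9)
print(contingency(*marginals))   # (2, 0, 1, 6)
```

The tuple returned by `bigram_marginals` has the same shape as the three arguments that each association measure expects.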

Methods

chi_sq(n_ii, n_ix_xi_tuple, n_xx) Scores bigrams using chi-square, i.e. phi-sq multiplied by the number of bigrams, as in Manning and Schutze 5.3.3.
dice(n_ii, n_ix_xi_tuple, n_xx) Scores bigrams using Dice’s coefficient.
fisher(*marginals) Scores bigrams using Fisher’s Exact Test (Pedersen 1996).
jaccard(*marginals) Scores ngrams using the Jaccard index.
likelihood_ratio(*marginals) Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4.
mi_like(*marginals, **kwargs) Scores ngrams using a variant of mutual information.
phi_sq(*marginals) Scores bigrams using phi-square, the square of the Pearson correlation coefficient.
pmi(*marginals) Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
poisson_stirling(*marginals) Scores ngrams using the Poisson-Stirling measure.
raw_freq(*marginals) Scores ngrams by their frequency.
student_t(*marginals) Scores ngrams using Student’s t test with independence hypothesis for unigrams, as in Manning and Schutze 5.3.1.
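As an illustration of how several of these measures relate to the marginals, here is a rough pure-Python sketch of the usual textbook formulas for raw frequency, Dice's coefficient, pointwise mutual information, phi-square and chi-square. The function names mirror the methods listed above only for readability; the bodies are assumptions written for this example, not NLTK's own code, and they do no guarding against zero counts.

```python
import math


def _cells(n_ii, n_ix_xi, n_xx):
    """Hypothetical helper: expand marginals into (n_ii, n_oi, n_io, n_oo)."""
    n_ix, n_xi = n_ix_xi
    n_oi = n_xi - n_ii
    n_io = n_ix - n_ii
    return n_ii, n_oi, n_io, n_xx - n_ii - n_oi - n_io


def raw_freq(n_ii, n_ix_xi, n_xx):
    # Share of all bigrams that are the bigram being scored.
    return n_ii / n_xx


def dice(n_ii, n_ix_xi, n_xx):
    # Dice's coefficient: 2 * joint count / sum of the two marginal counts.
    n_ix, n_xi = n_ix_xi
    return 2 * n_ii / (n_ix + n_xi)


def pmi(n_ii, n_ix_xi, n_xx):
    # Pointwise mutual information in bits; undefined when n_ii == 0.
    n_ix, n_xi = n_ix_xi
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)


def phi_sq(n_ii, n_ix_xi, n_xx):
    # Square of the Pearson correlation coefficient of the 2x2 table.
    n_ii, n_oi, n_io, n_oo = _cells(n_ii, n_ix_xi, n_xx)
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return numerator / denominator


def chi_sq(n_ii, n_ix_xi, n_xx):
    # Chi-square as phi-square multiplied by the number of bigrams.
    return n_xx * phi_sq(n_ii, n_ix_xi, n_xx)


# Example marginals: the bigram occurs twice, w1 leads 3 bigrams,
# w2 ends 2 bigrams, and there are 9 bigrams in total.
example = (2, (3, 2), 9)
for measure in (raw_freq, dice, pmi, phi_sq, chi_sq):
    print(measure.__name__, measure(*example))
```

For these example counts Dice's coefficient is 0.8 and the PMI is log2(3), since the bigram is seen three times as often as independence of the two words would predict.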