nltk.tag.TnT

class nltk.tag.TnT(unk=None, Trained=False, N=1000, C=False)[source]

TnT - Statistical POS tagger

IMPORTANT NOTES:

  • DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
    • It is possible to provide an untrained POS tagger to create tags for unknown words; see the __init__ function
  • SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
    • Due to the nature of this tagger, it works best when trained over sentence delimited input.
    • However, it still produces good results if the training data and testing data are split on all punctuation, e.g. [,.?!]
    • Input for training is expected to be a list of sentences where each sentence is a list of (word, tag) tuples
    • Input for the tag function is a single sentence; input for the tagdata function is a list of sentences. Output is of a similar form
  • Function provided to process text that is unsegmented
    • Please see basic_sent_chop()

TnT uses a second order Markov model to produce tags for a sequence of input, specifically:

argmax [Prod(P(t_i | t_i-1, t_i-2) * P(w_i | t_i))] * P(t_T+1 | t_T)

i.e. the tag sequence that maximizes the product of these probabilities

The set of possible tags for a given word is derived from the training data. It is the set of all tags that exact word has been assigned.
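The candidate-tag set can be sketched as a simple word-to-tags mapping built from the training data (a toy illustration of the idea, not NLTK's internal storage):

```python
from collections import defaultdict

train_data = [
    [('run', 'VB'), ('the', 'DT'), ('run', 'NN')],
]

# Map each exact word form to the set of tags it was seen with.
word_tags = defaultdict(set)
for sentence in train_data:
    for word, tag in sentence:
        word_tags[word].add(tag)

print(word_tags['run'])   # an ambiguous word accumulates several candidate tags
print(word_tags['the'])
```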

To speed up computation and gain numerical precision, we can use addition of logs instead of multiplication, specifically:

argmax [Sigma(log(P(t_i|t_i-1,t_i-2))+log(P(w_i|t_i)))] +
log(P(t_T+1|t_T))
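The switch to log space matters because a product of many small probabilities underflows to zero; a quick numeric check (plain Python, independent of NLTK):

```python
import math

# 100 small transition/emission probabilities, as for a long sentence.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p               # 1e-500 underflows to 0.0 in float arithmetic

log_sum = sum(math.log(p) for p in probs)   # stays finite and comparable

print(product)
print(log_sum)
```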

The probability of a tag for a given word is the linear interpolation of three Markov models: a zero-order, a first-order, and a second-order model.

P(t_i | t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i | t_i-1) +
l3*P(t_i | t_i-1, t_i-2)

where the weights l1, l2, l3 sum to 1.
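A sketch of the interpolation itself (the lambda values below are invented for illustration; NLTK estimates them from the training data):

```python
def interpolated_prob(p_uni, p_bi, p_tri, l1=0.2, l2=0.3, l3=0.5):
    """Linearly interpolate zero-, first-, and second-order estimates.

    The weights must sum to 1 so the result is still a probability.
    """
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Combining three estimates of P(t_i | t_i-1, t_i-2):
p = interpolated_prob(0.1, 0.2, 0.4)
print(p)
```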

A beam search is used to limit the memory usage of the algorithm. The degree of the beam can be changed using N in the initialization. N represents the maximum number of possible solutions to maintain while tagging.
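The role of N can be pictured as keeping only the N best partial hypotheses at each tagging step (a toy pruning step, not NLTK's implementation):

```python
def prune(hypotheses, n):
    """Keep the n highest-scoring (log_prob, tag_sequence) hypotheses."""
    return sorted(hypotheses, key=lambda h: h[0], reverse=True)[:n]

# Three partial hypotheses with their log probabilities:
hyps = [(-2.0, ['DT', 'NN']), (-9.5, ['DT', 'VB']), (-4.1, ['NN', 'NN'])]
kept = prune(hyps, 2)
print(kept)
```

A larger N keeps more hypotheses alive (more memory, closer to exhaustive search); a smaller N prunes more aggressively.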

It is possible (via the C flag) to differentiate the tags which are assigned to capitalized words. However, this does not result in a significant gain in the accuracy of the results.

Methods

__init__([unk, Trained, N, C]) Construct a TnT statistical tagger.
evaluate(gold) Score the accuracy of the tagger against the gold standard.
tag(data) Tags a single sentence
tag_sents(sentences) Apply self.tag() to each element of sentences.
tagdata(data) Tags each sentence in a list of sentences
train(data) Uses a set of tagged data to train the tagger.