nltk.tag

NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.
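
One common way to avoid None tags is to give the tagger a backoff tagger, which is consulted whenever the primary tagger has no answer. The following is a minimal sketch of that idea. The two-sentence training set and the choice of 'NN' as the fallback tag are made up for illustration; the only API assumed is the `backoff` parameter of UnigramTagger and the DefaultTagger class.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set (illustrative only, not from a real corpus).
train_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'AT'), ('cat', 'NN'), ('slept', 'VBD')],
]
sent = ['the', 'unicorn', 'slept']

# Without a backoff tagger, the unseen word 'unicorn' is tagged None.
plain = UnigramTagger(train_sents)
plain_tags = plain.tag(sent)

# With a backoff tagger, unseen words fall through to DefaultTagger.
with_backoff = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
backoff_tags = with_backoff.tag(sent)

print(plain_tags)
print(backoff_tags)
```

Because no corpus download is needed, this sketch can be run as-is; with a real corpus the same pattern applies, only the training data changes.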

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...
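
The score is the fraction of tokens whose predicted tag matches the gold-standard tag. As a rough sketch of that computation (not NLTK's own implementation), the helper below strips the tags with untag, re-tags the words, and counts the matches. The name `token_accuracy` and the toy held-out sentence are hypothetical.

```python
from nltk.tag import DefaultTagger, untag


def token_accuracy(tagger, gold_sents):
    """Return the fraction of tokens for which the tagger reproduces the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        # Re-tag the bare words and compare tag by tag with the gold standard.
        predicted = tagger.tag(untag(gold))
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Toy held-out data (made up for illustration).
gold_sents = [[('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]]

# A tagger that always answers 'NN' gets only 'dog' right.
score = token_accuracy(DefaultTagger('NN'), gold_sents)
print(score)
```

Computing the score by hand like this works with any object that provides a `tag` method, so it does not depend on which evaluation method a given NLTK release exposes.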

For more information, please consult chapter 5 of the NLTK Book.

Functions

load(resource_url[, format, cache, verbose, ...]) Load a given resource from the NLTK data package.
map_tag(source, target, source_tag) Maps the tag from the source tagset to the target tagset.
pos_tag(tokens[, tagset]) Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.
pos_tag_sents(sentences[, tagset]) Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
str2tuple(s[, sep]) Given the string representation of a tagged token, return the corresponding tuple representation.
tagset_mapping(source, target) Retrieve the mapping dictionary between tagsets.
tuple2str(tagged_token[, sep]) Given the tuple representation of a tagged token, return the corresponding string representation.
untag(tagged_sentence) Given a tagged sentence, return an untagged version of that sentence.
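
As a brief illustration of the conversion helpers listed above, the sketch below round-trips a tagged token between its string and tuple forms and strips the tags from a sentence. The sample tokens are taken from the examples earlier on this page; the variable names are arbitrary.

```python
from nltk.tag import str2tuple, tuple2str, untag

# Parse the "word/TAG" string form into a (token, tag) tuple.
tagged_tok = str2tuple('fly/NN')

# Convert back to the string form; '/' is the default separator.
as_string = tuple2str(tagged_tok)

# Drop the tags from a tagged sentence, keeping only the words.
words = untag([('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN')])

print(tagged_tok, as_string, words)
```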

Classes

AffixTagger([train, model, affix_length, ...]) A tagger that chooses a token’s tag based on a leading or trailing substring of its word string.
BigramTagger([train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag.
BrillTagger(initial_tagger, rules[, ...]) Brill’s transformational rule-based tagger.
BrillTaggerTrainer(initial_tagger, templates) A trainer for TBL (transformation-based learning) taggers.
CRFTagger([feature_func, verbose, training_opt]) A module for POS tagging using CRFSuite https://pypi.python.org/pypi/python-crfsuite
ClassifierBasedPOSTagger([feature_detector, ...]) A classifier-based part-of-speech tagger.
ClassifierBasedTagger([feature_detector, ...]) A sequential tagger that uses a classifier to choose the tag for each token in a sentence.
ContextTagger(context_to_tag[, backoff]) An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”.
DefaultTagger(tag) A tagger that assigns the same tag to every token.
HiddenMarkovModelTagger(symbols, states, ...) Hidden Markov model class, a generative model for labelling sequence data.
HiddenMarkovModelTrainer([states, symbols]) Algorithms for learning HMM parameters from training data.
HunposTagger(path_to_model[, path_to_bin, ...]) A class for POS tagging with HunPos.
NgramTagger(n[, train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the tags of the preceding n-1 words.
PerceptronTagger([load]) Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.
RegexpTagger(regexps[, backoff]) Regular Expression Tagger
SennaChunkTagger(path[, encoding])
SennaNERTagger(path[, encoding])
SennaTagger(path[, encoding])
SequentialBackoffTagger([backoff]) An abstract base class for taggers that tag words sequentially, left to right.
StanfordNERTagger(*args, **kwargs) A class for Named-Entity Tagging with Stanford Tagger.
StanfordPOSTagger(*args, **kwargs) A class for POS tagging with Stanford Tagger.
StanfordTagger(model_filename[, ...]) An interface to Stanford taggers.
TaggerI A processing interface for assigning a tag to each token in a list.
TnT([unk, Trained, N, C]) TnT - Statistical POS tagger
TrigramTagger([train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags.
UnigramTagger([train, model, backoff, ...]) Unigram Tagger
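
Not every tagger in the list above needs training data. As a small illustration, RegexpTagger assigns tags from an ordered list of (pattern, tag) pairs, trying each pattern in turn. The patterns below are made up for this example and are far too coarse for real use.

```python
from nltk.tag import RegexpTagger

# Illustrative patterns; the first pattern that matches a token wins.
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*', 'NN'),                     # catch-all: everything else is a noun
]
regexp_tagger = RegexpTagger(patterns)

tags = regexp_tagger.tag(['running', '42', 'dog'])
print(tags)
```

Because of the catch-all final pattern, this tagger never returns None, which also makes it a reasonable candidate for the `backoff` argument of a trained tagger.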