nltk.tag
NLTK Taggers
This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A “tag” is a case-sensitive string that specifies some property of a token,
such as its part of speech. Tagged tokens are encoded as tuples
(token, tag). For example, the following tagged token combines
the word 'fly' with a noun part-of-speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
An off-the-shelf tagger is available. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
Note that words that the tagger has not seen during training receive a tag of None.
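One common way to avoid None tags is to pass a backoff tagger, which is consulted whenever the main tagger has no answer. The sketch below is illustrative only: the two-sentence training set is made up for this example (it is not Brown corpus data), and 'NN' is an arbitrary choice of default tag.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set (an assumption for illustration, not corpus data).
train_sents = [
    [('the', 'AT'), ('rate', 'NN'), ('is', 'BEZ'), ('high', 'JJ')],
    [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN')],
]
sent = ['the', 'unemployment', 'rate']

# Without a backoff, the unseen word 'unemployment' is tagged None.
plain = UnigramTagger(train_sents)
plain_tags = plain.tag(sent)

# With a backoff, unseen words fall through to DefaultTagger,
# which assigns the same tag ('NN') to every token it is asked about.
backed = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
backed_tags = backed.tag(sent)

print(plain_tags)
print(backed_tags)
```

A default of 'NN' is only a crude guess for unknown words; a RegexpTagger or AffixTagger could be placed in the same backoff position if word shape is informative.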
We evaluate a tagger on data that was not seen during training:
>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...
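The score above is, in essence, the fraction of tokens whose predicted tag matches the gold-standard tag. The following sketch recomputes that fraction by hand on made-up data, to illustrate the metric rather than to reproduce the library's implementation; the sentences and the name token_accuracy are assumptions introduced for this example.

```python
from nltk.tag import UnigramTagger, untag

# Hand-made training and held-out sentences (illustrative, not corpus data).
train_sents = [
    [('the', 'AT'), ('rate', 'NN'), ('is', 'BEZ'), ('high', 'JJ')],
    [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN')],
]
gold_sents = [
    [('the', 'AT'), ('unemployment', 'NN'), ('rate', 'NN')],
]

tagger = UnigramTagger(train_sents)

correct = 0
total = 0
for gold in gold_sents:
    # Strip the gold tags, re-tag the bare words, and compare token by token.
    predicted = tagger.tag(untag(gold))
    for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        total += 1
        if gold_tag == predicted_tag:
            correct += 1

token_accuracy = correct / total
print(correct, total, token_accuracy)
```

Here the unseen word 'unemployment' is tagged None, which counts as an error, so two of the three tokens are correct.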
For more information, please consult chapter 5 of the NLTK Book.
Functions
load (resource_url[, format, cache, verbose, ...]) |
Load a given resource from the NLTK data package. |
map_tag (source, target, source_tag) |
Maps the tag from the source tagset to the target tagset. |
pos_tag (tokens[, tagset]) |
Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens. |
pos_tag_sents (sentences[, tagset]) |
Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens. |
str2tuple (s[, sep]) |
Given the string representation of a tagged token, return the corresponding tuple representation. |
tagset_mapping (source, target) |
Retrieve the mapping dictionary between tagsets. |
tuple2str (tagged_token[, sep]) |
Given the tuple representation of a tagged token, return the corresponding string representation. |
untag (tagged_sentence) |
Given a tagged sentence, return an untagged version of that sentence. |
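As a brief illustration of the conversion helpers listed above, the following sketch round-trips between the string and tuple representations of a tagged token and then strips the tags from a sentence. The sample words and tags are made up for the example, and the default separator '/' is assumed.

```python
from nltk.tag import str2tuple, tuple2str, untag

# String form -> (token, tag) tuple, using the default '/' separator.
tagged_tok = str2tuple('fly/NN')

# (token, tag) tuple -> string form.
as_string = tuple2str(('fly', 'NN'))

# Build a tagged sentence from a whitespace-separated tagged string,
# then drop the tags again to recover the bare words.
tagged_sent = [str2tuple(t) for t in 'John/NNP saw/VBD Mary/NNP'.split()]
words = untag(tagged_sent)

print(tagged_tok, as_string, words)
```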
Classes
AffixTagger ([train, model, affix_length, ...]) |
A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. |
BigramTagger ([train, model, backoff, ...]) |
A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag. |
BrillTagger (initial_tagger, rules[, ...]) |
Brill’s transformational rule-based tagger. |
BrillTaggerTrainer (initial_tagger, templates) |
A trainer for TBL (transformation-based learning) taggers. |
CRFTagger ([feature_func, verbose, training_opt]) |
A module for POS tagging using CRFSuite https://pypi.python.org/pypi/python-crfsuite |
ClassifierBasedPOSTagger ([feature_detector, ...]) |
A classifier-based part-of-speech tagger. |
ClassifierBasedTagger ([feature_detector, ...]) |
A sequential tagger that uses a classifier to choose the tag for each token in a sentence. |
ContextTagger (context_to_tag[, backoff]) |
An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”. |
DefaultTagger (tag) |
A tagger that assigns the same tag to every token. |
HiddenMarkovModelTagger (symbols, states, ...) |
Hidden Markov model class, a generative model for labelling sequence data. |
HiddenMarkovModelTrainer ([states, symbols]) |
Algorithms for learning HMM parameters from training data. |
HunposTagger (path_to_model[, path_to_bin, ...]) |
A class for POS tagging with HunPos. |
NgramTagger (n[, train, model, backoff, ...]) |
A tagger that chooses a token’s tag based on its word string and on the preceding n-1 words’ tags. |
PerceptronTagger ([load]) |
Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal. |
RegexpTagger (regexps[, backoff]) |
Regular Expression Tagger |
SennaChunkTagger (path[, encoding]) |
|
SennaNERTagger (path[, encoding]) |
|
SennaTagger (path[, encoding]) |
|
SequentialBackoffTagger ([backoff]) |
An abstract base class for taggers that tag words sequentially, left to right. |
StanfordNERTagger (*args, **kwargs) |
A class for Named-Entity Tagging with Stanford Tagger. |
StanfordPOSTagger (*args, **kwargs) |
A class for POS tagging with Stanford Tagger. |
StanfordTagger (model_filename[, ...]) |
An interface to Stanford taggers. |
TaggerI |
A processing interface for assigning a tag to each token in a list. |
TnT ([unk, Trained, N, C]) |
TnT - Statistical POS tagger |
TrigramTagger ([train, model, backoff, ...]) |
A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags. |
UnigramTagger ([train, model, backoff, ...]) |
Unigram Tagger |
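Several of the classes above accept a backoff argument, so they can be chained. The sketch below combines a RegexpTagger with a DefaultTagger; the three patterns and the sample sentence are assumptions chosen for illustration and are far too small for real use.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Patterns are tried in order; the first one that matches a token wins.
# These three rules are illustrative only.
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
]

# Tokens that match no pattern are passed to the backoff tagger,
# which tags every token 'NN'.
tagger = RegexpTagger(patterns, backoff=DefaultTagger('NN'))

tagged = tagger.tag(['She', 'walked', '3', 'miles', 'running'])
for word, tag in tagged:
    print(word, '->', tag)
```

Neither tagger needs training data, which makes this kind of chain a convenient last resort behind a trained UnigramTagger or BigramTagger.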