nltk.tag.RegexpTagger
¶
-
class
nltk.tag.
RegexpTagger
(regexps, backoff=None)[source]¶ Regular Expression Tagger
The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:
>>> from nltk.corpus import brown >>> from nltk.tag import RegexpTagger >>> test_sent = brown.sents(categories='news')[0] >>> regexp_tagger = RegexpTagger( ... [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers ... (r'(The|the|A|a|An|an)$', 'AT'), # articles ... (r'.*able$', 'JJ'), # adjectives ... (r'.*ness$', 'NN'), # nouns formed from adjectives ... (r'.*ly$', 'RB'), # adverbs ... (r'.*s$', 'NNS'), # plural nouns ... (r'.*ing$', 'VBG'), # gerunds ... (r'.*ed$', 'VBD'), # past tense verbs ... (r'.*', 'NN') # nouns (default) ... ]) >>> regexp_tagger <Regexp Tagger: size=9> >>> regexp_tagger.tag(test_sent) [('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'), ('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'), ("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'), ('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'), ('place', 'NN'), ('.', 'NN')]
Parameters: regexps (list(tuple(str, str))) – A list of (regexp, tag)
pairs, each of which indicates that a word matchingregexp
should be tagged withtag
. The pairs will be evalutated in order. If none of the regexps match a word, then the optional backoff tagger is invoked, else it is assigned the tag None.
Methods¶
__init__ (regexps[, backoff]) |
|
choose_tag (tokens, index, history) |
|
decode_json_obj (obj) |
|
encode_json_obj () |
|
evaluate (gold) |
Score the accuracy of the tagger against the gold standard. |
tag (tokens) |
|
tag_one (tokens, index, history) |
Determine an appropriate tag for the specified token, and return that tag. |
tag_sents (sentences) |
Apply self.tag() to each element of sentences. |
unicode_repr () |