nltk.tokenize
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
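For example, if the text arrives as UTF-8 encoded bytes (a minimal sketch, assuming Python 3, where decoding yields a str that can be tokenized directly):
>>> raw = b'caf\xc3\xa9s cost $3.88'
>>> word_tokenize(raw.decode("utf8"))
['cafés', 'cost', '$', '3.88']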
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
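Since the spans have the same semantics as string slices, the original substrings can be recovered directly from them:
>>> spans = list(WhitespaceTokenizer().span_tokenize(s))
>>> [s[start:end] for (start, end) in spans[:5]]
['Good', 'muffins', 'cost', '$3.88', 'in']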
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
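For example, a RegexpTokenizer constructed with a custom pattern can keep currency amounts together as single tokens (the pattern below is only illustrative):
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']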
For further information, please see Chapter 3 of the NLTK book.
Functions
casual_tokenize(text[, preserve_case, ...])
    Convenience function for wrapping the tokenizer.
line_tokenize(text[, blanklines])
    Tokenize text into its lines, optionally discarding blank lines.
load(resource_url[, format, cache, verbose, ...])
    Load a given resource from the NLTK data package.
regexp_span_tokenize(s, regexp)
    Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.
regexp_tokenize(text, pattern[, gaps, ...])
    Return a tokenized copy of text.
sent_tokenize(text[, language])
    Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
string_span_tokenize(s, sep)
    Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.
word_tokenize(text[, language])
    Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
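As a quick illustration of the convenience functions above, sent_tokenize accepts a language argument that selects the corresponding Punkt model (a sketch, assuming the German model from the punkt data package is installed; the sample sentence is illustrative):
>>> from nltk.tokenize import sent_tokenize
>>> german = "Gute Muffins kosten 3,88 Euro. Bitte kauf mir zwei davon. Danke."
>>> sent_tokenize(german, language="german")
['Gute Muffins kosten 3,88 Euro.', 'Bitte kauf mir zwei davon.', 'Danke.']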
Classes
BlanklineTokenizer()
    Tokenize a string, treating any sequence of blank lines as a delimiter.
LineTokenizer([blanklines])
    Tokenize a string into its lines, optionally discarding blank lines.
MWETokenizer([mwes, separator])
    A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
PunktSentenceTokenizer([train_text, ...])
    A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries.
RegexpTokenizer(pattern[, gaps, ...])
    A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
SExprTokenizer([parens, strict])
    A tokenizer that divides strings into s-expressions.
SpaceTokenizer
    Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
StanfordTokenizer([path_to_jar, encoding, ...])
    Interface to the Stanford Tokenizer.
TabTokenizer
    Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
TextTilingTokenizer([w, k, ...])
    Tokenize a document into topical sections using the TextTiling algorithm.
TreebankWordTokenizer
    The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank.
TweetTokenizer([preserve_case, reduce_len, ...])
    Tokenizer for tweets.
WhitespaceTokenizer()
    Tokenize a string on whitespace (space, tab, newline).
WordPunctTokenizer()
    Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
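For example, the MWETokenizer class can retokenize the output of word_tokenize so that multi-word expressions such as "New York" become single tokens (a minimal sketch using the sample string s from above):
>>> from nltk.tokenize import MWETokenizer
>>> mwe = MWETokenizer([('New', 'York')], separator='_')
>>> mwe.tokenize(word_tokenize(s))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']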