NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression-based tokenizer, which splits text on whitespace and punctuation:

>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

We can also operate at the level of sentences, using the sentence tokenizer directly as follows:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
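Because each span has the same semantics as a string slice, the covered tokens can be recovered by slicing the original string. The following is a minimal pure-Python sketch of whitespace span tokenization for illustration only, not NLTK's actual implementation:

```python
import re

s = 'Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.'

def whitespace_spans(text):
    """Yield (start, end) offsets for each run of non-whitespace characters."""
    for m in re.finditer(r'\S+', text):
        yield m.span()

spans = list(whitespace_spans(s))
# Slicing the original string with each span recovers the token it covers.
tokens = [s[start:end] for start, end in spans]
```

The first span (0, 4) slices out 'Good', and the last, (66, 73), slices out 'Thanks.', matching the output shown above.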

There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.

For further information, please see Chapter 3 of the NLTK book.


casual_tokenize(text[, preserve_case, ...]) Convenience function wrapping TweetTokenizer, for casual text such as tweets.
line_tokenize(text[, blanklines]) Tokenize a string into its lines, optionally discarding blank lines.
load(resource_url[, format, cache, verbose, ...]) Load a given resource from the NLTK data package.
regexp_span_tokenize(s, regexp) Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.
regexp_tokenize(text, pattern[, gaps, ...]) Return a tokenized copy of text.
sent_tokenize(text[, language]) Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
string_span_tokenize(s, sep) Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.
word_tokenize(text[, language]) Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
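The two span functions above share a "gap-splitting" contract: the pattern (or separator) matches the material *between* tokens, and the offsets of the remaining stretches are yielded. A pure-Python sketch of that contract, with the caveat that edge-case handling (e.g. of empty tokens) may differ from NLTK's implementation:

```python
import re

def gap_spans(s, regexp):
    """Yield (start, end) offsets of the text between successive matches
    of regexp, treating each match as a separator (a 'gap')."""
    pos = 0
    for m in re.finditer(regexp, s):
        if m.start() > pos:  # skip zero-width tokens between adjacent separators
            yield (pos, m.start())
        pos = m.end()
    if pos < len(s):
        yield (pos, len(s))

spans = list(gap_spans('Good  muffins cost', r'\s+'))
```

Here the whitespace runs act as separators, so the double space between 'Good' and 'muffins' produces no empty token.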


BlanklineTokenizer() Tokenize a string, treating any sequence of blank lines as a delimiter.
LineTokenizer([blanklines]) Tokenize a string into its lines, optionally discarding blank lines.
MWETokenizer([mwes, separator]) A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
PunktSentenceTokenizer([train_text, ...]) A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
RegexpTokenizer(pattern[, gaps, ...]) A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
SExprTokenizer([parens, strict]) A tokenizer that divides strings into s-expressions.
SpaceTokenizer Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
StanfordTokenizer([path_to_jar, encoding, ...]) Interface to the Stanford Tokenizer.
TabTokenizer Tokenize a string using the tab character as a delimiter, which is the same as s.split('\t').
TextTilingTokenizer([w, k, ...]) Tokenize a document into topical sections using the TextTiling algorithm.
TreebankWordTokenizer The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank.
TweetTokenizer([preserve_case, reduce_len, ...]) Tokenizer for tweets.
WhitespaceTokenizer() Tokenize a string on whitespace (space, tab, newline).
WordPunctTokenizer() Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
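The regexp listed for WordPunctTokenizer, \w+|[^\w\s]+, can be tried out directly with Python's re module. This reproduces only the pattern, not the tokenizer class itself:

```python
import re

text = 'Good muffins cost $3.88 in New York.'
# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (the WordPunctTokenizer regexp).
tokens = re.findall(r'\w+|[^\w\s]+', text)
# '$3.88' splits into '$', '3', '.', '88', as in the example above.
```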