nltk.tokenize
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
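For example, if the text arrives as UTF-8 encoded bytes (a minimal sketch, assuming Python 3, where decoding yields a str that can be tokenized directly):
>>> raw = b'caf\xc3\xa9s cost $3.88'
>>> word_tokenize(raw.decode("utf8"))
['cafés', 'cost', '$', '3.88']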
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
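Since the spans have the same semantics as string slices, the original substrings can be recovered directly from them:
>>> spans = list(WhitespaceTokenizer().span_tokenize(s))
>>> [s[start:end] for (start, end) in spans[:5]]
['Good', 'muffins', 'cost', '$3.88', 'in']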
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
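For example, a RegexpTokenizer constructed with a custom pattern can keep currency amounts together as single tokens (the pattern below is only illustrative):
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']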
For further information, please see Chapter 3 of the NLTK book.
Functions
casual_tokenize(text[, preserve_case, ...])
    Convenience function for wrapping the tokenizer.
line_tokenize(text[, blanklines])
    Tokenize text into its lines, optionally discarding blank lines.
load(resource_url[, format, cache, verbose, ...])
    Load a given resource from the NLTK data package.
regexp_span_tokenize(s, regexp)
    Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.
regexp_tokenize(text, pattern[, gaps, ...])
    Return a tokenized copy of text.
sent_tokenize(text[, language])
    Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
string_span_tokenize(s, sep)
    Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.
word_tokenize(text[, language])
    Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
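As a quick illustration of the convenience functions above, sent_tokenize accepts a language argument that selects the corresponding Punkt model (a sketch, assuming the German model from the punkt data package is installed; the sample sentence is illustrative):
>>> from nltk.tokenize import sent_tokenize
>>> german = "Gute Muffins kosten 3,88 Euro. Bitte kauf mir zwei davon. Danke."
>>> sent_tokenize(german, language="german")
['Gute Muffins kosten 3,88 Euro.', 'Bitte kauf mir zwei davon.', 'Danke.']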
Classes
BlanklineTokenizer()
    Tokenize a string, treating any sequence of blank lines as a delimiter.
LineTokenizer([blanklines])
    Tokenize a string into its lines, optionally discarding blank lines.
MWETokenizer([mwes, separator])
    A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
PunktSentenceTokenizer([train_text, ...])
    A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries.
RegexpTokenizer(pattern[, gaps, ...])
    A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
SExprTokenizer([parens, strict])
    A tokenizer that divides strings into s-expressions.
SpaceTokenizer
    Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
StanfordTokenizer([path_to_jar, encoding, ...])
    Interface to the Stanford Tokenizer.
TabTokenizer
    Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
TextTilingTokenizer([w, k, ...])
    Tokenize a document into topical sections using the TextTiling algorithm.
TreebankWordTokenizer
    The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank.
TweetTokenizer([preserve_case, reduce_len, ...])
    Tokenizer for tweets.
WhitespaceTokenizer()
    Tokenize a string on whitespace (space, tab, newline).
WordPunctTokenizer()
    Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
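For example, the MWETokenizer class can retokenize the output of word_tokenize so that multi-word expressions such as "New York" become single tokens (a minimal sketch using the sample string s from above):
>>> from nltk.tokenize import MWETokenizer
>>> mwe = MWETokenizer([('New', 'York')], separator='_')
>>> mwe.tokenize(word_tokenize(s))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']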