nltk.tokenize.WordPunctTokenizer

class nltk.tokenize.WordPunctTokenizer

Tokenize a text into alternating runs of word and non-word characters, using the regexp \w+|[^\w\s]+. Each token is either a maximal run of word characters (letters, digits, and underscore) or a maximal run of non-word, non-space characters (punctuation); whitespace is discarded.

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
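
For comparison, the same tokens can be produced directly with the standard-library re module; a minimal sketch of the equivalence (not part of the NLTK API):

>>> import re
>>> re.findall(r"\w+|[^\w\s]+", s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']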

Methods

__init__()
span_tokenize(text) Identify the tokens using integer offsets (start_i, end_i), where text[start_i:end_i] is the corresponding token.
span_tokenize_sents(strings) Apply self.span_tokenize() to each element of strings.
tokenize(text) Return a tokenized copy of text.
tokenize_sents(strings) Apply self.tokenize() to each element of strings.
unicode_repr()
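
For illustration, a minimal sketch of the offset-based and batch methods; the expected output follows from the regexp above. span_tokenize() yields (start, end) offset pairs rather than token strings, which is why it is wrapped in list() here:

>>> tok = WordPunctTokenizer()
>>> list(tok.span_tokenize("Good muffins cost $3.88"))
[(0, 4), (5, 12), (13, 17), (18, 19), (19, 20), (20, 21), (21, 23)]
>>> tok.tokenize_sents(["Hello world!", "Thanks."])
[['Hello', 'world', '!'], ['Thanks', '.']]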