nltk.tokenize.WordPunctTokenizer

class nltk.tokenize.WordPunctTokenizer

    Tokenize a text into a sequence of alphabetic and non-alphabetic
    characters, using the regexp \w+|[^\w\s]+.

    >>> from nltk.tokenize import WordPunctTokenizer
    >>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
    >>> WordPunctTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
    '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
Methods
__init__()

span_tokenize(text)

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(text)

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.

unicode_repr()
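Because the tokenizer is defined entirely by the regexp above, both tokenize() and span_tokenize() can be sketched with the standard re module alone. The following is a minimal illustration of that behavior, not NLTK's actual implementation; the sample string is chosen for this sketch:

```python
import re

# The same pattern WordPunctTokenizer uses: runs of word characters,
# or runs of non-word, non-whitespace characters.
PATTERN = re.compile(r"\w+|[^\w\s]+")

s = "Good muffins cost $3.88."

# Equivalent of tokenize(): the matched substrings.
tokens = PATTERN.findall(s)

# Equivalent of span_tokenize(): (start, end) offsets such that
# s[start:end] recovers each token.
spans = [m.span() for m in PATTERN.finditer(s)]

print(tokens)    # ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.']
print(spans[:2]) # [(0, 4), (5, 12)]
```

Note how "$3.88." splits into five tokens: the regexp alternation keeps digit runs and punctuation runs separate, which is why the doctest above breaks "$3.88" into '$', '3', '.', '88'.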