nltk.TreebankWordTokenizer

class nltk.TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:
- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
Methods

span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(text)
    Return a tokenized copy of text.

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.
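The doctest above covers tokenize(), but span_tokenize() has no example. A minimal sketch (assuming NLTK is installed) showing how its (start_i, end_i) offsets slice back into the original string:

```python
from nltk.tokenize import TreebankWordTokenizer

s = "Good muffins cost $3.88."
tokenizer = TreebankWordTokenizer()

# span_tokenize yields (start, end) offset pairs into the original string
spans = list(tokenizer.span_tokenize(s))

# slicing the original string at each span recovers the token text
tokens = [s[start:end] for start, end in spans]
print(tokens)
```

Unlike tokenize(), the spans preserve a mapping back to character positions in the input, which is useful for highlighting or annotation tasks.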