nltk.tokenize.TreebankWordTokenizer
class nltk.tokenize.TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:
- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
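Because a period is only split off at the very end of the text (note 'York.' and 'them.' above), longer passages are normally segmented with sent_tokenize() first and each sentence tokenized separately. A minimal sketch of that pipeline, assuming the Punkt sentence models are installed (e.g. via nltk.download('punkt')); the output shown is what the default models are expected to produce:

>>> from nltk.tokenize import sent_tokenize
>>> text = "Good muffins cost $3.88 in New York. Please buy me two of them."
>>> [TreebankWordTokenizer().tokenize(sent) for sent in sent_tokenize(text)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.']]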
Methods
span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token (see the example below this table).

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.
tokenize(text)
    Return a tokenized copy of text.
tokenize_sents(strings)
    Apply self.tokenize() to each element of strings (see the example below this table).
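The spans returned by span_tokenize() can be sliced out of the original string to recover the tokens, which is useful for highlighting or stand-off annotation. A short sketch; the offsets shown are what the tokenizer is expected to produce for this input:

>>> s = "Good muffins cost $3.88."
>>> list(TreebankWordTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 19), (19, 23), (23, 24)]
>>> [s[start:end] for (start, end) in TreebankWordTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$', '3.88', '.']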
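tokenize_sents() is a convenience wrapper that maps tokenize() over a list of already-segmented sentences, and span_tokenize_sents() does the same for span_tokenize(). For example:

>>> TreebankWordTokenizer().tokenize_sents(["They'll save.", "Buy two."])
[['They', "'ll", 'save', '.'], ['Buy', 'two', '.']]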