`nltk.tokenize.TreebankWordTokenizer`

class nltk.tokenize.TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the tokenizer that `word_tokenize()` invokes. It assumes that the text has already been segmented into sentences, e.g. using `sent_tokenize()`.

This tokenizer performs the following steps:

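- split standard contractions, e.g. `don't` -> `do n't` and `they'll` -> `they 'll`
- treat most punctuation characters as separate tokens
- split off commas and single quotes when they are followed by whitespace
- separate a period only when it appears at the end of the input string (a period before a line break inside the string stays attached to its word)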
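The steps listed above can be illustrated with a few regular-expression substitutions. The following is a minimal sketch only: the function name `sketch_tokenize` and every pattern in it are assumptions made for illustration, not the rules NLTK actually ships, and it ignores many cases the real tokenizer handles (quote conversion, ellipses, special contractions, and so on).

```python
import re


def sketch_tokenize(text):
    """Simplified imitation of the listed steps; not NLTK's actual rule set."""
    # Treat common punctuation characters as separate tokens by padding them.
    text = re.sub(r"([;@#$%&?!\[\](){}<>])", r" \1 ", text)
    # Split off a comma or single quote only when whitespace (or the end) follows.
    text = re.sub(r"([,'])(\s|$)", r" \1 \2", text)
    # Separate a period only at the very end of the input string.
    text = re.sub(r"([^.])\.\s*$", r"\1 .", text)
    # Split standard contractions: can't -> ca n't, they'll -> they 'll.
    text = re.sub(r"(?i)n't\b", " n't", text)
    text = re.sub(r"(?i)'(ll|re|ve|s|m|d)\b", r" '\1", text)
    # Whitespace now marks every token boundary.
    return text.split()


print(sketch_tokenize("They'll save and invest more."))
print(sketch_tokenize("hi, my name can't hello,"))
```

Padding with spaces and splitting on whitespace at the end keeps each rule independent of the others, which is the general shape of a substitution-based tokenizer.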

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
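In the first example, `'York.'` and `'them.'` keep their periods because the whole multi-sentence string was passed in at once. The sketch below shows the intended usage of tokenizing one sentence at a time. The sentences are split by hand here as an assumption for illustration; in practice a sentence splitter such as `sent_tokenize()` would supply them.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Hand-split sentences (an assumption for this sketch); normally these
# would come from a sentence segmenter.
sentences = [
    "Good muffins cost $3.88 in New York.",
    "Please buy me two of them.",
    "Thanks.",
]

# Tokenize each sentence separately so every sentence-final period
# becomes its own token.
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```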

Methods

`span_tokenize`(s)	Identify the tokens using integer offsets `(start_i, end_i)`, where `s[start_i:end_i]` is the corresponding token.
`span_tokenize_sents`(strings)	Apply `self.span_tokenize()` to each element of `strings`.
`tokenize`(text)	Return a tokenized copy of `text` as a list of token strings.
`tokenize_sents`(strings)	Apply `self.tokenize()` to each element of `strings`.
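A brief sketch of how the batch and offset-based methods might be used. It assumes an NLTK release in which `TreebankWordTokenizer` implements `span_tokenize()` itself; in releases where the method is only inherited from the base interface, the call may raise `NotImplementedError`. The sample strings are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# tokenize_sents() applies tokenize() to each string and returns one
# token list per input string.
batch = tokenizer.tokenize_sents(["They'll save more.", "Thanks."])
print(batch)

# span_tokenize() reports (start, end) offsets into the original string.
# list() is used because the result may be a generator.
text = "Good muffins cost $3.88"
spans = list(tokenizer.span_tokenize(text))

# Slicing the original text with each span recovers the tokens.
tokens_from_spans = [text[start:end] for start, end in spans]
print(spans)
print(tokens_from_spans)
```

Offsets are useful when tokens need to be mapped back to positions in the source text, for example to highlight matches or align annotations.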

Attributes

`CONTRACTIONS2`
`CONTRACTIONS3`
`CONTRACTIONS4`
`ENDING_QUOTES`
`PARENS_BRACKETS`
`PUNCTUATION`
`STARTING_QUOTES`