nltk.TreebankWordTokenizer

class nltk.TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the tokenizer invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
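
For example, a longer text can first be segmented with sent_tokenize() and then tokenized sentence by sentence (a minimal sketch; sent_tokenize() needs the punkt model, downloadable via nltk.download('punkt')):

    >>> from nltk.tokenize import sent_tokenize, TreebankWordTokenizer
    >>> s = "I bought muffins. They were good."
    >>> tokenizer = TreebankWordTokenizer()
    >>> [tokenizer.tokenize(sent) for sent in sent_tokenize(s)]
    [['I', 'bought', 'muffins', '.'], ['They', 'were', 'good', '.']]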

This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll

  • treat most punctuation characters as separate tokens

  • split off commas and single quotes when followed by whitespace

  • separate periods that appear at the end of a line

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
    >>> TreebankWordTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
    >>> s = "They'll save and invest more."
    >>> TreebankWordTokenizer().tokenize(s)
    ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
    >>> s = "hi, my name can't hello,"
    >>> TreebankWordTokenizer().tokenize(s)
    ['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
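    >>> # most other punctuation, e.g. parentheses and '!', is also split off
    >>> # (output shown for the default settings)
    >>> TreebankWordTokenizer().tokenize("Hello (world)!")
    ['Hello', '(', 'world', ')', '!']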
    

Methods

span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(text)
    Return a tokenized copy of text.

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.
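
span_tokenize() yields offsets that index back into the original string, for example (a small sketch; the exact spans shown assume the default tokenization of this string):

    >>> s = "Good muffins cost $3.88."
    >>> spans = list(TreebankWordTokenizer().span_tokenize(s))
    >>> spans
    [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23), (23, 24)]
    >>> [s[start:end] for start, end in spans]
    ['Good', 'muffins', 'cost', '$', '3.88', '.']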