nltk.tokenize.StanfordTokenizer

class nltk.tokenize.StanfordTokenizer(path_to_jar=None, encoding=u'utf8', options=None, verbose=False, java_options=u'-mx1000m')[source]

Interface to the Stanford Tokenizer

>>> from nltk.tokenize import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']

Methods

__init__([path_to_jar, encoding, options, ...])

span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(s)
    Use the Stanford Tokenizer's PTBTokenizer to tokenize multiple sentences.

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.
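The span_tokenize() contract above guarantees that each (start_i, end_i) pair indexes back into the original string to recover its token. A minimal pure-Python sketch of that invariant, using a simple whitespace-based tokenizer as an illustrative stand-in (this is NOT the Stanford tokenizer, which shells out to the Java PTBTokenizer and requires the jar on the classpath):

```python
import re

def simple_span_tokenize(s):
    """Illustrative stand-in tokenizer: yield (start, end) offsets for
    whitespace-delimited tokens, honoring the span_tokenize() contract
    that s[start:end] == token."""
    for m in re.finditer(r"\S+", s):
        yield m.start(), m.end()

s = "Good muffins cost $3.88"
spans = list(simple_span_tokenize(s))
tokens = [s[start:end] for start, end in spans]
# Every span slices its token back out of the original string.
assert all(s[start:end] == tok for (start, end), tok in zip(spans, tokens))
```

The same relationship holds for any TokenizerI implementation that provides span_tokenize(): the spans are offsets into the untokenized input, which is useful when token positions must be mapped back to the source text.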