nltk.StanfordTokenizer

class nltk.StanfordTokenizer(path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m')

    Interface to the Stanford Tokenizer.
    >>> from nltk.tokenize import StanfordTokenizer
    >>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
    >>> StanfordTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
    >>> s = "The colour of the wall is blue."
    >>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
    ['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
Methods

__init__([path_to_jar, encoding, options, ...])

span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(s)
    Use the Stanford tokenizer's PTBTokenizer to tokenize multiple sentences.

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.
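The span_tokenize() contract above can be illustrated without the Stanford jar. The sketch below is a minimal, hypothetical whitespace tokenizer (not the Stanford implementation, which requires a Java runtime and path_to_jar) showing the invariant that each reported (start_i, end_i) pair slices the input back to its token:

```python
import re

def span_tokenize(s):
    """Yield (start, end) offsets for whitespace-delimited tokens.

    Illustrative only: the real StanfordTokenizer delegates to the
    Java PTBTokenizer, but its spans obey the same invariant.
    """
    for match in re.finditer(r"\S+", s):
        yield match.span()

s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# The invariant: s[start:end] reproduces each token exactly.
tokens = [s[start:end] for start, end in spans]
```

Here spans would be [(0, 4), (5, 12), (13, 17), (18, 23)] and tokens ['Good', 'muffins', 'cost', '$3.88']; note the Stanford PTBTokenizer would additionally split '$3.88' into '$' and '3.88', as shown in the doctest above.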