nltk.string_span_tokenize(s, sep)

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.

    >>> from nltk.tokenize.util import string_span_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''
    >>> list(string_span_tokenize(s, " "))
    [(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
    (38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
Parameters:
- s (str) – the string to be tokenized
- sep (str) – the token separator

Return type: iter(tuple(int, int))
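The span logic above can be sketched in pure Python. This is an illustrative reimplementation, not NLTK's actual source: it scans for each occurrence of sep and yields the (start, end) offsets of the text between occurrences, which is why two consecutive separators (as after "New York.") produce an empty span like (37, 37).

```python
def span_tokenize(s, sep):
    # Illustrative sketch of string_span_tokenize's behavior:
    # yield (start, end) offsets of separator-delimited pieces,
    # including zero-width spans between adjacent separators.
    start = 0
    while True:
        end = s.find(sep, start)
        if end == -1:
            # No more separators: the final span runs to the end of s.
            yield (start, len(s))
            return
        yield (start, end)
        start = end + len(sep)

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
spans = list(span_tokenize(s, " "))
print(spans)
# The offsets can be used to slice the original tokens back out:
print([s[a:b] for a, b in spans[:3]])  # → ['Good', 'muffins', 'cost']
```

Because only spaces are used as separators here, embedded newlines stay inside tokens (e.g. the span (18, 26) covers "$3.88\nin") — the function splits on sep alone, not on whitespace in general.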