nltk.tokenize.util.string_span_tokenize()

nltk.tokenize.util.string_span_tokenize(s, sep)

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, obtained by splitting the string at each occurrence of sep. Note that the offsets are into the original string, so slicing s with each span recovers the corresponding token.

>>> from nltk.tokenize.util import string_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(string_span_tokenize(s, " "))
[(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
(38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
Parameters:
  • s (str) – the string to be tokenized
  • sep (str) – the token separator
Return type:

iter(tuple(int, int))
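Because the function returns offsets rather than substrings, tokenization is non-destructive: the tokens can always be recovered by slicing the original string. The sketch below is a hypothetical plain-Python re-implementation of this behavior (it is not NLTK's actual code), shown here to illustrate how the spans are produced and consumed:

```python
def simple_string_span_tokenize(s, sep):
    # Hypothetical sketch of the span-tokenizing behavior, not NLTK's code:
    # yield (start, end) offsets of the substrings between occurrences of sep.
    left = 0
    while True:
        right = s.find(sep, left)
        if right == -1:
            # No further separator: emit the trailing token, if any.
            if left != len(s):
                yield (left, len(s))
            return
        if right != 0:
            yield (left, right)
        left = right + len(sep)

s = "Good muffins cost $3.88"
spans = list(simple_string_span_tokenize(s, " "))
# Slicing the original string with each span recovers the tokens.
tokens = [s[start:end] for start, end in spans]
```

As in the doctest above, adjacent separators (e.g. the double space after "New York.") produce an empty span such as (37, 37), which slices to an empty string; callers that want only non-empty tokens should filter for start < end.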