nltk.regexp_span_tokenize()

nltk.regexp_span_tokenize(s, regexp)

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.

>>> from nltk.tokenize.util import regexp_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(regexp_span_tokenize(s, r'\s'))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
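
Note in the output above that the double space after 'York.' produces no empty (37, 37) span: zero-width gaps between adjacent separator matches are skipped. Since the offsets index directly into s, the token strings themselves can be recovered by slicing:

>>> spans = list(regexp_span_tokenize(s, r'\s'))
>>> [s[start:end] for start, end in spans][:5]
['Good', 'muffins', 'cost', '$3.88', 'in']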
Parameters:
  • s (str) – the string to be tokenized
  • regexp (str) – regular expression that matches token separators (must not be empty)
Return type:
  iter(tuple(int, int))
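
For readers curious how such a span tokenizer works, the behaviour above can be reproduced with re.finditer. The following is an illustrative sketch consistent with the doctest output, not necessarily the library's own source; the helper name spans is hypothetical:

>>> from re import finditer
>>> def spans(s, regexp):
...     left = 0
...     for m in finditer(regexp, s):
...         right, nxt = m.span()
...         if right != left:  # skip zero-width gaps between adjacent separators
...             yield left, right
...         left = nxt
...     yield left, len(s)  # final span runs to the end of the string
>>> list(spans('Good muffins', r'\s'))
[(0, 4), (5, 12)]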