nltk.regexp_span_tokenize()
nltk.regexp_span_tokenize(s, regexp)

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.

    >>> from nltk.tokenize.util import regexp_span_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''
    >>> list(regexp_span_tokenize(s, r'\s'))
    [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
    (38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
Parameters:
- s (str) – the string to be tokenized
- regexp (str) – regular expression that matches token separators (must not be empty)

Return type: iter(tuple(int, int))
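
The behavior above can be illustrated with a short re-implementation built on re.finditer. This is a minimal sketch of the documented semantics, not NLTK's actual source, and the name regexp_span_tokenize_sketch is made up for this example:

    import re

    def regexp_span_tokenize_sketch(s, regexp):
        # Yield (start, end) spans of the text between successive
        # matches of regexp. Empty spans caused by adjacent
        # separators are skipped, which is why the doctest output
        # has no span for the double space before "Please".
        left = 0
        for m in re.finditer(regexp, s):
            right, next_left = m.span()
            if right != left:
                yield left, right
            left = next_left
        # Emit the final token after the last separator, if any.
        if left != len(s):
            yield left, len(s)

Because the function returns offsets rather than substrings, the tokens themselves can be recovered by slicing, e.g. [s[start:end] for start, end in regexp_span_tokenize(s, r'\s')].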