nltk.tokenize.RegexpTokenizer
class nltk.tokenize.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=56)

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
Parameters:
- pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
- gaps (bool) – True if this tokenizer's pattern should be used to find the separators between tokens; False if the pattern should be used to find the tokens themselves (see the sketch after this list).
- discard_empty (bool) – True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated when gaps is True.
- flags (int) – The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
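A short usage sketch showing how gaps and discard_empty affect the result; the sample sentence is illustrative only and not part of this reference:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York."
>>> # gaps=False (the default): the pattern matches the tokens themselves.
>>> RegexpTokenizer(r'\w+|\$[\d\.]+|\S+').tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
>>> # gaps=True: the pattern matches the separators, here runs of whitespace.
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
>>> # discard_empty=False keeps the empty tokens a gap pattern can produce,
>>> # e.g. when the string starts with a separator.
>>> RegexpTokenizer(r'\s+', gaps=True, discard_empty=False).tokenize(" leading space")
['', 'leading', 'space']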
Methods

- __init__(pattern[, gaps, discard_empty, flags])
- span_tokenize(text)
- span_tokenize_sents(strings) – Apply self.span_tokenize() to each element of strings.
- tokenize(text)
- tokenize_sents(strings) – Apply self.tokenize() to each element of strings.
- unicode_repr()
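For orientation, a brief sketch of how the span and batch variants relate to tokenize(); the strings used here are illustrative only:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> text = "Good muffins cost $3.88"
>>> # span_tokenize() yields (start, end) character offsets rather than the token strings.
>>> [(start, end, text[start:end]) for start, end in tokenizer.span_tokenize(text)]
[(0, 4, 'Good'), (5, 12, 'muffins'), (13, 17, 'cost'), (18, 23, '$3.88')]
>>> # tokenize_sents() applies tokenize() to each element of a list of strings;
>>> # span_tokenize_sents() does the same with span_tokenize().
>>> tokenizer.tokenize_sents(["Good muffins", "cost $3.88"])
[['Good', 'muffins'], ['cost', '$3.88']]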