nltk.SExprTokenizer
class nltk.SExprTokenizer(parens='()', strict=True)

A tokenizer that divides strings into s-expressions. An s-expression can be either:
- a parenthesized expression, including any nested parenthesized expressions, or
- a sequence of non-whitespace non-parenthesis characters.
For example, the string

    (a (b c)) d e (f)

consists of four s-expressions: "(a (b c))", "d", "e", and "(f)".

By default, the characters "(" and ")" are treated as open and close parentheses, but alternative strings may be specified.

Parameters:
- parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
- strict (bool) – If true, raise a ValueError when tokenizing an ill-formed sexpr, that is, one with unmatched open or close parentheses.
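
The two parameters can be illustrated with a short sketch. It assumes NLTK is installed and imports the class from nltk.tokenize; the input strings and variable names are illustrative only, and the lenient result reflects how unmatched parentheses are split when strict is false.

```python
from nltk.tokenize import SExprTokenizer

# Default delimiters: "(" and ")".
default_tokens = SExprTokenizer().tokenize("(a (b c)) d e (f)")
print(default_tokens)

# Alternative delimiters, passed as a two-character string.
brace_tokens = SExprTokenizer(parens="{}").tokenize("{a b {c d}} e f {g}")
print(brace_tokens)

# strict=True (the default): an unmatched open parenthesis raises ValueError.
try:
    SExprTokenizer().tokenize("(a (b c)")
    strict_error = None
except ValueError as err:
    strict_error = err
print("strict error:", strict_error)

# strict=False: unmatched close parentheses become their own tokens, and a
# trailing partial expression is returned as a single token.
lenient_tokens = SExprTokenizer(strict=False).tokenize("c) d) e (f (g")
print(lenient_tokens)
```

With strict=False the tokenizer never raises on ill-formed input, which can be convenient when scanning text that is only partly made of s-expressions.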
Methods

__init__([parens, strict]) |
span_tokenize(s) | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
span_tokenize_sents(strings) | Apply self.span_tokenize() to each element of strings.
tokenize(text) | Return a list of s-expressions extracted from text.
tokenize_sents(strings) | Apply self.tokenize() to each element of strings.
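
A minimal sketch of the batch and offset methods follows, assuming NLTK is installed. The span methods are inherited from the generic tokenizer interface, and in the NLTK source this class does not appear to override span_tokenize, so the call may raise NotImplementedError. The helper spans_from_tokens is hypothetical (not part of NLTK) and recovers offsets from the tokens in that case.

```python
from nltk.tokenize import SExprTokenizer

tokenizer = SExprTokenizer()

# tokenize_sents applies tokenize() to each string in a list.
batch = ["(a b) c (d)", "x (y (z))"]
batch_tokens = tokenizer.tokenize_sents(batch)
print(batch_tokens)


def spans_from_tokens(text, tokens):
    """Hypothetical helper: locate each token in text, scanning left to right."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


text = batch[0]
try:
    # May raise NotImplementedError if the class does not implement it.
    spans = list(tokenizer.span_tokenize(text))
except NotImplementedError:
    spans = spans_from_tokens(text, tokenizer.tokenize(text))

# Each (start_i, end_i) pair slices the corresponding token out of text.
for start_i, end_i in spans:
    print((start_i, end_i), text[start_i:end_i])
```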