nltk.SExprTokenizer

class nltk.SExprTokenizer(parens='()', strict=True)

A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or
  • a sequence of non-whitespace non-parenthesis characters.

For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).
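The splitting rule described above can be illustrated with a small pure-Python sketch. This is not NLTK's implementation: the function name split_sexprs is hypothetical, and the sketch assumes single-character parentheses and always treats unbalanced input as an error.

```python
def split_sexprs(text, open_paren="(", close_paren=")"):
    """Split text into top-level s-expressions (illustrative sketch only)."""
    tokens = []
    depth = 0      # current parenthesis nesting depth
    current = ""   # characters collected for the token in progress

    for ch in text:
        if ch == open_paren:
            # A parenthesis at depth 0 ends any bare token before it.
            if depth == 0 and current:
                tokens.append(current)
                current = ""
            depth += 1
            current += ch
        elif ch == close_paren:
            if depth == 0:
                raise ValueError("unmatched close parenthesis")
            depth -= 1
            current += ch
            # Returning to depth 0 completes a parenthesized expression.
            if depth == 0:
                tokens.append(current)
                current = ""
        elif ch.isspace() and depth == 0:
            # Whitespace outside parentheses separates bare tokens.
            if current:
                tokens.append(current)
                current = ""
        else:
            # Inside parentheses, whitespace is kept as part of the token.
            current += ch

    if depth > 0:
        raise ValueError("unmatched open parenthesis")
    if current:
        tokens.append(current)
    return tokens


result = split_sexprs("(a (b c)) d e (f)")
print(result)  # ['(a (b c))', 'd', 'e', '(f)']
```

Tracking the nesting depth is what lets the nested (b c) stay inside its enclosing expression rather than becoming a token of its own.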

By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.

Parameters:
  • parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
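  • strict (bool) – If true, raise an exception when tokenizing an ill-formed sexpr. If false, an unmatched close parenthesis is returned as its own token, and an unmatched open parenthesis causes the rest of the string to be returned as a single token.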
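As a brief illustration of both parameters, the sketch below uses curly braces as parentheses and then tokenizes an ill-formed string with and without strict. It assumes NLTK is installed, and that strict mode signals the error with ValueError, as recent NLTK releases do.

```python
from nltk.tokenize import SExprTokenizer

# Use curly braces instead of the default round parentheses.
brace_tokens = SExprTokenizer(parens="{}").tokenize("{a b {c d}} e f {g}")
print(brace_tokens)  # ['{a b {c d}}', 'e', 'f', '{g}']

ill_formed = "c) d) e (f (g"

# With strict=False, ill-formed input is tokenized on a best-effort basis.
lenient_tokens = SExprTokenizer(strict=False).tokenize(ill_formed)
print(lenient_tokens)  # ['c', ')', 'd', ')', 'e', '(f (g']

# With the default strict=True, the same input is rejected.
try:
    SExprTokenizer().tokenize(ill_formed)
    strict_raised = False
except ValueError as err:
    strict_raised = True
    print("ill-formed input:", err)
```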

Methods

  • __init__([parens, strict])
  • span_tokenize(s) – Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
  • span_tokenize_sents(strings) – Apply self.span_tokenize() to each element of strings.
  • tokenize(text) – Return a list of s-expressions extracted from text.
  • tokenize_sents(strings) – Apply self.tokenize() to each element of strings.