`nltk.tokenize.MWETokenizer`

class nltk.tokenize.MWETokenizer(mwes=None, separator='_'): A tokenizer that takes already-tokenized text and merges multi-word expressions into single tokens, joining their parts with separator.
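As a brief illustration of the merging behaviour described above, the following sketch builds a tokenizer from a few expressions and applies it to whitespace-split text. The example expressions and sentences are chosen for illustration only; it assumes NLTK is installed.

```python
from nltk.tokenize import MWETokenizer

# Each multi-word expression is given as a tuple of tokens.
tokenizer = MWETokenizer([("a", "little"), ("a", "little", "bit"), ("a", "lot")])
tokenizer.add_mwe(("in", "spite", "of"))

# The input must already be tokenized; here a plain whitespace split is used.
merged = tokenizer.tokenize(
    "In a little or a little bit or a lot in spite of".split()
)
print(merged)
# ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']

# An incomplete expression ("in spite" without "of") is left unmerged.
partial = tokenizer.tokenize("This is a test in spite".split())
print(partial)
# ['This', 'is', 'a', 'test', 'in', 'spite']

# A custom separator replaces the default underscore.
plus_tokenizer = MWETokenizer([("hors", "d'oeuvre")], separator="+")
custom = plus_tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
print(custom)
# ['An', "hors+d'oeuvre", 'tonight,', 'sir?']
```

Matching is done on exact token strings, so "In" at the start of the sentence is not treated as the start of "in spite of".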

Methods

__init__([mwes, separator]) Initialize the multi-word tokenizer with a list of expressions and a separator.

add_mwe(mwe) Add a multi-word expression to the lexicon (stored as a word trie)

span_tokenize(s) Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings) Apply self.span_tokenize() to each element of strings.

tokenize(text) Merge the multi-word expressions found in text, a list containing tokenized text, and return the resulting list of tokens.

tokenize_sents(strings) Apply self.tokenize() to each element of strings.
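The methods listed above can be combined as in the following sketch, which registers expressions one at a time with add_mwe and then processes several pre-tokenized sentences with tokenize_sents. The sentences and the "-" separator are illustrative choices, and the example assumes NLTK is installed.

```python
from nltk.tokenize import MWETokenizer

# Start with an empty lexicon and a non-default separator.
tokenizer = MWETokenizer(separator="-")
tokenizer.add_mwe(("New", "York"))
tokenizer.add_mwe(("machine", "learning"))

# tokenize_sents expects a list of sentences, each already a list of tokens.
sentences = [
    ["She", "moved", "to", "New", "York"],
    ["He", "studies", "machine", "learning"],
]
merged_sents = tokenizer.tokenize_sents(sentences)

for sent in merged_sents:
    print(sent)
# ['She', 'moved', 'to', 'New-York']
# ['He', 'studies', 'machine-learning']
```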