nltk.tokenize.MWETokenizer

class nltk.tokenize.MWETokenizer(mwes=None, separator='_')

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
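A short usage sketch may help make the merging behaviour concrete. It assumes NLTK is installed and that each expression is passed as a tuple of words; the sample sentences are illustrative only.

```python
from nltk.tokenize import MWETokenizer

# Build the tokenizer with a few multi-word expressions (tuples of words).
tokenizer = MWETokenizer([("a", "little"), ("a", "little", "bit"), ("a", "lot")])

# Expressions can also be added after construction.
tokenizer.add_mwe(("in", "spite", "of"))

# The input must already be tokenized: a list of words, not a raw string.
tokens = "In a little or a little bit or a lot in spite of".split()
merged = tokenizer.tokenize(tokens)
print(merged)
# ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']

# A different separator joins the words of each merged expression.
plus_tokenizer = MWETokenizer([("hors", "d'oeuvre")], separator="+")
merged_plus = plus_tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
print(merged_plus)
# ['An', "hors+d'oeuvre", 'tonight,', 'sir?']
```

Matching is case-sensitive here, which is why the leading 'In' is left alone while 'in spite of' is merged.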

Methods

__init__([mwes, separator]) Initialize the multi-word tokenizer with a list of expressions and a separator.
add_mwe(mwe) Add a multi-word expression to the lexicon (stored as a word trie)
span_tokenize(s) Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
span_tokenize_sents(strings) Apply self.span_tokenize() to each element of strings.
tokenize(text) Merge the multi-word expressions found in text, a list containing tokenized text.
tokenize_sents(strings) Apply self.tokenize() to each element of strings.
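The word-trie lexicon mentioned for add_mwe can be pictured with a small standalone sketch. This is not NLTK's implementation: the names build_trie, merge_mwes, and END are hypothetical, and the longest-match strategy is an assumption chosen for illustration.

```python
END = object()  # hypothetical sentinel marking the end of a complete expression


def build_trie(mwes):
    """Store each expression as a path of nested dicts, one level per word."""
    trie = {}
    for mwe in mwes:
        node = trie
        for word in mwe:
            node = node.setdefault(word, {})
        node[END] = True
    return trie


def merge_mwes(tokens, trie, separator="_"):
    """Merge the longest expression starting at each position, if any."""
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        j = i
        last_match = None  # end index of the longest complete expression so far
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                last_match = j
        if last_match is not None:
            out.append(separator.join(tokens[i:last_match]))
            i = last_match
        else:
            out.append(tokens[i])
            i += 1
    return out


trie = build_trie([("a", "little"), ("a", "little", "bit"), ("in", "spite", "of")])
result = merge_mwes("a little bit tired in spite of a little rest".split(), trie)
print(result)
# ['a_little_bit', 'tired', 'in_spite_of', 'a_little', 'rest']
```

Storing expressions that share a prefix along the same path lets a single left-to-right walk decide between 'a little' and 'a little bit' without rescanning the list of expressions.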