gensim.utils.tokenize()

gensim.utils.tokenize(text, lowercase=False, deacc=False, errors='strict', to_lower=False, lower=False)[source]

Iteratively yield tokens as unicode strings, optionally removing accent marks (deacc=True) and optionally lowercasing the unicode string by assigning True to one of the parameters lowercase, to_lower, or lower.

Input text may be either unicode or a utf8-encoded byte string.

The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).

>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
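The behavior described above can be sketched without gensim itself. The following is a minimal, hypothetical re-implementation (the names simple_tokenize, deaccent, and PAT_ALPHABETIC are illustrative, not gensim's actual internals): tokens are maximal runs of alphabetic characters, digits are excluded (note that '300' is dropped in the example above), and accent stripping is done via Unicode decomposition.

```python
import re
import unicodedata

# Maximal runs of alphabetic characters: word characters minus digits
# and underscore. This is why '300' does not appear in the output.
PAT_ALPHABETIC = re.compile(r'[^\W\d_]+', re.UNICODE)

def deaccent(text):
    # Decompose characters (NFD) and drop combining marks,
    # so e.g. 'nemůže' becomes 'nemuze'.
    norm = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in norm if unicodedata.category(ch) != 'Mn')

def simple_tokenize(text, lowercase=False, deacc=False):
    # Accept either unicode or a utf8-encoded byte string.
    if isinstance(text, bytes):
        text = text.decode('utf8')
    if deacc:
        text = deaccent(text)
    if lowercase:
        text = text.lower()
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
```

This sketch reproduces the example output: alphabetic tokens only, with accents removed when deacc=True.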