gensim.utils

This module contains various general utility functions.

Functions

any2unicode(text[, encoding, errors]) Convert a string (either a bytestring in encoding, or unicode) to unicode.
any2utf8(text[, errors, encoding]) Convert a string (either unicode, or a bytestring in encoding) to a bytestring in utf8.
check_output(*popenargs, **kwargs) Run command with arguments and return its output as a byte string.
chunkize(corpus, chunksize[, maxsize, as_numpy]) Split a stream of values into smaller chunks.
chunkize_serial(iterable, chunksize[, as_numpy]) Return elements from the iterable in chunksize-ed lists.
contextmanager(func) @contextmanager decorator.
copytree_hardlink(source, dest) Recursively copy a directory à la shutil.copytree, but hardlink files instead of copying them.
deaccent(text) Remove accentuation from the given string.
decode_htmlentities(text) Decode HTML entities in text, coded as hex, decimal or named.
dict_from_corpus(corpus) Scan corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).
file_or_filename(*args, **kwds) Return a file-like object ready to be read from the beginning.
getNS([host, port, broadcast, hmac_key]) Return a Pyro name server proxy.
get_max_id(corpus) Return the highest feature id that appears in the corpus.
get_my_ip() Try to obtain our external IP (from the Pyro name server’s point of view).
grouper(iterable, chunksize[, as_numpy]) Return elements from the iterable in chunksize-ed lists.
has_pattern() Check whether the pattern library is installed.
identity(p) Identity function, for flows that don’t accept a lambda (pickling etc.).
is_corpus(obj) Check whether obj is a corpus.
iteritems(d, **kw) Return an iterator over the (key, value) pairs of a dictionary.
keep_vocab_item(word, count, min_count[, ...])
lemmatize(content[, allowed_tags, light, ...]) This function is only available when the optional ‘pattern’ package is installed.
mock_data([n_items, dim, prob_nnz, lam]) Create a random gensim-style corpus, as a list of lists of (int, float) tuples, to be used as a mock corpus.
mock_data_row([dim, prob_nnz, lam]) Create a random gensim sparse vector.
pickle(obj, fname[, protocol]) Pickle object obj to file fname.
prune_vocab(vocab, min_reduce[, trim_rule]) Remove all entries from the vocab dictionary with count smaller than min_reduce.
pyro_daemon(name, obj[, random_suffix, ip, ...]) Register object with name server (starting the name server if not running yet) and block until the daemon is terminated.
qsize(queue) Return the (approximate) queue size where available; -1 where not (OS X).
randfname([prefix])
revdict(d) Reverse a dictionary mapping.
safe_unichr(intval)
simple_preprocess(doc[, deacc, min_len, max_len]) Convert a document into a list of tokens.
smart_extension(fname, ext)
smart_open(uri[, mode]) Open the given S3 / HDFS / filesystem file pointed to by uri for reading or writing.
synchronous(tlockname) A decorator to place an instance-based lock around a method.
to_unicode(text[, encoding, errors]) Convert a string (either a bytestring in encoding, or unicode) to unicode.
to_utf8(text[, errors, encoding]) Convert a string (either unicode, or a bytestring in encoding) to a bytestring in utf8.
tokenize(text[, lowercase, deacc, errors, ...]) Iteratively yield tokens as unicode strings, removing accent marks and optionally lowercasing the unicode string when True is assigned to one of the parameters lowercase, to_lower, or lower.
toptexts(query, texts, index[, n]) Debug function to help inspect the top n most similar documents (according to the similarity index given as index), to see if they are actually related to the query.
u(s) Text literal
unichr(i) Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.
unpickle(fname) Load pickled object from fname.
upload_chunked(server, docs[, chunksize, ...]) Memory-friendly upload of documents to a SimServer (or Pyro SimServer proxy).
wraps(wrapped[, assigned, updated]) Decorator factory to apply update_wrapper() to a wrapper function.
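
The one-line summaries above are terse, so here is a minimal standalone sketch of the behavior that four of them describe: chunking an iterable (chunkize_serial / grouper), removing accents (deaccent), finding the highest feature id (get_max_id), and reversing a mapping (revdict). It does not import gensim and is not gensim's implementation. Every name ending in _sketch is hypothetical, and the -1 result for an empty corpus is an assumption made for illustration.

```python
import itertools
import unicodedata


def chunks_sketch(iterable, chunksize):
    """Yield successive lists of at most `chunksize` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def deaccent_sketch(text):
    """Strip combining accent marks from a unicode string (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers nonspacing marks such as acute accents and carons.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def max_id_sketch(corpus):
    """Return the highest feature id in a corpus of [(id, value), ...] documents.

    Returning -1 for an empty corpus is an assumption of this sketch.
    """
    maxid = -1
    for document in corpus:
        for fieldid, _value in document:
            maxid = max(maxid, fieldid)
    return maxid


def revdict_sketch(d):
    """Swap keys and values; duplicate values keep only one of their keys."""
    return {value: key for key, value in d.items()}
```

For example, list(chunks_sketch(range(5), 2)) produces three chunks, the last holding the single leftover item, and deaccent_sketch("Šéf") drops both diacritics.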

Classes

ClippedCorpus(corpus[, max_docs])
FakeDict(num_terms) Objects of this class act as dictionaries that map integer->str(integer), for the half-open range of integers [0, num_terms).
InputQueue(q, corpus, chunksize, maxsize, ...)
NoCM
RepeatCorpus(corpus, reps) Used in the tutorial on distributed computing and likely not useful anywhere else.
RepeatCorpusNTimes(corpus, n)
SaveLoad Objects which inherit from this class have save/load functions, which un/pickle them to disk.
SlicedCorpus(corpus, slice_)
xrange(start, stop[, step]) -> xrange object
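
To make the FakeDict description concrete, here is a minimal sketch of a read-only mapping from integer to str(integer) over a fixed range. The class name FakeDictSketch is hypothetical, and the choice of which dictionary methods to provide is an assumption; this is not gensim's class.

```python
class FakeDictSketch:
    """Act like a dict mapping i -> str(i) for 0 <= i < num_terms (hypothetical)."""

    def __init__(self, num_terms):
        self.num_terms = num_terms

    def __getitem__(self, key):
        # Only ids inside the half-open range [0, num_terms) are valid keys.
        if 0 <= key < self.num_terms:
            return str(key)
        raise KeyError(key)

    def __contains__(self, key):
        return 0 <= key < self.num_terms

    def __len__(self):
        return self.num_terms

    def keys(self):
        # No per-key storage is needed: the keys are simply the range itself.
        return range(self.num_terms)
```

The point of such a class is that it stands in for a vocabulary without storing one entry per term, which matters when num_terms is large.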