
This package contains implementations of various streaming corpus I/O formats.


BleiCorpus(fname[, fname_vocab]) Corpus in Blei’s LDA-C format.
Dictionary([documents, prune_at]) Dictionary encapsulates the mapping between normalized words and their integer ids.
HashDictionary([documents, id_range, ...]) HashDictionary encapsulates the mapping between normalized words and their integer ids.
IndexedCorpus(fname[, index_fname])
LowCorpus(fname[, id2word, line2words]) List_Of_Words corpus handles input in GibbsLda++ format.
MalletCorpus(fname[, id2word, metadata]) Corpus in the MALLET format.
MmCorpus(fname) Corpus in the Matrix Market format.
ShardedCorpus(output_prefix, corpus[, dim, ...]) This corpus is designed for situations where you need to train a model on matrices, with a large number of iterations.
SvmLightCorpus(fname[, store_labels]) Corpus in SVMlight format.
TextCorpus([input]) Helper class to simplify the pipeline of getting bag-of-words vectors (= a gensim corpus) from plain text.
UciCorpus(fname[, fname_vocab]) Corpus in the UCI bag-of-words format.
WikiCorpus(fname[, processes, lemmatize, ...]) Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
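
The classes listed above share two ideas: a dictionary that maps normalized words to integer ids, and a corpus that is streamed from disk one document at a time instead of being loaded into memory. The following is a minimal pure-Python sketch of those two ideas, not the package's own implementation. The names TinyDictionary, TinyTextCorpus and stream_demo are hypothetical and exist only for illustration; the method name doc2bow and the attribute token2id are modelled loosely on Dictionary, and the one-document-per-line input layout is an assumption.

```python
import os
import tempfile
from collections import Counter


class TinyDictionary:
    """Hypothetical stand-in for a word <-> integer id mapping."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens, allow_update=False):
        """Convert a list of tokens into a sorted list of (id, count) pairs."""
        counts = Counter()
        for token in tokens:
            if token not in self.token2id:
                if not allow_update:
                    continue  # unknown words are skipped unless updates are allowed
                self.token2id[token] = len(self.token2id)
            counts[self.token2id[token]] += 1
        return sorted(counts.items())


class TinyTextCorpus:
    """Hypothetical streaming corpus: one document per line of a text file."""

    def __init__(self, fname, dictionary):
        self.fname = fname
        self.dictionary = dictionary

    def __iter__(self):
        # The file is re-opened on every pass, so only one line is held
        # in memory at a time and the corpus can be iterated repeatedly.
        with open(self.fname, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                yield self.dictionary.doc2bow(tokens, allow_update=True)


def stream_demo():
    """Write a two-document file, stream it back, and return the vectors."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fname = os.path.join(tmpdir, "docs.txt")
        with open(fname, "w", encoding="utf-8") as handle:
            handle.write("human computer interface\n")
            handle.write("computer system computer\n")
        corpus = TinyTextCorpus(fname, TinyDictionary())
        return list(corpus)


if __name__ == "__main__":
    for vector in stream_demo():
        print(vector)
```

Because the corpus object only implements __iter__, any consumer that loops over it sees documents lazily; the on-disk formats listed above differ mainly in how each document is encoded in the file.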