`gensim.corpora`

This package contains implementations of various streaming corpus I/O formats.

Classes

`BleiCorpus`(fname[, fname_vocab])	Corpus in Blei’s LDA-C format.
`Dictionary`([documents, prune_at])	Dictionary encapsulates the mapping between normalized words and their integer ids.
`HashDictionary`([documents, id_range, ...])	HashDictionary encapsulates the mapping between normalized words and their integer ids.
`IndexedCorpus`(fname[, index_fname])
`LowCorpus`(fname[, id2word, line2words])	List_Of_Words corpus handles input in GibbsLda++ format.
`MalletCorpus`(fname[, id2word, metadata])	Corpus in Mallet format, as described at http://mallet.cs.umass.edu/import.php.
`MmCorpus`(fname)	Corpus in the Matrix Market format.
`ShardedCorpus`(output_prefix, corpus[, dim, ...])	This corpus is designed for situations where you need to train a model on matrices, with a large number of iterations.
`SvmLightCorpus`(fname[, store_labels])	Corpus in SVMlight format.
`TextCorpus`([input])	Helper class to simplify the pipeline of getting bag-of-words vectors (= a gensim corpus) from plain text.
`UciCorpus`(fname[, fname_vocab])	Corpus in the UCI bag-of-words format.
`WikiCorpus`(fname[, processes, lemmatize, ...])	Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
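
The two ideas that recur in the table above, a mapping between words and integer ids and a corpus that streams bag-of-words vectors one document at a time, can be illustrated with a minimal pure-Python sketch. This is not gensim's implementation: `TinyDictionary` and `StreamingCorpus` are hypothetical names invented for this example, and the method name `doc2bow` and attribute `token2id` are only chosen to resemble those on gensim's `Dictionary`.

```python
from collections import Counter


class TinyDictionary:
    """Hypothetical stand-in for a token -> integer id mapping (not gensim's class)."""

    def __init__(self, documents=None):
        self.token2id = {}
        for doc in documents or []:
            self.add_document(doc)

    def add_document(self, tokens):
        # Assign the next free integer id to each previously unseen token.
        for tok in tokens:
            self.token2id.setdefault(tok, len(self.token2id))

    def doc2bow(self, tokens):
        # Count known tokens and return sorted (token_id, count) pairs;
        # tokens that are not in the mapping are silently skipped.
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())


class StreamingCorpus:
    """Hypothetical corpus that yields one bag-of-words vector per document."""

    def __init__(self, lines, dictionary):
        self.lines = lines  # any re-iterable of text lines (assumption)
        self.dictionary = dictionary

    def __iter__(self):
        # Documents are converted lazily, so only one is in memory at a time.
        for line in self.lines:
            yield self.dictionary.doc2bow(line.lower().split())


texts = [
    "human computer interaction",
    "computer system response time",
    "human system",
]
dictionary = TinyDictionary(t.split() for t in texts)
corpus = StreamingCorpus(texts, dictionary)
bows = list(corpus)
```

Because the corpus object only defines `__iter__`, it can be iterated repeatedly and never needs the full collection in memory, which is the property the streaming corpus classes listed above are built around.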