gensim.corpora
This package contains implementations of various streaming corpus I/O formats.
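To make "streaming" concrete, here is a minimal pure-Python sketch of the idea: documents are read from disk one at a time and turned into sparse bag-of-words vectors, so the whole corpus never has to fit in memory. `TinyDictionary` and `StreamingTextCorpus` are hypothetical names invented for this illustration; they are not gensim classes, and the one-document-per-line input file is an assumption.

```python
import os
import tempfile
from collections import Counter


class TinyDictionary:
    """Hypothetical stand-in for a word -> integer id mapping."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens):
        # Assign the next free id to each unseen token, then count occurrences.
        for token in tokens:
            self.token2id.setdefault(token, len(self.token2id))
        counts = Counter(self.token2id[t] for t in tokens)
        return sorted(counts.items())


class StreamingTextCorpus:
    """Hypothetical corpus: one document per line, read lazily from disk."""

    def __init__(self, fname, dictionary):
        self.fname = fname
        self.dictionary = dictionary

    def __iter__(self):
        # Only one line is held in memory at a time.
        with open(self.fname, encoding="utf-8") as handle:
            for line in handle:
                yield self.dictionary.doc2bow(line.lower().split())


# Write a two-document toy file so the sketch runs end to end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer interaction\ncomputer graphics computer\n")
    path = tmp.name

corpus = StreamingTextCorpus(path, TinyDictionary())
vectors = list(corpus)  # each vector is a list of (word_id, count) pairs
os.remove(path)
print(vectors)
```

Because the corpus object only implements `__iter__`, it can be iterated over repeatedly, and each pass re-reads the file instead of caching documents.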
Classes
BleiCorpus (fname[, fname_vocab]) |
Corpus in Blei’s LDA-C format. |
Dictionary ([documents, prune_at]) |
Dictionary encapsulates the mapping between normalized words and their integer ids. |
HashDictionary ([documents, id_range, ...]) |
HashDictionary encapsulates the mapping between normalized words and their integer ids, computing each id by hashing the word. |
IndexedCorpus (fname[, index_fname]) |
|
LowCorpus (fname[, id2word, line2words]) |
List_Of_Words corpus handles input in GibbsLda++ format. |
MalletCorpus (fname[, id2word, metadata]) |
Corpus in MALLET's one-instance-per-line format, as described at http://mallet.cs.umass.edu/import.php. |
MmCorpus (fname) |
Corpus in the Matrix Market format. |
ShardedCorpus (output_prefix, corpus[, dim, ...]) |
This corpus is designed for situations where you need to train a model on matrices, with a large number of iterations. |
SvmLightCorpus (fname[, store_labels]) |
Corpus in SVMlight format. |
TextCorpus ([input]) |
Helper class to simplify the pipeline of getting bag-of-words vectors (= a gensim corpus) from plain text. |
UciCorpus (fname[, fname_vocab]) |
Corpus in the UCI bag-of-words format. |
WikiCorpus (fname[, processes, lemmatize, ...]) |
Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus. |
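Several of the formats listed above are plain-text sparse encodings. As an illustration, the SVMlight format stores one document per line as `<target> <id>:<value> ...`, optionally followed by a `#` comment. The sketch below parses one such line in pure Python; `parse_svmlight_line` is a hypothetical helper written for this example, not part of gensim, and it keeps feature ids exactly as they appear in the file.

```python
def parse_svmlight_line(line):
    """Parse '<target> <id>:<value> ...' into (target, [(id, value), ...])."""
    # Drop an optional trailing comment introduced by '#'.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None  # blank or comment-only line
    target, *pairs = body.split()
    features = []
    for pair in pairs:
        key, value = pair.split(":", 1)
        if key == "qid":
            continue  # ranking query id, not a feature
        features.append((int(key), float(value)))
    return float(target), features


sample = "1 3:0.5 7:2 # first document"
label, vector = parse_svmlight_line(sample)
print(label, vector)
```

The resulting list of `(id, value)` pairs has the same sparse-vector shape that a bag-of-words corpus yields per document.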