gensim.corpora.BleiCorpus

class gensim.corpora.BleiCorpus(fname, fname_vocab=None)

Corpus in Blei’s LDA-C format.

The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.

Each document occupies one line, where N is the number of fieldId:fieldValue pairs that follow:

N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
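To make the line layout concrete, here is a minimal sketch of a parser for a single document line. It is not gensim's implementation; the helper name parse_blei_line is hypothetical, and the choice to read ids as integers and values as floats is an assumption made for illustration.

```python
def parse_blei_line(line):
    """Parse one LDA-C document line into a list of (field_id, value) pairs.

    Hypothetical helper for illustration only; not part of gensim.
    """
    parts = line.split()
    declared = int(parts[0])  # leading N: number of id:value pairs that follow
    pairs = [part.split(":", 1) for part in parts[1:]]
    if declared != len(pairs):
        raise ValueError(f"expected {declared} fields, found {len(pairs)}")
    # Assumption: ids are integers and values are numeric.
    return [(int(field_id), float(value)) for field_id, value in pairs]


doc = parse_blei_line("3 0:2 4:1 7:5")
print(doc)  # [(0, 2.0), (4, 1.0), (7, 5.0)]
```

Checking the declared count against the number of pairs actually present catches truncated or hand-edited lines early.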

The vocabulary file lists one word per line; the word on line K has the implicit id K.
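The implicit id scheme can be sketched in a few lines. This example assumes that line counting starts at zero, so the first word gets id 0; the helper name read_vocab is hypothetical, and an in-memory buffer stands in for a real vocabulary file.

```python
import io


def read_vocab(lines):
    """Map each line position to the word on that line.

    Hypothetical helper; assumes the first line corresponds to id 0.
    """
    return {word_id: line.strip() for word_id, line in enumerate(lines)}


# An in-memory stand-in for a vocabulary file with one word per line.
vocab_file = io.StringIO("apple\nbanana\ncherry\n")
id2word = read_vocab(vocab_file)
print(id2word[1])  # banana
```

Because the ids are never written out, reordering or deleting lines in the vocabulary file silently changes the meaning of every document line that refers to them.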

Methods

__init__(fname[, fname_vocab]) Initialize the corpus from a file.
docbyoffset(offset) Return the document stored at file position offset.
line2doc(line)
load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs)
save_corpus(fname, corpus[, id2word, metadata]) Save a corpus in the LDA-C format.
serialize(serializer, fname, corpus[, ...]) Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document.