gensim.corpora.LowCorpus
class gensim.corpora.LowCorpus(fname, id2word=None, line2words=<function split_on_space>)
List_Of_Words corpus handles input in GibbsLda++ format.
Quoting http://gibbslda.sourceforge.net/#3.2_Input_Data_Format:
Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format as follows:

    [M]
    [document1]
    [document2]
    ...
    [documentM]

in which the first line is the total number for documents [M]. Each line after that is one document. [documenti] is the ith document of the dataset that consists of a list of Ni words/terms.

    [documenti] = [wordi1] [wordi2] ... [wordiNi]

in which all [wordij] (i=1..M, j=1..Ni) are text strings and they are separated by the blank character.
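To make the quoted layout concrete, here is a minimal sketch in plain Python that parses text in this format without using gensim. The helper name `parse_low` and the sample text are hypothetical and exist only for illustration; they are not part of gensim or GibbsLDA++.

```python
# Hypothetical sample: the first line is the document count [M],
# and each following line is one document of blank-separated words.
sample = (
    "3\n"
    "human machine interface\n"
    "graph minors trees\n"
    "graph trees\n"
)


def parse_low(text):
    """Parse List-of-words text into a list of token lists (illustrative helper)."""
    lines = text.splitlines()
    declared = int(lines[0].strip())  # [M], the declared number of documents
    docs = [line.split() for line in lines[1:declared + 1]]
    if len(docs) != declared:
        raise ValueError(
            "header declares %d documents, found %d" % (declared, len(docs))
        )
    return docs


docs = parse_low(sample)
print(docs)
```

The check against the header is a design choice of this sketch: it rejects a file whose first line promises more documents than the file contains.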
Methods
__init__(fname[, id2word, line2words]) | Initialize the corpus from a file.
docbyoffset(offset) | Return the document stored at file position offset.
line2doc(line) |
load(fname[, mmap]) | Load a previously saved object from file (also see save).
save(*args, **kwargs) |
save_corpus(fname, corpus[, id2word, metadata]) | Save a corpus in the List-of-words format.
serialize(serializer, fname, corpus[, ...]) | Iterate through the document stream corpus, saving the documents to fname and recording byte offset of each document.