gensim.corpora.MalletCorpus

class gensim.corpora.MalletCorpus(fname, id2word=None, metadata=False)

Quoting http://mallet.cs.umass.edu/import.php:

One file, one instance per line

Assume the data is in the following format:

[URL] [language] [text of the page...]

Or, more generally,
[document #1 id] [label] [text of the document...]
[document #2 id] [label] [text of the document...]
...
[document #N id] [label] [text of the document...]

Note that Gensim does not use the language/label field.
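To make the line layout concrete, here is a minimal pure-Python sketch of how one line in this format breaks down into an id, a label, and the document text. The helper name `parse_mallet_line` and the sample line are hypothetical, chosen for illustration only; this is not Gensim's own parsing code, and it assumes fields are separated by whitespace.

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, label, tokens).

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least '<id> <label>' on each line")
    # First field is the document id, second is the language/label,
    # everything after that is the text of the document.
    doc_id, label, tokens = fields[0], fields[1], fields[2:]
    return doc_id, label, tokens


# Made-up sample line: id "doc1", label "en", then the text.
sample = "doc1 en the quick brown fox"
doc_id, label, tokens = parse_mallet_line(sample)
```

In this sketch the label is parsed but then simply carried along; a consumer that, like Gensim, has no use for it can drop the second field and keep only the id and tokens.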

Methods

__init__(fname[, id2word, metadata])
docbyoffset(offset) Return the document stored at file position offset.
line2doc(line)
load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs)
save_corpus(fname, corpus[, id2word, metadata]) Save a corpus in the Mallet format.
serialize(serializer, fname, corpus[, ...]) Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document.
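
The byte-offset idea behind serialize and docbyoffset can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not Gensim's implementation: the function names `write_with_offsets` and `read_at_offset`, the file name, and the sample documents are all hypothetical.

```python
import os
import tempfile


def write_with_offsets(path, lines):
    """Write one document per line and return each line's starting byte offset."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            # Record where this document starts before writing it.
            offsets.append(fout.tell())
            fout.write(line.encode("utf-8") + b"\n")
    return offsets


def read_at_offset(path, offset):
    """Return the single line stored at the given byte offset."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").rstrip("\n")


# Made-up documents in "<id> <label> <text>" form.
docs = ["doc1 en hello world", "doc2 en mallet format example"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.mallet")
    offsets = write_with_offsets(path, docs)
    # Jump straight to the second document without reading the first.
    second = read_at_offset(path, offsets[1])
```

Storing offsets at write time is what makes random access to a line-oriented file cheap: a reader can seek directly to a document instead of scanning from the start.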

Attributes

id2word