gensim.interfaces.CorpusABC

class gensim.interfaces.CorpusABC

Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:

>>> for doc in corpus:
...     pass  # do something with the doc

A document is a sequence of (fieldId, fieldValue) 2-tuples:

>>> for attr_id, attr_value in doc:
...     pass  # do something with the attribute
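The two loops above can be combined into one small, self-contained sketch. ListCorpus is a hypothetical, duck-typed stand-in written for this illustration; it is not part of gensim, and a real implementation would normally subclass gensim.interfaces.CorpusABC. It only shows the shape of the data: the corpus yields documents, and each document is a sequence of (fieldId, fieldValue) 2-tuples.

```python
class ListCorpus:
    """Hypothetical in-memory corpus (illustration only, not a gensim class)."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        # Each iteration step yields one document.
        for doc in self.documents:
            yield doc


corpus = ListCorpus([
    [(0, 1.0), (2, 3.0)],  # document 0: field 0 -> 1.0, field 2 -> 3.0
    [(1, 2.0)],            # document 1: field 1 -> 2.0
])

# Sum the values seen for each field id across the whole corpus.
totals = {}
for doc in corpus:
    for attr_id, attr_value in doc:
        totals[attr_id] = totals.get(attr_id, 0.0) + attr_value
```

Because __iter__ starts a fresh pass on every call, the corpus can be iterated more than once, which multi-pass algorithms rely on.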

Note that although a default __len__() method is provided, it is very inefficient: it performs a linear scan through the corpus to determine its length. Wherever the corpus size is needed and is known in advance (or at least does not change, so that it can be cached), __len__() should be overridden.
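One way to follow this advice is to cache the length after the first scan. The sketch below assumes the corpus does not change between calls. CountingCorpus and its scans counter are hypothetical names introduced only to make the effect of the cache visible; they are not gensim API.

```python
class CountingCorpus:
    """Hypothetical corpus that caches its length (illustration only)."""

    def __init__(self, documents):
        self.documents = documents
        self.scans = 0       # number of full passes started over the data
        self._length = None  # cached document count, filled in lazily

    def __iter__(self):
        self.scans += 1
        yield from self.documents

    def __len__(self):
        # Pay for the linear scan once, then reuse the cached result.
        if self._length is None:
            self._length = sum(1 for _ in self)
        return self._length


corpus = CountingCorpus([[(0, 1.0)], [(1, 2.0)], [(0, 0.5), (1, 1.5)]])
first = len(corpus)   # triggers one scan
second = len(corpus)  # served from the cache
```

If the size is known up front (for example, it is stored in a file header), __len__ can return it directly and skip the scan altogether.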

See the gensim.corpora.svmlightcorpus module for an example of a corpus.

Saving the corpus with the save method (inherited from utils.SaveLoad) stores only the in-memory (binary, pickled) object representation, that is, the stream state, and not the documents themselves. To serialize the actual stream content, see the save_corpus static method.
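The difference between stream state and stream content can be illustrated without gensim. In the sketch below, FileCorpus and write_documents are hypothetical stand-ins, and the "id:value" line format is an assumption made for this example. vars(corpus) shows the instance attributes, which is what default pickling of a plain Python object captures, while write_documents plays the role of a content serializer such as save_corpus.

```python
import os
import tempfile


class FileCorpus:
    """Hypothetical file-backed corpus (illustration only, not a gensim class)."""

    def __init__(self, fname):
        self.fname = fname  # the only in-memory state

    def __iter__(self):
        # One document per line, written as space-separated "id:value" pairs.
        with open(self.fname) as fin:
            for line in fin:
                pairs = (pair.split(":") for pair in line.split())
                yield [(int(i), float(v)) for i, v in pairs]


def write_documents(fname, corpus):
    """Hypothetical content serializer, standing in for save_corpus."""
    with open(fname, "w") as fout:
        for doc in corpus:
            fout.write(" ".join(f"{i}:{v}" for i, v in doc) + "\n")


tmpdir = tempfile.mkdtemp()
source_path = os.path.join(tmpdir, "docs.txt")
with open(source_path, "w") as fout:
    fout.write("0:1.0 2:3.0\n1:2.0\n")

corpus = FileCorpus(source_path)

# Stream state: only the instance attributes, with no documents in them.
stream_state = vars(corpus)

# Stream content: every document is written out and can be read back.
copy_path = os.path.join(tmpdir, "copy.txt")
write_documents(copy_path, corpus)
copied_docs = list(FileCorpus(copy_path))
```

A restored stream state is only usable while the file it points to still exists; the serialized content is an independent copy of the documents.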

Methods

load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs)
save_corpus(fname, corpus[, id2word, metadata]) Save an existing corpus to disk.