gensim.corpora.ShardedCorpus.__init__

ShardedCorpus.__init__(output_prefix, corpus, dim=None, shardsize=4096, overwrite=False, sparse_serialization=False, sparse_retrieval=False, gensim=False)

Initializes the dataset. If output_prefix is not found, builds the shards.

Parameters:
  • output_prefix (str) –

    The absolute path to the file from which shard filenames should be derived. The individual shards will be saved as output_prefix.0, output_prefix.1, etc.

    The output_prefix path then works as the filename to which the ShardedCorpus object itself will be automatically saved. Normally, gensim corpora do not do this, but ShardedCorpus needs to remember several serialization settings: namely the shard size and whether it was serialized in dense or sparse format. By saving automatically, any new ShardedCorpus with the same output_prefix will be able to find the information about the data serialized with the given prefix.

    If you want to overwrite your data serialized with some output prefix, set the overwrite flag to True.

    Of course, you can save your corpus separately as well using the save() method.

  • corpus (gensim.interfaces.CorpusABC) – The source corpus from which to build the dataset.
  • dim (int) – Specify beforehand what the dimension of a dataset item should be. This is useful when initializing from a corpus that doesn’t advertise its dimension, or when it does and you want to check that the corpus matches the expected dimension. If `dim` is left unused and `corpus` does not provide its dimension in an expected manner, initialization will fail.
  • shardsize (int) – How many data points should be in one shard. More data per shard means less shard reloading but higher memory usage and vice versa.
  • overwrite (bool) – If set, builds the dataset from the given corpus even if data serialized under output_prefix already exists.
  • sparse_serialization (bool) –

    If set, will save the data in a sparse form (as scipy.sparse CSR matrices). This speeds up retrieval when you know you will be using sparse matrices.

    Note: This property **should not change** during the lifetime of the dataset. If you find out you need to change from a sparse to a dense representation, the best practice is to create another ShardedCorpus object.
    
  • sparse_retrieval (bool) –

    If set, will retrieve data as sparse vectors (scipy.sparse CSR matrices). If unset, will return NumPy ndarrays.

    Note that retrieval speed for this option depends on how the dataset was serialized. If sparse_serialization was set, then setting sparse_retrieval will be faster. However, if the two settings do not correspond, the conversion on the fly will slow the dataset down.

  • gensim (bool) – If set, will convert the output to gensim sparse vectors (list of tuples (id, value)) to make it behave like any other gensim corpus. This will slow the dataset down.
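
The parameter descriptions above can be made concrete with a small sketch. It does not call gensim and is not the library's implementation. It only illustrates two points from the text: how shard filenames follow from output_prefix and shardsize, and what a gensim sparse vector (a list of (id, value) tuples) looks like. The helper names shard_filenames and dense_to_gensim, and the path /tmp/my_corpus, are hypothetical. The sketch also assumes that the last shard may be only partially filled.

```python
import math


def shard_filenames(output_prefix, n_docs, shardsize=4096):
    """Hypothetical helper: list the shard files expected for n_docs items.

    Assumes shards are numbered consecutively from 0 and that a final,
    partially filled shard still gets its own file.
    """
    n_shards = math.ceil(n_docs / shardsize)
    return [f"{output_prefix}.{i}" for i in range(n_shards)]


def dense_to_gensim(row):
    """Hypothetical helper: turn a dense row into a list of (id, value)
    tuples, dropping zero entries, as in the gensim sparse vector format."""
    return [(i, value) for i, value in enumerate(row) if value != 0]


# 10000 items at the default shardsize of 4096 need three shards.
print(shard_filenames("/tmp/my_corpus", n_docs=10000))
# → ['/tmp/my_corpus.0', '/tmp/my_corpus.1', '/tmp/my_corpus.2']

print(dense_to_gensim([0.0, 1.5, 0.0, 2.0]))
# → [(1, 1.5), (3, 2.0)]
```

A larger shardsize shortens the first list (fewer files, fewer reloads) at the cost of holding more items in memory per loaded shard, which is the trade-off described for the shardsize parameter.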