gensim.utils.chunkize()

gensim.utils.chunkize(corpus, chunksize, maxsize=0, as_numpy=False)

Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one, which may be smaller. A once-only input stream (e.g. a corpus produced by a generator) is fine; chunking is done efficiently via itertools.

If maxsize > 1, don't wait idly between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate process, and is meant to reduce I/O delays, which can be significant when the corpus comes from a slow medium such as a hard disk.
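
For illustration, a usage sketch of the prefetching mode is shown below. The slow_corpus generator is a made-up stand-in for a disk-bound stream, and the sleep only simulates I/O latency; note that since prefetching starts a background process, gensim falls back to serial chunking on platforms where this is unavailable (e.g. Windows).

>>> import time
>>> def slow_corpus():  # hypothetical stand-in for a corpus read from slow storage
...     for doc in range(8):
...         time.sleep(0.1)  # simulate I/O latency per document
...         yield doc
>>> for chunk in chunkize(slow_corpus(), chunksize=4, maxsize=2):
...     print(chunk)  # up to 2 chunks are prepared in advance while this loop runs
[0, 1, 2, 3]
[4, 5, 6, 7]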

If maxsize == 0, don't bother with parallelism and simply yield the chunks serially via chunkize_serial() (no I/O optimizations).
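
The serial path is essentially a loop over itertools.islice. The sketch below is a simplified stand-in for chunkize_serial() (assumed behavior; the real function also supports as_numpy, which is omitted here):

>>> import itertools
>>> def chunkize_serial_sketch(iterable, chunksize):  # hypothetical helper, for illustration
...     it = iter(iterable)
...     while True:
...         chunk = list(itertools.islice(it, chunksize))  # take up to chunksize items
...         if not chunk:  # stream exhausted
...             return
...         yield chunk
>>> list(chunkize_serial_sketch(range(5), 2))
[[0, 1], [2, 3], [4]]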

>>> from gensim.utils import chunkize
>>> for chunk in chunkize(range(10), 4): print(chunk)
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]