Similarity.__init__(output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')

Construct the index from corpus. The index can later be extended by calling the add_documents method. Note: documents are split (internally, transparently) into shards of shardsize documents each, and each shard is converted to a matrix for faster BLAS calls. Each shard is stored to disk under output_prefix.shard_number, so you need write access to that location. If you don’t specify an output prefix, a random filename in temp will be used.

shardsize should be chosen so that a shardsize x chunksize matrix of floats fits comfortably into main memory.
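As a quick sanity check on the defaults, here is the memory footprint of one shardsize x chunksize block (a sketch; the 4 bytes per value assumes single-precision floats, which is an assumption about the internal dtype):

```python
# Rough memory estimate for one shardsize x chunksize block of floats.
# Assumes 4-byte (single-precision) values -- an assumption, not a guarantee.
shardsize = 32768   # default
chunksize = 256     # default
bytes_per_float = 4

block_bytes = shardsize * chunksize * bytes_per_float
print(block_bytes / 2**20)  # 32.0 -> ~32 MiB with the defaults
```

With the defaults, one block is about 32 MiB, which fits comfortably into main memory on modern machines; shrink shardsize (or chunksize) if yours is tighter.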

num_features is the number of features in the corpus (e.g. size of the dictionary, or the number of latent topics for latent semantic models).

norm is the vector normalization to apply. Accepted values are ‘l1’ and ‘l2’.
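To make the two options concrete, here is a minimal sketch (plain Python, independent of gensim; the helper name normalize is made up for illustration) of what ‘l1’ and ‘l2’ normalization do to a dense vector:

```python
import math

def normalize(vec, norm='l2'):
    """Scale vec so its L1 or L2 norm becomes 1.0 (zero vectors are returned unchanged).
    Illustrative helper only -- not part of the gensim API."""
    if norm == 'l1':
        length = sum(abs(x) for x in vec)
    elif norm == 'l2':
        length = math.sqrt(sum(x * x for x in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0.0:
        return list(vec)
    return [x / length for x in vec]

print(normalize([3.0, 4.0], norm='l2'))  # [0.6, 0.8]
print(normalize([3.0, 4.0], norm='l1'))  # [3/7, 4/7], i.e. roughly [0.4286, 0.5714]
```

Under ‘l2’ the vector is scaled to unit Euclidean length (so dot products become cosine similarities); under ‘l1’ it is scaled so its absolute values sum to one.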

If num_best is left unspecified, similarity queries will return a full vector with one float for every document in the index:

>>> index = Similarity('/path/to/index', corpus, num_features=400) # if corpus has 7 documents...
>>> index[query] # ... then result will have 7 floats
[0.0, 0.0, 0.2, 0.13, 0.8, 0.0, 0.1]

If num_best is set, queries return only the num_best most similar documents, always leaving out documents for which the similarity is 0. If the input vector itself only has features with zero values (i.e. the sparse representation is empty), the returned list will always be empty.

>>> index.num_best = 3
>>> index[query] # return at most "num_best" of `(index_of_document, similarity)` tuples
[(4, 0.8), (2, 0.2), (3, 0.13)]
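The selection rule above can be sketched in plain Python (independent of gensim; the helper name top_n is made up for illustration): keep the num_best largest similarities, dropping exact zeros:

```python
import heapq

def top_n(sims, num_best):
    """Return at most num_best (doc_index, similarity) pairs, highest first,
    skipping documents whose similarity is exactly 0. Illustrative helper
    only -- mirrors the behaviour described above, not the gensim internals."""
    nonzero = [(i, s) for i, s in enumerate(sims) if s != 0.0]
    return heapq.nlargest(num_best, nonzero, key=lambda pair: pair[1])

sims = [0.0, 0.0, 0.2, 0.13, 0.8, 0.0, 0.1]
print(top_n(sims, 3))  # [(4, 0.8), (2, 0.2), (3, 0.13)]
print(top_n([0.0, 0.0], 3))  # [] -- an all-zero query yields an empty list
```

Note that fewer than num_best pairs may come back when the index contains fewer than num_best documents with nonzero similarity.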

You can also override num_best dynamically, simply by setting e.g. index.num_best = 10 before doing a query.