gensim.similarities.Similarity.__init__

Similarity.__init__(output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')

Construct the index from corpus. The index can be later extended by calling the add_documents method.

Note: documents are split (internally, transparently) into shards of shardsize documents each, and each shard is converted to a matrix for faster BLAS calls. Each shard is stored to disk under output_prefix.shard_number (you need write access to that location). If you don't specify an output prefix, a random filename in the system's temp directory will be used.
shardsize should be chosen so that a shardsize x chunksize matrix of floats fits comfortably into main memory.
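As a rough sizing aid, the memory footprint of one such chunk can be estimated by multiplying the two dimensions by the float size. The sketch below assumes 4-byte (float32) entries, which is a guess at the internal dtype rather than a documented guarantee:

```python
# Back-of-envelope check that a shardsize x chunksize block of floats
# fits comfortably in RAM (assumes 4-byte float32 entries).
def chunk_bytes(shardsize=32768, chunksize=256, bytes_per_float=4):
    """Size in bytes of one shardsize x chunksize matrix of floats."""
    return shardsize * chunksize * bytes_per_float

# With the defaults, one block is 32768 * 256 * 4 bytes = 32 MiB.
print(chunk_bytes() // 2**20)  # MiB
```

With the default parameters this comes to 32 MiB, so the defaults are comfortable on any modern machine; scale shardsize down if you raise chunksize substantially.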
num_features is the number of features in the corpus (e.g. size of the dictionary, or the number of latent topics for latent semantic models).
norm is the user-chosen normalization to use. Accepted values are: ‘l1’ and ‘l2’.
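To make the two options concrete, here is a minimal sketch of what 'l1' and 'l2' normalization mean for a plain dense vector (illustrative only; it is not gensim's implementation, which operates on sparse vectors):

```python
import math

def normalize(vec, norm='l2'):
    """Scale vec so its l1 (sum of |x|) or l2 (Euclidean) length is 1."""
    if norm == 'l1':
        length = sum(abs(x) for x in vec)
    elif norm == 'l2':
        length = math.sqrt(sum(x * x for x in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return [x / length for x in vec] if length else list(vec)

print(normalize([3.0, 4.0]))        # l2: [0.6, 0.8]
print(normalize([3.0, 4.0], 'l1'))  # l1: entries sum to 1
```

With 'l2' (the default), query results are cosine similarities; 'l1' instead scales each document vector so its entries sum to one.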
If num_best is left unspecified, similarity queries will return a full vector with one float for every document in the index:
>>> index = Similarity('/path/to/index', corpus, num_features=400)  # if corpus has 7 documents...
>>> index[query]  # ... then result will have 7 floats
[0.0, 0.0, 0.2, 0.13, 0.8, 0.0, 0.1]
If num_best is set, queries return only the num_best most similar documents, always leaving out documents for which the similarity is 0. If the input vector itself only has features with zero values (=the sparse representation is empty), the returned list will always be empty.
>>> index.num_best = 3
>>> index[query]  # return at most "num_best" of `(index_of_document, similarity)` tuples
[(4, 0.8), (2, 0.2), (3, 0.13)]
You can also override num_best dynamically, simply by setting e.g. index.num_best = 10 before doing a query.
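The num_best query semantics can be summarized as: drop zero similarities, sort the rest in descending order, and truncate to num_best. The following plain-Python sketch reproduces that behavior on the full similarity vector from the first example (it mirrors the described semantics, not gensim's internal code):

```python
def top_n(sims, num_best):
    """Return up to num_best (index, similarity) pairs, best first,
    skipping documents whose similarity is exactly 0."""
    ranked = sorted(
        ((i, s) for i, s in enumerate(sims) if s != 0.0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:num_best]

full = [0.0, 0.0, 0.2, 0.13, 0.8, 0.0, 0.1]
print(top_n(full, 3))  # [(4, 0.8), (2, 0.2), (3, 0.13)]
```

Note that an empty query vector yields an empty list here for any num_best, matching the behavior described above for queries whose sparse representation is empty.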