gensim.models.Word2Vec.accuracy¶

Word2Vec.accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>, case_insensitive=True)[source]¶

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by ”: SECTION NAME” lines. See questions-words.txt in https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip for an example.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word not in the first restrict_vocab words (default 30,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. In case case_insensitive is True, the first restrict_vocab words are taken first, and then case normalization is performed.

Use case_insensitive to convert all words in questions and vocab to their uppercase form before evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens and question words. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

This method corresponds to the compute-accuracy script of the original C word2vec.