gensim.models.TfidfModel.__init__
TfidfModel.__init__(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True)
Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:
weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})
or, more generally:
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
so you can plug in your own custom wlocal and wglobal functions.
Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
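The general formula can be sketched in plain Python, without gensim, to show where wlocal and wglobal plug in. This is only an illustration of the arithmetic, not the library's implementation: the helper tfidf_weights, the local identity and df2idf functions, and the toy corpus are all assumptions made for this example.

```python
import math


def identity(x):
    """Default local weight: use the raw term frequency."""
    return x


def df2idf(doc_freq, total_docs):
    """Default global weight: log_2(total_docs / doc_freq)."""
    return math.log(total_docs / doc_freq, 2)


def tfidf_weights(corpus, wlocal=identity, wglobal=df2idf):
    """Return unnormalized tf-idf vectors (hypothetical helper).

    corpus is a list of documents, each a list of (term_id, frequency)
    pairs in which every term_id appears at most once per document.
    """
    total_docs = len(corpus)
    # Count the number of documents each term occurs in.
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    # Global component, computed once per term.
    idf = {t: wglobal(df, total_docs) for t, df in doc_freq.items()}
    # Local component times global component, per document.
    return [[(t, wlocal(f) * idf[t]) for t, f in doc] for doc in corpus]


# Toy bag-of-words corpus of four documents (made up for illustration).
corpus = [[(0, 1), (1, 2)], [(0, 1)], [(1, 1), (2, 4)], [(0, 3)]]
weights = tfidf_weights(corpus)
sqrt_weights = tfidf_weights(corpus, wlocal=math.sqrt)
```

Swapping wlocal for math.sqrt dampens high raw frequencies: term 2 occurs four times in the third document, so its local component drops from 4 to 2 while its global component stays the same.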
normalize controls how the final transformed vectors are normalized. normalize=True (the default) scales each vector to unit length; normalize=False leaves the vectors unnormalized. You can also set normalize to your own function that accepts and returns a sparse vector.
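The three settings of normalize can be illustrated with a small standalone sketch. It assumes that "unit length" means Euclidean (L2) length and that a sparse vector is a list of (term_id, weight) pairs; apply_normalize and l1_normalize are hypothetical names used only for this example.

```python
import math


def unit_length(vec):
    """Scale a sparse vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(w * w for _, w in vec))
    if norm == 0.0:
        return list(vec)  # an all-zero vector cannot be scaled
    return [(t, w / norm) for t, w in vec]


def l1_normalize(vec):
    """Hypothetical custom normalizer: weights sum to 1 in absolute value."""
    total = sum(abs(w) for _, w in vec)
    if total == 0.0:
        return list(vec)
    return [(t, w / total) for t, w in vec]


def apply_normalize(vec, normalize=True):
    """Dispatch on the three kinds of value described for normalize."""
    if normalize is True:
        return unit_length(vec)
    if normalize is False:
        return list(vec)
    # Otherwise treat normalize as a user-supplied function.
    return normalize(vec)


vec = [(0, 3.0), (1, 4.0)]
as_unit = apply_normalize(vec)                      # normalize=True
as_raw = apply_normalize(vec, normalize=False)      # left unchanged
as_l1 = apply_normalize(vec, normalize=l1_normalize)
```

A custom function receives the weighted sparse vector and must return one in the same format, as l1_normalize does here.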
If dictionary is specified, it must be a corpora.Dictionary object and it will be used to directly construct the inverse document frequency mapping; in that case corpus, if specified, is ignored.
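The precedence of dictionary over corpus can be sketched as follows. This is a simplified illustration rather than the library's code: build_idfs is a hypothetical helper, and the stand-in object only mimics the two pieces of information assumed to be read from a corpora.Dictionary, namely per-term document frequencies (dfs) and the document count (num_docs).

```python
import math
from types import SimpleNamespace


def build_idfs(corpus=None, dictionary=None):
    """Build a term_id -> idf mapping (hypothetical helper)."""
    if dictionary is not None:
        # Document frequencies come straight from the dictionary;
        # corpus is deliberately ignored in this branch.
        dfs, num_docs = dictionary.dfs, dictionary.num_docs
    elif corpus is not None:
        # Otherwise count document frequencies by scanning the corpus.
        dfs, num_docs = {}, 0
        for doc in corpus:
            num_docs += 1
            for term_id, _ in doc:
                dfs[term_id] = dfs.get(term_id, 0) + 1
    else:
        return {}
    return {t: math.log(num_docs / df, 2) for t, df in dfs.items()}


# Stand-in for a dictionary built from a four-document collection.
fake_dictionary = SimpleNamespace(dfs={0: 2, 1: 1}, num_docs=4)
# A different, two-document corpus passed alongside it.
other_corpus = [[(0, 1)], [(0, 1), (1, 1)]]

from_dictionary = build_idfs(corpus=other_corpus, dictionary=fake_dictionary)
from_corpus = build_idfs(corpus=other_corpus)
```

The two results differ because from_dictionary reflects the statistics stored in the dictionary stand-in, while from_corpus is counted from other_corpus alone.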