TfidfModel.__init__(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True)[source]

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

or, more generally:

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.

Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.

normalize dictates how the final transformed vectors will be normalized. normalize=True means set to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.

If dictionary is specified, it must be a corpora.Dictionary object and it will be used to directly construct the inverse document frequency mapping (then corpus, if specified, is ignored).