gensim.models.LdaModel.__init__

LdaModel.__init__(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf={})[source]

If corpus is given, training starts on that iterable of documents straight away. If it is not given, the model is left untrained (presumably because you want to call update() manually).
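
For example, a minimal sketch of deferred training (assuming an id2word dictionary and a bag-of-words corpus have been built elsewhere):

>>> lda = LdaModel(num_topics=10, id2word=id2word)  # no corpus passed: model is left untrained
>>> lda.update(corpus)  # train explicitly later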

num_topics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
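
For example, an id2word mapping is typically built with gensim.corpora.Dictionary from tokenized texts (the toy texts below are purely illustrative):

>>> from gensim.corpora import Dictionary
>>> texts = [['human', 'machine', 'interface'], ['graph', 'trees', 'graph']]
>>> id2word = Dictionary(texts)
>>> corpus = [id2word.doc2bow(text) for text in texts]  # bag-of-words corpus
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=2)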

alpha and eta are hyperparameters that affect sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.

alpha can be set to an explicit array, i.e. a prior of your choice. It also supports the special values ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
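
For example, a sketch of the different alpha settings (assuming corpus and id2word as above; the prior values are arbitrary):

>>> import numpy as np
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=4, alpha=np.array([0.1, 0.1, 0.5, 0.3]))  # explicit prior, one value per topic
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=4, alpha='asymmetric')  # fixed normalized 1.0/topicno prior
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=4, alpha='auto')  # prior learned from the data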

eta can be a scalar for a symmetric prior over topic/word distributions, or a matrix of shape num_topics x num_words, which can be used to impose asymmetric priors over the word distribution on a per-topic basis. This may be useful if you want to seed certain topics with particular words by boosting the priors for those words. It also supports the special value ‘auto’, which learns an asymmetric prior directly from your data.
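
For example, a sketch of seeding one topic with a particular word by boosting its prior (the seed word 'graph' and the boost factor are arbitrary choices for illustration):

>>> import numpy as np
>>> num_topics, num_words = 4, len(id2word)
>>> eta = np.full((num_topics, num_words), 1.0 / num_words)  # start from a flat prior
>>> eta[0, id2word.token2id['graph']] *= 100  # boost 'graph' in topic 0
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=num_topics, eta=eta)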

Turn on distributed to force distributed computing (see the web tutorial on how to set up a cluster of machines for gensim).

Calculate and log a perplexity estimate from the latest mini-batch every eval_every model updates (setting this to 1 slows down training by ~2x; the default of 10 gives better performance). Set to None to disable perplexity estimation.
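
For example (illustrative settings only):

>>> lda = LdaModel(corpus, num_topics=10, eval_every=None)  # no perplexity estimation
>>> lda = LdaModel(corpus, num_topics=10, eval_every=1)  # log after every update, ~2x slower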

The decay and offset parameters are the same as kappa and tau_0 in Hoffman et al., respectively.
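
For example, passing illustrative values (not recommendations) for these learning-rate parameters:

>>> lda = LdaModel(corpus, num_topics=10, decay=0.7, offset=64.0)  # kappa=0.7, tau_0=64 in Hoffman et al.'s notation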

minimum_probability controls which topics are returned for a document (bow): topics with an assigned probability below this threshold are filtered out.
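
For example, with a higher threshold only relatively prominent topics come back for a document (the value 0.05 is arbitrary):

>>> lda = LdaModel(corpus, num_topics=10, minimum_probability=0.05)
>>> lda[corpus[0]]  # topics with probability < 0.05 are dropped from the result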

random_state can be a numpy.random.RandomState object, or a seed used to generate one.
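
For example, fixing the seed for reproducible runs:

>>> import numpy as np
>>> lda = LdaModel(corpus, num_topics=10, random_state=42)  # seed
>>> lda = LdaModel(corpus, num_topics=10, random_state=np.random.RandomState(42))  # equivalent, explicit RandomState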

Example:

>>> from gensim.models import LdaModel
>>> lda = LdaModel(corpus, num_topics=100)  # train the model on a corpus
>>> print(lda[doc_bow])  # get the topic probability distribution for a document
>>> lda.update(corpus2)  # update the LDA model with additional documents
>>> print(lda[doc_bow])  # the distribution may shift after the update
>>> lda = LdaModel(corpus, num_topics=50, alpha='auto', eval_every=5)  # learn an asymmetric alpha from data