LdaMulticore.__init__(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)[source]

If corpus is given, start training from that iterable of documents straight away. If not given, the model is left untrained (presumably because you want to call update() manually).

num_topics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

workers is the number of extra processes to use for parallelization. Uses all available cores by default: workers=cpu_count()-1. Note: for hyper-threaded CPUs, cpu_count() returns the number of logical cores, which overcounts – for optimal performance, set workers directly to the number of your real cores (not hyperthreads) minus one.

If batch is not set, perform online training: update the model once every workers * chunksize documents. Otherwise, run batch LDA, updating the model only once at the end of each full corpus pass.
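As a rough sketch of the difference (the document counts below are made up for illustration), online mode triggers one model update per workers * chunksize documents, while batch mode updates exactly once per pass:

```python
import math

# Illustrative numbers, not library defaults.
num_docs = 100000
workers = 3
chunksize = 2000

# Online mode: one model update per workers * chunksize documents.
online_updates_per_pass = math.ceil(num_docs / (workers * chunksize))  # 17

# Batch mode: a single update at the end of each full corpus pass.
batch_updates_per_pass = 1
```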

alpha and eta are hyperparameters that affect sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.

alpha can be set to an explicit array = prior of your choice. It also supports the special values 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
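A minimal sketch of the fixed normalized 1.0/topicno prior described above (1-based topic numbering is an assumption for this illustration); an explicit array built this way could equally be passed as alpha:

```python
import numpy as np

num_topics = 5

# Weight each topic by 1.0 / topic number, then normalize so the prior sums to 1.
raw = np.array([1.0 / k for k in range(1, num_topics + 1)])
alpha = raw / raw.sum()
# alpha is a decreasing sequence: earlier topics get more prior mass.
```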

eta can be a scalar for a symmetric prior over topic/word distributions, or a matrix of shape num_topics x num_words, which can be used to impose asymmetric priors over the word distribution on a per-topic basis. This may be useful if you want to seed certain topics with particular words by boosting the priors for those words.
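The seeding idea can be sketched with a plain NumPy prior matrix; the toy vocabulary, seed words, and boost factor below are invented for illustration, and the resulting array would be passed as eta:

```python
import numpy as np

num_topics = 3
id2word = {0: 'economy', 1: 'market', 2: 'goal', 3: 'match'}  # toy vocabulary
num_words = len(id2word)

# Start from a symmetric prior over every topic/word pair...
eta = np.full((num_topics, num_words), 1.0 / num_topics)

# ...then boost the prior of chosen seed words in chosen topics.
# Hypothetical seeds: topic 0 leans "finance", topic 1 leans "sports".
seed_words = {0: ['economy', 'market'], 1: ['goal', 'match']}
word2id = {w: i for i, w in id2word.items()}
boost = 10.0  # arbitrary boost factor for this sketch

for topic, words in seed_words.items():
    for w in words:
        eta[topic, word2id[w]] *= boost
```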

Calculate and log perplexity estimate from the latest mini-batch once every eval_every documents. Set to None to disable perplexity estimation (faster), or to 0 to only evaluate perplexity once, at the end of each corpus pass.

The decay and offset parameters are the same as Kappa and Tau_0 in Hoffman et al., respectively.
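In Hoffman et al.'s online LDA, Kappa and Tau_0 enter through the step size rho_t = (Tau_0 + t)^(-Kappa) at update number t; a quick sketch of how decay and offset shape that schedule:

```python
# Step size from Hoffman et al.: rho_t = (offset + t) ** (-decay).
decay = 0.5   # Kappa: how quickly older mini-batches are down-weighted
offset = 1.0  # Tau_0: slows down the influence of the earliest updates

rho = [(offset + t) ** (-decay) for t in range(5)]
# rho decreases with t, so later mini-batches get smaller update weight.
```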

random_state can be a numpy.random.RandomState object, or the seed for one.


>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=100)  # train model
>>> print(lda[doc_bow]) # get topic probability distribution for a document
>>> lda.update(corpus2) # update the LDA model with additional documents
>>> print(lda[doc_bow])