gensim.corpora.Dictionary.doc2bow

Dictionary.doc2bow(document, allow_update=False, return_missing=False)[source]

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.