class gensim.corpora.Dictionary(documents=None, prune_at=2000000)

Dictionary encapsulates the mapping between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
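The mapping and the bag-of-words conversion described above can be illustrated with a small standalone sketch. It uses only the Python standard library and is not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and the order in which ids are assigned here (first appearance) is an assumption that may differ from what Dictionary actually does.

```python
from collections import Counter


def build_token2id(documents):
    """Assign a new integer id to each previously unseen token (hypothetical helper)."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(document, token2id):
    """Return sorted (token_id, token_count) 2-tuples; unknown tokens are skipped."""
    counts = Counter(token for token in document if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


# Each document is a list of already-normalized words.
documents = [["human", "computer", "interface"], ["graph", "trees", "graph"]]
token2id = build_token2id(documents)

# "unseen" is not in the mapping, so it does not appear in the result.
bow = to_bow(["graph", "graph", "human", "unseen"], token2id)
print(bow)
```

With the real class, the equivalent steps would be constructing `Dictionary(documents)` and calling `doc2bow(document)` on a list of words.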


__init__([documents, prune_at]) If documents are given, use them to initialize Dictionary (see add_documents()).
add_documents(documents[, prune_at]) Update dictionary from a collection of documents.
compactify() Assign new word ids to all words.
doc2bow(document[, allow_update, return_missing]) Convert document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples.
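filter_extremes([no_below, no_above, keep_n]) Filter out tokens that appear in fewer than no_below documents or in more than no_above of the documents (a fraction of the corpus), then keep only the keep_n most frequent of the remaining tokens.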
filter_n_most_frequent(remove_n) Filter out the ‘remove_n’ most frequent tokens that appear in the documents.
filter_tokens([bad_ids, good_ids]) Remove the selected bad_ids tokens from all dictionary mappings, or keep the selected good_ids in the mapping and remove the rest.
from_corpus(corpus[, id2word]) Create Dictionary from an existing corpus.
get(k[, d]) Return D[k] if k in D, ...
items() Return a list of D’s (key, value) pairs, ...
iteritems() Return an iterator over the (key, ...
iterkeys() Return an iterator over the keys of D.
keys() Return a list of all token ids.
load(fname[, mmap]) Load a previously saved object from file (also see save).
load_from_text(fname) Load a previously stored Dictionary from a text file.
merge_with(other) Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids.
save(fname_or_handle[, separately, ...]) Save the object to file (also see load).
save_as_text(fname[, sort_by_word]) Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].
values() Return a list of D’s values.
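As a rough illustration of how document-frequency filtering followed by id compaction can work, here is a standalone sketch in plain Python. It is not gensim's code: the function `filter_extremes_sketch` and its data layout (a token-to-id dict plus an id-to-document-frequency dict) are hypothetical, and the semantics assumed for `no_below` (an absolute document count), `no_above` (a fraction of all documents) and `keep_n` (a cap on surviving tokens) should be checked against the library's own documentation.

```python
def filter_extremes_sketch(token2id, doc_freqs, num_docs, no_below, no_above, keep_n):
    """Drop rare and very common tokens, then reassign gap-free ids.

    token2id:  token -> integer id
    doc_freqs: integer id -> number of documents containing that token
    Assumption: no_below is an absolute count, no_above a fraction of num_docs.
    """
    # Keep ids whose document frequency lies inside the allowed band.
    keep = [
        token_id
        for token_id in token2id.values()
        if no_below <= doc_freqs.get(token_id, 0) <= no_above * num_docs
    ]

    # Of the survivors, retain only the keep_n most frequent (None keeps all).
    keep.sort(key=lambda token_id: doc_freqs.get(token_id, 0), reverse=True)
    if keep_n is not None:
        keep = keep[:keep_n]
    keep = set(keep)

    # Compaction step: renumber the remaining tokens 0..n-1 in old-id order.
    survivors = sorted(
        (token_id, token) for token, token_id in token2id.items() if token_id in keep
    )
    return {token: new_id for new_id, (_, token) in enumerate(survivors)}


token2id = {"the": 0, "graph": 1, "trees": 2, "rare": 3}
doc_freqs = {0: 10, 1: 4, 2: 3, 3: 1}

# "the" appears in every document and "rare" in only one, so both are dropped.
filtered = filter_extremes_sketch(
    token2id, doc_freqs, num_docs=10, no_below=2, no_above=0.5, keep_n=2
)
print(filtered)
```

Renumbering after filtering mirrors the purpose of compactify(): once tokens are removed, the remaining ids would otherwise contain gaps.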