gensim.corpora.Dictionary
class gensim.corpora.Dictionary(documents=None, prune_at=2000000)
Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
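The bag-of-words format described above can be illustrated without the library. The following is a minimal pure-Python sketch, not gensim's implementation: the helper names build_token2id and to_bow are hypothetical, and the id assignment order (first seen) is an assumption made for the example.

```python
from collections import Counter


def build_token2id(documents):
    """Assign each distinct token an integer id, in first-seen order (an assumption)."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(document, token2id):
    """Return a sorted list of (token_id, token_count) 2-tuples.

    Tokens that are not in the mapping are skipped.
    """
    counts = Counter(token for token in document if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


documents = [
    ["human", "computer", "interface"],
    ["computer", "survey", "computer"],
]
token2id = build_token2id(documents)

# "unknown" is not in the mapping, so it does not appear in the result.
bow = to_bow(["computer", "human", "computer", "unknown"], token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Each 2-tuple pairs a word id with the number of times that word occurs in the document, which is the shape doc2bow is documented to return.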
Methods
__init__([documents, prune_at]) |
If documents are given, use them to initialize Dictionary (see add_documents()). |
add_documents(documents[, prune_at]) |
Update dictionary from a collection of documents. |
compactify() |
Assign new word ids to all words. |
doc2bow(document[, allow_update, return_missing]) |
Convert document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples. |
filter_extremes([no_below, no_above, keep_n]) |
Filter out tokens by document frequency (see the no_below, no_above and keep_n parameters). |
filter_n_most_frequent(remove_n) |
Filter out the ‘remove_n’ most frequent tokens that appear in the documents. |
filter_tokens([bad_ids, good_ids]) |
Remove the tokens in bad_ids from all dictionary mappings, or keep only the tokens in good_ids and remove the rest. |
from_corpus(corpus[, id2word]) |
Create Dictionary from an existing corpus. |
from_documents(documents) |
|
get(k[, d]) |
D[k] if k in D, ... |
items() |
List of D’s (key, value) pairs, ... |
iteritems() |
An iterator over the (key, ... |
iterkeys() |
An iterator over the keys of D. |
itervalues(...) |
|
keys() |
Return a list of all token ids. |
load(fname[, mmap]) |
Load a previously saved object from file (also see save). |
load_from_text(fname) |
Load a previously stored Dictionary from a text file. |
merge_with(other) |
Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids. |
save(fname_or_handle[, separately, ...]) |
Save the object to file (also see load). |
save_as_text(fname[, sort_by_word]) |
Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE]. |
values() |
List of D’s values. |
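The text format listed for save_as_text (id[TAB]word_utf8[TAB]document frequency[NEWLINE]) can be sketched with the standard library alone. This is an illustration of the line layout only, not the library's code: dump_as_text and parse_text are hypothetical helpers, the sample mappings are made up, and sorting lines by word is an assumption.

```python
import io


def dump_as_text(token2id, dfs, handle):
    """Write one 'id<TAB>word<TAB>document frequency' line per token, sorted by word."""
    for token, token_id in sorted(token2id.items()):
        handle.write(f"{token_id}\t{token}\t{dfs.get(token_id, 0)}\n")


def parse_text(handle):
    """Read lines in the same layout back into (token2id, dfs) mappings."""
    token2id, dfs = {}, {}
    for line in handle:
        token_id, token, doc_freq = line.rstrip("\n").split("\t")
        token2id[token] = int(token_id)
        dfs[int(token_id)] = int(doc_freq)
    return token2id, dfs


# Hypothetical sample data: word -> id, and id -> number of documents containing it.
token2id = {"human": 0, "computer": 1, "survey": 2}
dfs = {0: 1, 1: 2, 2: 1}

buffer = io.StringIO()
dump_as_text(token2id, dfs, buffer)
text = buffer.getvalue()

# Round trip: parsing the written text recovers the original mappings.
buffer.seek(0)
restored_token2id, restored_dfs = parse_text(buffer)
```

Because every line is plain tab-separated text, a file in this layout can be inspected or edited by hand before it is loaded again.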