gensim.corpora.HashDictionary.filter_extremes
HashDictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
Remove document frequency statistics for tokens that appear in:
1. fewer than no_below documents (absolute number), or
2. more than no_above documents (fraction of total corpus size, not absolute number).
3. After (1) and (2), keep only the keep_n most frequent tokens (or keep all if keep_n is None).
Note: since HashDictionary’s id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some supplementary statistics, for easier debugging and a smaller RAM footprint.
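The filtering rules described above can be illustrated with a small, dependency-free sketch. This is not gensim's implementation: the function name `filter_extremes_sketch`, the plain `dict` of token-to-document-frequency counts, the inclusive treatment of both thresholds, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
def filter_extremes_sketch(doc_freqs, num_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the {token: document frequency} entries that survive filtering.

    Hypothetical helper, not part of gensim. `doc_freqs` maps each token to
    the number of documents it appears in; `num_docs` is the corpus size.
    """
    # no_above is a fraction of the corpus, so convert it to a document count.
    max_df = no_above * num_docs

    # Rules (1) and (2): drop tokens that are too rare or too common.
    # Treating both bounds as inclusive is an assumption of this sketch.
    survivors = [
        (token, df) for token, df in doc_freqs.items()
        if no_below <= df <= max_df
    ]

    # Rule (3): keep only the keep_n most frequent survivors.
    # Ties are broken alphabetically here so the result is deterministic.
    survivors.sort(key=lambda item: (-item[1], item[0]))
    if keep_n is not None:
        survivors = survivors[:keep_n]

    return dict(survivors)


# Toy corpus of 10 documents with made-up document frequencies.
example_dfs = {"the": 9, "cat": 4, "dog": 5, "fish": 3, "rare": 1}
kept = filter_extremes_sketch(example_dfs, num_docs=10, no_below=2, no_above=0.5, keep_n=2)
print(kept)
```

With these toy numbers, "rare" falls below no_below, "the" exceeds the no_above fraction, and of the three remaining tokens only the two most frequent are kept. In HashDictionary itself, as the note says, only the supplementary statistics are cleared; the hashed id range is unaffected.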