gensim.corpora.WikiCorpus

class gensim.corpora.WikiCorpus(fname, processes=None, lemmatize=True, dictionary=None, filter_namespaces=('0',))

    Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
    The documents are extracted on the fly, so the whole (massive) dump can stay compressed on disk.
    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format plus file with id->word
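
    Once built, the corpus can also be consumed lazily, without serializing anything. A minimal sketch of streaming articles via get_texts() (the dump filename is a placeholder, and the constructor still scans the dump once to build the vocabulary):

    >>> from gensim.corpora import WikiCorpus
    >>> wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')  # placeholder filename
    >>> for tokens in wiki.get_texts():  # each article arrives as a list of tokens
    ...     print(tokens[:10])  # peek at the first ten tokens of the first article
    ...     break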
Methods

__init__(fname[, processes, lemmatize, ...])
    Initialize the corpus.

get_texts()
    Iterate over the dump, returning the text version of each article as a list of tokens.

getstream()

load(fname[, mmap])
    Load a previously saved object from file (also see save; a usage sketch follows this table).

save(*args, **kwargs)
    Save the object to file (also see load).

save_corpus(fname, corpus[, id2word, metadata])
    Save an existing corpus to disk.
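
save and load persist the corpus object itself (its parameters and word->word_id mapping), not the extracted article texts; to persist the documents, serialize through MmCorpus as in the example above. A minimal sketch of the round trip, with an illustrative filename:

>>> wiki.save('wiki_corpus.pkl')  # pickle the corpus object's state; filename is illustrative
>>> wiki2 = WikiCorpus.load('wiki_corpus.pkl')  # restore later without reconfiguring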