gensim.corpora.WikiCorpus

class gensim.corpora.WikiCorpus(fname, processes=None, lemmatize=True, dictionary=None, filter_namespaces=('0',))

Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.

The documents are extracted on the fly, so that the whole (massive) dump can stay compressed on disk.

>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
>>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format plus a file with the id->word mapping
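
The streamed articles can also be consumed directly in Python, without serializing anything to disk first. A minimal sketch, reusing the (assumed) dump filename from above; get_texts() yields each article as a list of tokens:

>>> from gensim.corpora import WikiCorpus
>>>
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # dump stays compressed on disk
>>> for tokens in wiki.get_texts():  # one article at a time, as a list of tokens
...     print(tokens[:10])  # first few tokens of the first article
...     break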

Methods

__init__(fname[, processes, lemmatize, ...]) Initialize the corpus.
get_texts() Iterate over the dump, returning text version of each article as a list of tokens.
getstream()
load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs) Save the object to file (also see load); usage sketched after this list.
save_corpus(fname, corpus[, id2word, metadata]) Save an existing corpus to disk.
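
Because WikiCorpus provides the standard save/load methods listed above, the (lightweight) corpus object itself can be persisted and restored. A minimal sketch; the filename is an assumption, and the original dump file must still be available when iterating after loading:

>>> wiki.save('wiki_corpus.saved')  # stores only the corpus object (vocabulary, settings), not the articles themselves
>>> wiki = WikiCorpus.load('wiki_corpus.saved')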