gensim.corpora.WikiCorpus

class gensim.corpora.WikiCorpus(fname, processes=None, lemmatize=True, dictionary=None, filter_namespaces=('0',))

Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.

The documents are extracted on the fly, so that the whole (massive) dump can stay compressed on disk.

>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
>>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format plus a file with the id->word mapping
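
The streamed articles can also be consumed directly in Python, without serializing anything to disk first. A minimal sketch, reusing the (assumed) dump filename from above; get_texts() yields each article as a list of tokens:

>>> from gensim.corpora import WikiCorpus
>>>
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # dump stays compressed on disk
>>> for tokens in wiki.get_texts():  # one article at a time, as a list of tokens
...     print(tokens[:10])  # first few tokens of the first article
...     break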

Methods

__init__(fname[, processes, lemmatize, ...]) Initialize the corpus.
get_texts() Iterate over the dump, returning text version of each article as a list of tokens.
getstream()
load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs) Save the object to file (also see load); usage sketched after this list.
save_corpus(fname, corpus[, id2word, metadata]) Save an existing corpus to disk.
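
Because WikiCorpus provides the standard save/load methods listed above, the (lightweight) corpus object itself can be persisted and restored. A minimal sketch; the filename is an assumption, and the original dump file must still be available when iterating after loading:

>>> wiki.save('wiki_corpus.saved')  # stores only the corpus object (vocabulary, settings), not the articles themselves
>>> wiki = WikiCorpus.load('wiki_corpus.saved')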