gensim.corpora.MalletCorpus
class gensim.corpora.MalletCorpus(fname, id2word=None, metadata=False)
Quoting http://mallet.cs.umass.edu/import.php:
One file, one instance per line. Assume the data is in the following format:
[URL] [language] [text of the page...]
Or, more generally,
[document #1 id] [label] [text of the document...]
[document #2 id] [label] [text of the document...]
...
[document #N id] [label] [text of the document...]
Note that language/label is not considered in Gensim.
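To make the line layout concrete, here is a minimal standard-library sketch of how one such line could be split into its three parts. It is an illustration only, not Gensim's implementation: the helper name parse_mallet_line and the sample text are hypothetical, and whitespace-separated fields are an assumption.

```python
from collections import Counter


def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, label, tokens).

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least a document id and a label")
    doc_id, label, tokens = fields[0], fields[1], fields[2:]
    return doc_id, label, tokens


# Two made-up documents in the "[id] [label] [text...]" layout.
sample = "doc1 en the quick brown fox\ndoc2 de der schnelle braune fuchs\n"

parsed = [parse_mallet_line(line) for line in sample.splitlines() if line.strip()]

# The label is parsed but plays no part in the bag-of-words counts.
bags = [Counter(tokens) for _doc_id, _label, tokens in parsed]
```

The label is read only so that it can be skipped; the word counts are built from the remaining tokens alone, which mirrors the note above.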
Methods
__init__(fname[, id2word, metadata]) |
docbyoffset(offset) | Return the document stored at file position offset.
line2doc(line) |
load(fname[, mmap]) | Load a previously saved object from file (also see save).
save(*args, **kwargs) |
save_corpus(fname, corpus[, id2word, metadata]) | Save a corpus in the Mallet format.
serialize(serializer, fname, corpus[, ...]) | Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document.
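The byte-offset idea behind serialize and docbyoffset can be sketched with the standard library alone. This is not Gensim's code: the helpers write_with_offsets and read_at_offset, the file name, and the sample documents are all hypothetical, and the sketch assumes one UTF-8 encoded document per line.

```python
import os
import tempfile


def write_with_offsets(path, lines):
    """Write one document per line and return the byte offset where each starts."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before this document is written
            fout.write(line.encode("utf-8") + b"\n")
    return offsets


def read_at_offset(path, offset):
    """Seek straight to a recorded offset and read back a single document line."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").rstrip("\n")


# Made-up documents in the "[id] [label] [text...]" layout.
docs = [
    "doc1 en human interface computer",
    "doc2 en survey user computer system",
    "doc3 en graph trees",
]

path = os.path.join(tempfile.mkdtemp(), "corpus.mallet")
offsets = write_with_offsets(path, docs)

# Random access: fetch the third document without scanning the first two.
third = read_at_offset(path, offsets[2])
```

Recording the offsets at write time is what allows a single document to be fetched later without reading the file from the start.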