gensim.corpora.SvmLightCorpus
class gensim.corpora.SvmLightCorpus(fname, store_labels=True)
Corpus in SVMlight format.
Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
The “qid” feature (used for SVMlight ranking), if present, is ignored.
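To make the grammar above concrete, here is a minimal pure-Python sketch of parsing a single line. It is not gensim's implementation: `parse_svmlight_line` is a hypothetical helper written for illustration, and it returns feature ids exactly as they appear in the file.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line into (target, [(feature_id, value), ...]).

    Hypothetical helper for illustration only. Returns None for blank
    lines and comment lines. Feature ids are returned as written in the file.
    """
    # Drop the trailing "# <info>" part, if any; this also empties comment lines.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    parts = line.split()
    target = float(parts[0])
    pairs = []
    for item in parts[1:]:
        feature, value = item.split(":", 1)
        if feature == "qid":
            continue  # ranking-only feature, ignored
        pairs.append((int(feature), float(value)))
    return target, pairs


example = parse_svmlight_line("+1 qid:3 1:0.5 4:2 # first example")
print(example)  # (1.0, [(1, 0.5), (4, 2.0)])
```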
Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert feature ids to 0-based internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
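The id shift can be sketched in a few lines of plain Python. The helper names `ids_to_zero_based` and `doc_to_svmlight_line` are assumptions made for this example and are not part of gensim; the sketch only shows the decrement on load and the increment on save.

```python
def ids_to_zero_based(pairs):
    """Shift 1-based SVMlight feature ids to 0-based ids (the load direction)."""
    return [(feature_id - 1, value) for feature_id, value in pairs]


def doc_to_svmlight_line(doc, label=0):
    """Format a 0-based document as one SVMlight line (ids shifted back to 1-based)."""
    pairs = " ".join(f"{feature_id + 1}:{value}" for feature_id, value in doc)
    return f"{label} {pairs}\n"


on_disk = [(1, 0.5), (4, 2.0)]  # ids as they appear in the file
in_memory = ids_to_zero_based(on_disk)
print(in_memory)  # [(0, 0.5), (3, 2.0)]
print(doc_to_svmlight_line(in_memory, label=1), end="")  # 1 1:0.5 4:2.0
```

A round trip through both helpers leaves the ids in the file unchanged, which is the property the conversion is meant to preserve.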
Methods
__init__(fname[, store_labels]) | Initialize the corpus from a file.
doc2line(doc[, label]) | Output the document in SVMlight format, as a string.
docbyoffset(offset) | Return the document stored at file position offset.
line2doc(line) | Create a document from a single line (string) in SVMlight format.
load(fname[, mmap]) | Load a previously saved object from file (also see save).
save(*args, **kwargs) |
save_corpus(fname, corpus[, id2word, ...]) | Save a corpus in the SVMlight format.
serialize(serializer, fname, corpus[, ...]) | Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document.
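The byte-offset idea behind serialize and docbyoffset can be illustrated without gensim. The following is a rough sketch under stated assumptions, not the library's code: `write_with_offsets` and `line_at_offset` are hypothetical helpers, and the file name is made up for the example.

```python
import os
import tempfile


def write_with_offsets(fname, lines):
    """Write each line to fname and return the byte offset where each one starts."""
    offsets = []
    with open(fname, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before writing this line
            fout.write(line.encode("utf8"))
    return offsets


def line_at_offset(fname, offset):
    """Seek to a recorded byte offset and read the single line stored there."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


lines = ["1 1:0.5 4:2.0\n", "-1 2:1.0\n", "0 3:0.25 7:1.0\n"]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "corpus.svmlight")  # hypothetical file name
    offsets = write_with_offsets(path, lines)
    second = line_at_offset(path, offsets[1])

print(offsets)  # [0, 14, 23]
print(second, end="")  # -1 2:1.0
```

Recording offsets at write time is what allows a single document to be fetched later with one seek instead of a scan from the start of the file.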