gensim.corpora.SvmLightCorpus

class gensim.corpora.SvmLightCorpus(fname, store_labels=True)

Corpus in SVMlight format.

Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

The “qid” feature (used for SVMlight ranking), if present, is ignored.
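As a rough illustration of the grammar above, the following Python sketch splits one line into its target, feature pairs, and trailing info. The helper name `parse_svmlight_line` is hypothetical and is not part of gensim; feature ids are returned exactly as they appear in the file, and any "qid" feature is skipped.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line into (target, [(feature_id, value), ...], info).

    Hypothetical helper for illustration only; not gensim's implementation.
    Returns None for blank lines and for comment lines starting with '#'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Everything after the first '#' is the optional <info> string.
    body, _, info = line.partition("#")
    parts = body.split()
    target = float(parts[0])
    features = []
    for token in parts[1:]:
        name, _, value = token.partition(":")
        if name == "qid":
            continue  # ranking query id: ignored, as described above
        features.append((int(name), float(value)))
    return target, features, info.strip()
```

Calling `parse_svmlight_line("+1 qid:3 1:0.5 4:2 # doc A")` yields the target `1.0`, the pairs `[(1, 0.5), (4, 2.0)]`, and the info string `"doc A"`.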

Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert features to 0-based internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
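The id shift can be sketched in plain Python as follows. This is a minimal illustration of the decrement-on-load, increment-on-save idea, not gensim's actual code; the function names `line_to_doc` and `doc_to_line` are hypothetical, and the sketch keeps the target as a plain string.

```python
def line_to_doc(line):
    """Return (doc, target); doc is a list of 0-based (feature_id, value) pairs."""
    body = line.split("#", 1)[0]  # drop the optional trailing <info>
    parts = body.split()
    target = parts[0]
    doc = []
    for token in parts[1:]:
        fid, value = token.split(":")
        if fid == "qid":
            continue  # ignored
        doc.append((int(fid) - 1, float(value)))  # 1-based -> 0-based
    return doc, target


def doc_to_line(doc, target="0"):
    """Format a 0-based doc back into an SVMlight line with 1-based ids."""
    pairs = " ".join("%d:%s" % (fid + 1, value) for fid, value in doc)
    return "%s %s" % (target, pairs)
```

With these helpers, `line_to_doc("1 1:0.5 3:2.0")` gives the document `[(0, 0.5), (2, 2.0)]` with target `"1"`, and passing that result to `doc_to_line` reproduces the original line.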

Methods

__init__(fname[, store_labels]) Initialize the corpus from a file.
doc2line(doc[, label]) Output the document in SVMlight format, as a string.
docbyoffset(offset) Return the document stored at file position offset.
line2doc(line) Create a document from a single line (string) in SVMlight format.
load(fname[, mmap]) Load a previously saved object from file (also see save).
save(*args, **kwargs)
save_corpus(fname, corpus[, id2word, ...]) Save a corpus in the SVMlight format.
serialize(serializer, fname, corpus[, ...]) Iterate through the document stream corpus, saving the documents to fname and recording byte offset of each document.