gensim.models.Phrases.__init__

Phrases.__init__(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_')

Initialize the model from an iterable of sentences. Each sentence must be a list of words (unicode strings) that will be used for training.

The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, without storing everything in RAM. See BrownCorpus, Text8Corpus or LineSentence in the gensim.models.word2vec module for such examples.
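A minimal construction sketch (the toy corpus and the file name below are purely illustrative, not part of the documented API):

>>> from gensim.models import Phrases
>>> # a tiny in-memory corpus: each sentence is a list of unicode tokens
>>> sentences = [
...     [u'new', u'york', u'is', u'big'],
...     [u'new', u'york', u'has', u'tall', u'buildings'],
...     [u'i', u'love', u'new', u'york'],
... ]
>>> bigram = Phrases(sentences)
>>> # for a large corpus, stream from disk instead, e.g. one sentence per line:
>>> # from gensim.models.word2vec import LineSentence
>>> # bigram = Phrases(LineSentence('my_corpus.txt'))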

min_count ignores all words and bigrams with a total collected count lower than this.

threshold represents a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold, where N is the total vocabulary size.
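To make the scoring concrete, here is a back-of-the-envelope check with made-up counts (all numbers are invented for illustration):

>>> # suppose 'new' occurs 3 times, 'york' 3 times, the pair 'new york' 3 times,
>>> # min_count is 1 and the total vocabulary size N is 10
>>> cnt_a, cnt_b, cnt_ab = 3, 3, 3
>>> min_count, N = 1, 10
>>> score = (cnt_ab - min_count) * N / float(cnt_a * cnt_b)
>>> round(score, 2)
2.22
>>> score > 10.0   # with the default threshold, this pair would not be promoted
False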

max_vocab_size is the maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM; increase/decrease max_vocab_size depending on how much available memory you have.
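If RAM is tight, a smaller cap can be passed at construction time (the 10M figure below is an example value, not a recommendation):

>>> # prune the internal vocabulary more aggressively on a memory-constrained machine
>>> bigram = Phrases(sentences, max_vocab_size=10000000)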

delimiter is the glue character used to join collocation tokens, and should be a byte string (e.g. b'_').
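For example, a custom delimiter only changes the glue between promoted tokens; whether a given pair is actually promoted still depends on its counts and the threshold (the output shown is illustrative):

>>> bigram = Phrases(sentences, min_count=1, threshold=2, delimiter=b'_')
>>> bigram[[u'visit', u'new', u'york']]   # e.g. [u'visit', u'new_york']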