nltk.ngrams()
¶
-
nltk.
ngrams
(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]¶ Return the ngrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import ngrams >>> list(ngrams([1,2,3,4,5], 3)) [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Use ngrams for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True)) [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)] >>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>')) [(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')] >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>')) [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)] >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')) [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
Parameters: - sequence (sequence or iter) – the source data to be converted into ngrams
- n (int) – the degree of the ngrams
- pad_left (bool) – whether the ngrams should be left-padded
- pad_right (bool) – whether the ngrams should be right-padded
- left_pad_symbol (any) – the symbol to use for left padding (default is None)
- right_pad_symbol (any) – the symbol to use for right padding (default is None)
Return type: sequence or iter