nltk.ngrams()

nltk.ngrams(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]

Return the ngrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Use ngrams for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:

>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
Parameters:
  • sequence (sequence or iter) – the source data to be converted into ngrams
  • n (int) – the degree of the ngrams
  • pad_left (bool) – whether the ngrams should be left-padded
  • pad_right (bool) – whether the ngrams should be right-padded
  • left_pad_symbol (any) – the symbol to use for left padding (default is None)
  • right_pad_symbol (any) – the symbol to use for right padding (default is None)
Return type:

sequence or iter