nltk.LineTokenizer

class nltk.LineTokenizer(blanklines='discard')

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters: blanklines

Indicates how blank lines should be handled. Valid values are:

  • discard: strip blank lines out of the token list before returning it.
    A line is considered blank if it contains only whitespace characters.
  • keep: leave all blank lines in the token list.
  • discard-eof: if the string ends with a newline, then do not generate
    a corresponding token '' after that newline.
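The three modes above can be sketched in plain Python. This is a minimal illustrative reimplementation, not the library's own code (NLTK's actual implementation may differ in details):

```python
def tokenize_lines(s, blanklines='discard'):
    """Split s into lines, handling blanks per the blanklines mode."""
    lines = s.splitlines()  # already drops the empty piece after a trailing '\n'
    if blanklines == 'discard':
        # Remove every line that contains only whitespace.
        lines = [l for l in lines if l.strip()]
    elif blanklines == 'discard-eof':
        # Drop only a blank final line; interior blank lines survive.
        if lines and not lines[-1].strip():
            lines.pop()
    return lines

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
print(tokenize_lines(s, 'keep'))  # blank line kept as ''
print(tokenize_lines(s))          # blank line discarded
```

Note how 'discard-eof' differs from 'discard': given "a\n\nb\n\n", it removes only the blank line at the end of the string, while the interior blank line is kept.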

Methods

  • __init__([blanklines]) Construct the tokenizer; blanklines defaults to 'discard'.
  • span_tokenize(s) Identify the tokens in s using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
  • span_tokenize_sents(strings) Apply self.span_tokenize() to each element of strings.
  • tokenize(s) Return a tokenized copy of s.
  • tokenize_sents(strings) Apply self.tokenize() to each element of strings.
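span_tokenize reports each line as a pair of character offsets rather than a substring, which is useful when you need to map tokens back into the original string. A rough sketch of the idea (illustrative only; it assumes blanklines='keep' semantics and is not the library's implementation):

```python
def span_tokenize_lines(s):
    """Yield (start, end) offsets for each line in s, keeping blank lines."""
    start = 0
    for line in s.split('\n'):
        end = start + len(line)
        yield (start, end)  # s[start:end] == line
        start = end + 1     # skip past the '\n' separator

s = "Good muffins cost $3.88\nin New York."
for start, end in span_tokenize_lines(s):
    print((start, end), repr(s[start:end]))
```

Slicing the original string with each span recovers exactly the tokens that tokenize would return in 'keep' mode.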