nltk.LineTokenizer

class nltk.LineTokenizer(blanklines='discard')    [source]

    Tokenize a string into its lines, optionally discarding blank lines.
    This is similar to s.split('\n').

        >>> from nltk.tokenize import LineTokenizer
        >>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
        >>> LineTokenizer(blanklines='keep').tokenize(s)
        ['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', '', 'Thanks.']
        >>> # same as [l for l in s.split('\n') if l.strip()]:
        >>> LineTokenizer(blanklines='discard').tokenize(s)
        ['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', 'Thanks.']
    Parameters:
        blanklines -- Indicates how blank lines should be handled. Valid values are:

            'discard': strip blank lines out of the token list before returning it.
                A line is considered blank if it contains only whitespace characters.
            'keep': leave all blank lines in the token list.
            'discard-eof': if the string ends with a newline, then do not generate
                a corresponding token '' after that newline.
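The three blanklines modes can be approximated in a few lines of pure Python. The helper below is a hypothetical sketch of the documented behavior, not NLTK's actual implementation, which may differ in edge cases:

```python
def line_tokenize(s, blanklines="discard"):
    """Sketch of LineTokenizer's documented blanklines modes (hypothetical helper,
    not NLTK's real code)."""
    lines = s.split("\n")
    if blanklines == "discard":
        # Drop every line containing only whitespace.
        lines = [line for line in lines if line.strip()]
    elif blanklines == "discard-eof":
        # A trailing newline yields a final '' token; drop just that one.
        if lines and not lines[-1].strip():
            lines.pop()
    return lines
```

For example, line_tokenize("two of them.\n\nThanks.\n", "discard-eof") keeps the interior blank line but drops the empty token produced by the trailing newline, returning ['two of them.', '', 'Thanks.'].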
Methods

    __init__([blanklines])
    span_tokenize(s)
    span_tokenize_sents(strings)
        Apply self.span_tokenize() to each element of strings.
    tokenize(s)
    tokenize_sents(strings)
        Apply self.tokenize() to each element of strings.
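Whereas tokenize(s) returns the line tokens themselves, span_tokenize(s) yields (start, end) character offsets into s. A minimal sketch of that idea for the blanklines='discard' case (a hypothetical helper, not NLTK's implementation):

```python
import re

def line_span_tokenize(s):
    """Yield (start, end) offsets of each non-blank line of s, mimicking
    span_tokenize() with blanklines='discard' (sketch, not NLTK's code)."""
    for match in re.finditer(r"[^\n]+", s):
        if match.group().strip():  # skip whitespace-only lines
            yield match.span()
```

Slicing with each span recovers the corresponding token: [s[a:b] for a, b in line_span_tokenize(s)] gives the same list as tokenizing with blanklines='discard'.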