nltk.PunktSentenceTokenizer

class nltk.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)

A sentence tokenizer that uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
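As a minimal sketch of the train-then-tokenize flow described above, the snippet below passes a small training text to the constructor and then splits a new string. The training text and the example sentence are hypothetical; a useful model would normally need a much larger corpus, so the resulting split is not shown here.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training text, far too small for a reliable model;
# it only illustrates the call sequence.
train_text = (
    "Dr. Smith arrived at 9 a.m. on Monday. He met Dr. Jones in the lab. "
    "Dr. Jones had prepared the samples. The meeting ended at 11 a.m. sharp."
)

# Passing train_text to the constructor trains the model immediately.
tokenizer = PunktSentenceTokenizer(train_text)

# Use the trained model to split previously unseen text.
text = "Dr. Smith thanked Dr. Jones. They agreed to meet again."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

Whether "Dr." is treated as an abbreviation depends on what the model learned from the training text, so results on such a small sample may vary.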

Methods

__init__([train_text, verbose, lang_vars, ...]) train_text can either be the sole training text for this sentence boundary detection system, or can be a PunktParameters object.
debug_decisions(text) Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
dump(tokens)
sentences_from_text(text[, realign_boundaries]) Given a text, generates the sentences in that text by only testing candidate sentence breaks.
sentences_from_text_legacy(text) Given a text, generates the sentences in that text.
sentences_from_tokens(tokens) Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
span_tokenize(text[, realign_boundaries]) Given a text, returns a list of the (start, end) spans of sentences in the text.
span_tokenize_sents(strings) Apply self.span_tokenize() to each element of strings.
text_contains_sentbreak(text) Returns True if the given text includes a sentence break.
tokenize(text[, realign_boundaries]) Given a text, returns a list of the sentences in that text.
tokenize_sents(strings) Apply self.tokenize() to each element of strings.
train(train_text[, verbose]) Derives parameters from a given training text, or uses the parameters given.
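The following sketch shows how several of the methods listed above relate to each other, using an untrained tokenizer with default parameters. The sample strings are hypothetical, and `span_tokenize` is wrapped in `list()` on the assumption that some NLTK versions return a generator rather than a list.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer: no learned abbreviations or collocations.
tokenizer = PunktSentenceTokenizer()
text = "The first run failed. The second run passed."

# tokenize() returns the sentence strings.
sentences = tokenizer.tokenize(text)

# span_tokenize() gives (start, end) offsets into the original text.
spans = list(tokenizer.span_tokenize(text))

# Slicing the text with those spans reproduces the sentences.
from_spans = [text[start:end] for start, end in spans]
print(sentences)
print(spans)

# tokenize_sents() applies tokenize() to each string in a sequence.
batches = tokenizer.tokenize_sents([text, "One sentence only."])
print(batches)

# debug_decisions() yields one dict per candidate period; the keys are
# printed as-is here because they may differ between NLTK versions.
for decision in tokenizer.debug_decisions(text):
    print(decision)
```

Working with spans rather than strings is useful when sentence positions need to be mapped back to the source text, for example for highlighting or annotation.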

Attributes

PUNCTUATION