nltk.PunktSentenceTokenizer
¶
-
class
nltk.
PunktSentenceTokenizer
(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]¶ A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
Methods¶
__init__ ([train_text, verbose, lang_vars, ...]) |
train_text can either be the sole training text for this sentence |
debug_decisions (text) |
Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made. |
dump (tokens) |
|
sentences_from_text (text[, realign_boundaries]) |
Given a text, generates the sentences in that text by only testing candidate sentence breaks. |
sentences_from_text_legacy (text) |
Given a text, generates the sentences in that text. |
sentences_from_tokens (tokens) |
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence. |
span_tokenize (text[, realign_boundaries]) |
Given a text, returns a list of the (start, end) spans of sentences in the text. |
span_tokenize_sents (strings) |
Apply self.span_tokenize() to each element of strings . |
text_contains_sentbreak (text) |
Returns True if the given text includes a sentence break. |
tokenize (text[, realign_boundaries]) |
Given a text, returns a list of the sentences in that text. |
tokenize_sents (strings) |
Apply self.tokenize() to each element of strings . |
train (train_text[, verbose]) |
Derives parameters from a given training text, or uses the parameters given. |