nltk.tokenize.TextTilingTokenizer

class nltk.tokenize.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at the gaps between pseudosentences. The algorithm computes, for each gap, how far its score sits below the nearest peaks on either side (its depth score) and marks the deepest gaps as boundaries. Finally, the boundaries are normalized to the closest paragraph breaks and the segmented text is returned.
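
A minimal sketch of ordinary (non-demo) use, on the same Brown sample as the doctest further below; with demo_mode left at its default, tokenize() returns a list with one string per detected topical section:

>>> from nltk.corpus import brown
>>> from nltk.tokenize import TextTilingTokenizer
>>> tt = TextTilingTokenizer()
>>> sections = tt.tokenize(brown.raw()[:10000])
>>> all(isinstance(sec, str) for sec in sections)
True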

Parameters:
  • w (int) – Pseudosentence size (number of words per pseudosentence)
  • k (int) – Size (in pseudosentences) of the blocks used in the block comparison method
  • similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.
  • stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)
  • smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
  • smoothing_width (int) – The width of the window used by the smoothing method
  • smoothing_rounds (int) – The number of smoothing passes
  • cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC
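
The constants above (BLOCK_COMPARISON, VOCABULARY_INTRODUCTION, DEFAULT_SMOOTHING, HC, LC) are module-level values in nltk.tokenize.texttiling. A sketch of one possible tuning; the particular values are illustrative, not recommendations:

>>> from nltk.tokenize.texttiling import TextTilingTokenizer, LC
>>> tt = TextTilingTokenizer(
...     w=40,              # longer pseudosentences
...     k=5,               # smaller comparison blocks
...     cutoff_policy=LC,  # a lower cutoff, which typically admits more boundaries than HC
... )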

When demo_mode=True, tokenize() returns the gap scores, the smoothed scores, the depth scores, and a list of boundary indicators instead of the segmented text:

>>> from nltk.corpus import brown
>>> from nltk.tokenize import TextTilingTokenizer
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:10000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
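
Here s, ss, d, and b are the gap scores, smoothed scores, depth scores, and boundary indicators, respectively. A sketch that plots them, assuming matplotlib is installed (nltk.tokenize.texttiling.demo() draws a similar figure):

>>> from matplotlib import pyplot as plt
>>> _ = plt.plot(s, label="gap scores")
>>> _ = plt.plot(ss, label="smoothed scores")
>>> _ = plt.plot(d, label="depth scores")
>>> _ = plt.stem(range(len(b)), b)  # boundary indicators
>>> _ = plt.legend()
>>> plt.show()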

Methods

__init__([w, k, similarity_method, ...])

span_tokenize(s)
    Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

span_tokenize_sents(strings)
    Apply self.span_tokenize() to each element of strings.

tokenize(text)
    Return a tokenized copy of text, where each “token” represents a separate topic.

tokenize_sents(strings)
    Apply self.tokenize() to each element of strings.
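
tokenize_sents() is inherited from the TokenizerI base class and simply maps tokenize() over a list of strings. A sketch reusing the Brown sample (the slice offsets are arbitrary):

>>> docs = [brown.raw()[:10000], brown.raw()[10000:20000]]
>>> sections_per_doc = TextTilingTokenizer().tokenize_sents(docs)
>>> len(sections_per_doc)
2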