nltk.SnowballStemmer

class nltk.SnowballStemmer(language, ignore_stopwords=False)

Snowball Stemmer

The following languages are supported: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

The algorithm for English is documented here:

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

The algorithms were developed by Martin Porter. The stemmers are called Snowball because Porter created a programming language of that name for writing new stemming algorithms. More information is available at http://snowball.tartarus.org/

The stemmer is invoked as shown below:

>>> from nltk.stem import SnowballStemmer
>>> print(" ".join(SnowballStemmer.languages)) # See which languages are supported
danish dutch english finnish french german hungarian
italian norwegian porter portuguese romanian russian
spanish swedish
>>> stemmer = SnowballStemmer("german") # Choose a language
>>> stemmer.stem("Autobahnen") # Stem a word
'autobahn'

Invoking the stemmer this way is useful when the language of the text is not known until runtime. If you already know the language, you can instead invoke the language-specific stemmer directly:

>>> from nltk.stem.snowball import GermanStemmer
>>> stemmer = GermanStemmer()
>>> stemmer.stem("Autobahnen")
'autobahn'
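As a minimal sketch of the runtime case described above, the snippet below picks a stemmer per word from a language name that arrives with the data. The `samples` list and the `stemmers` cache are hypothetical names introduced only for illustration; the only NLTK call used is the `SnowballStemmer(language)` constructor and its `stem` method.

```python
from nltk.stem import SnowballStemmer

# Hypothetical input: (language, word) pairs whose language is only
# known once the data has been read.
samples = [("german", "Autobahnen"), ("english", "running")]

# Build each stemmer once and reuse it for later words in that language.
stemmers = {}
results = []
for language, word in samples:
    if language not in stemmers:
        stemmers[language] = SnowballStemmer(language)
    results.append(stemmers[language].stem(word))

print(results)
```

Caching the stemmer objects is a design choice for this sketch, not something the class requires; it simply avoids constructing a new stemmer for every word.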
Parameters:
  • language (str or unicode) – The language whose stemmer subclass is instantiated.
  • ignore_stopwords (bool) – If True, stopwords are not stemmed and are returned unchanged. Defaults to False.
Raises:

ValueError – If there is no stemmer for the specified language.
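One way to handle the ValueError is sketched below, assuming the caller wants to fall back to another language rather than fail. The helper `make_stemmer` and its `fallback` parameter are hypothetical and not part of NLTK. The sketch also lowercases the name first, since the entries in `SnowballStemmer.languages` are lowercase.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language, fallback="english"):
    """Return a stemmer for `language`, or for `fallback` if unsupported.

    Hypothetical helper for illustration; not part of NLTK.
    """
    try:
        # Language names in SnowballStemmer.languages are lowercase.
        return SnowballStemmer(language.lower())
    except ValueError:
        # No stemmer exists for the requested language.
        return SnowballStemmer(fallback)


stemmer = make_stemmer("klingon")  # unsupported, so the fallback is used
stem = stemmer.stem("running")
print(stem)
```

Whether a silent fallback is appropriate depends on the application; in many pipelines it may be safer to log the unsupported language or re-raise the error.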

Methods

__init__(language[, ignore_stopwords])
stem(token) Strip affixes from the token and return the stem.

Attributes

languages