tessl/pypi-snowballstemmer

Comprehensive Python stemming library providing 32 stemming algorithms for 30 languages, generated from the Snowball algorithm definitions.

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/snowballstemmer@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-snowballstemmer@3.0.0


Snowballstemmer

A comprehensive Python stemming library providing 32 stemming algorithms for 30 languages, generated from the Snowball algorithm definitions. It lets text processing applications reduce words to their base forms for improved search and analysis, and supports multilingual text processing systems, search engines, and data analysis pipelines.

Package Information

  • Package Name: snowballstemmer
  • Language: Python
  • Installation: pip install snowballstemmer
  • License: BSD-3-Clause
  • Version: 3.0.1

Core Imports

import snowballstemmer

The two module-level API functions can also be imported directly:

from snowballstemmer import algorithms, stemmer

Basic Usage

import snowballstemmer

# Get list of available languages
languages = snowballstemmer.algorithms()
print(f"Available languages: {len(languages)} total")

# Create a stemmer for English
stemmer = snowballstemmer.stemmer('english')

# Stem individual words
stemmed = stemmer.stemWord('running')
print(f"running -> {stemmed}")  # prints: running -> run

# Stem multiple words at once
words = ['running', 'connected', 'connections', 'easily']
stemmed_words = stemmer.stemWords(words)
print(f"Original: {words}")
print(f"Stemmed: {stemmed_words}")

# Use different languages
french_stemmer = snowballstemmer.stemmer('french')
spanish_stemmer = snowballstemmer.stemmer('spanish')

print(f"French: 'connexions' -> {french_stemmer.stemWord('connexions')}")
print(f"Spanish: 'corriendo' -> {spanish_stemmer.stemWord('corriendo')}")

Capabilities

Language Discovery

Retrieve available stemming algorithms and supported languages.

def algorithms():
    """
    Get list of available stemming algorithm names.
    
    Returns:
        list: List of strings representing available language codes
        
    Note:
        Automatically returns C extension algorithms if available, otherwise pure Python algorithms.
        This function checks for the Stemmer C extension and falls back gracefully.
    """

Stemmer Factory

Create stemmer instances for specific languages with automatic fallback between C extension and pure Python implementations.

def stemmer(lang):
    """
    Create a stemmer instance for the specified language.
    
    Parameters:
        lang (str): Language code for desired stemming algorithm.
                   Supports multiple formats: 'english', 'en', 'eng'
    
    Returns:
        Stemmer: A stemmer instance with stemWord() and stemWords() methods
        
    Raises:
        KeyError: If stemming algorithm for language not found
        
    Note:
        Automatically uses C extension (Stemmer.Stemmer) if available, 
        otherwise falls back to pure Python implementation.
    """

Word Stemming

Core stemming functionality for reducing words to their base forms. The stemmer instances returned by stemmer() provide these methods:

# Stemmer instance methods (available on returned stemmer objects)
def stemWord(word):
    """
    Stem a single word to its base form.
    
    Parameters:
        word (str): Word to stem
        
    Returns:
        str: Stemmed word
    """

def stemWords(words):
    """
    Stem multiple words to their base forms.
    
    Parameters:
        words (list): List of words to stem
        
    Returns:
        list: List of stemmed words in same order
    """

Supported Languages

snowballstemmer provides 32 stemming algorithms covering 30 languages, with multiple aliases for each.

Primary Languages (30 total)

  • arabic (ar, ara): Arabic stemming
  • armenian (hy, hye, arm): Armenian stemming
  • basque (eu, eus, baq): Basque stemming
  • catalan (ca, cat): Catalan stemming
  • danish (da, dan): Danish stemming
  • dutch (nl, dut, nld, kraaij_pohlmann): Dutch stemming (Kraaij-Pohlmann algorithm)
  • english (en, eng): English stemming
  • esperanto (eo, epo): Esperanto stemming
  • estonian (et, est): Estonian stemming
  • finnish (fi, fin): Finnish stemming
  • french (fr, fre, fra): French stemming
  • german (de, ger, deu): German stemming
  • greek (el, gre, ell): Greek stemming
  • hindi (hi, hin): Hindi stemming
  • hungarian (hu, hun): Hungarian stemming
  • indonesian (id, ind): Indonesian stemming
  • irish (ga, gle): Irish stemming
  • italian (it, ita): Italian stemming
  • lithuanian (lt, lit): Lithuanian stemming
  • nepali (ne, nep): Nepali stemming
  • norwegian (no, nor): Norwegian stemming
  • portuguese (pt, por): Portuguese stemming
  • romanian (ro, rum, ron): Romanian stemming
  • russian (ru, rus): Russian stemming
  • serbian (sr, srp): Serbian stemming
  • spanish (es, esl, spa): Spanish stemming
  • swedish (sv, swe): Swedish stemming
  • tamil (ta, tam): Tamil stemming
  • turkish (tr, tur): Turkish stemming
  • yiddish (yi, yid): Yiddish stemming

Algorithm Variants (2 additional)

  • porter: Traditional Porter algorithm for English (variant of english)
  • dutch_porter: Martin Porter's Dutch stemmer (variant of dutch)
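
A quick comparison sketch for the two English algorithms (english and the porter variant); the printed stems are not asserted here because they depend on the algorithm definitions shipped with the installed version:

import snowballstemmer

english = snowballstemmer.stemmer('english')  # Snowball English ("Porter2") algorithm
porter = snowballstemmer.stemmer('porter')    # original Porter algorithm

for word in ['generously', 'conditionally', 'fairly']:
    print(f"{word}: english={english.stemWord(word)} porter={porter.stemWord(word)}")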

Character Encoding Support

  • UTF-8: All 32 languages/variants
  • ISO-8859-1: basque, catalan, danish, dutch, english, finnish, french, german, indonesian, irish, italian, norwegian, portuguese, spanish, swedish, porter, dutch_porter
  • ISO-8859-2: hungarian
  • KOI8-R: russian
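
The encoding support above describes the underlying Snowball algorithm builds; from Python, stemmers accept ordinary Unicode str values directly, so non-ASCII input needs no special handling in application code. A small illustrative sketch (outputs not asserted):

import snowballstemmer

# Non-ASCII words are passed as ordinary Unicode strings.
german = snowballstemmer.stemmer('german')
russian = snowballstemmer.stemmer('russian')

print(german.stemWord('häuser'))
print(russian.stemWord('бегущий'))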

Performance Optimization

The library automatically uses C extensions when available for significant performance improvements. The algorithms() and stemmer() functions transparently choose the best available implementation:

import snowballstemmer

# Automatically uses C extension if available, pure Python otherwise
stemmer = snowballstemmer.stemmer('english')

# Both implementations provide identical API
# C extension: faster performance
# Pure Python: broader compatibility
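
One way to see which implementation will be used is to check whether the Stemmer module (provided by the optional PyStemmer package) is importable; this check is a sketch and not part of the snowballstemmer API:

import importlib.util

import snowballstemmer

# The accelerated path relies on the optional "Stemmer" C extension module (PyStemmer).
if importlib.util.find_spec('Stemmer') is not None:
    print('C extension available: snowballstemmer will use it')
else:
    print('C extension not installed: falling back to pure Python stemmers')

# The concrete class of the returned stemmer differs between the two implementations.
stemmer = snowballstemmer.stemmer('english')
print(type(stemmer))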

Error Handling

try:
    # Invalid language code
    stemmer = snowballstemmer.stemmer('klingon')
except KeyError as e:
    print(f"Language not supported: {e}")

# Safe language checking
available_langs = snowballstemmer.algorithms()
if 'german' in available_langs:
    german_stemmer = snowballstemmer.stemmer('german')
else:
    print("German stemming not available")

Advanced Usage Examples

Batch Processing

import snowballstemmer

def process_multilingual_text(text_dict):
    """Process text in multiple languages."""
    results = {}
    
    for lang, words in text_dict.items():
        try:
            stemmer = snowballstemmer.stemmer(lang)
            results[lang] = stemmer.stemWords(words)
        except KeyError:
            print(f"Warning: Language '{lang}' not supported")
            results[lang] = words  # Return original words
    
    return results

# Example usage
texts = {
    'english': ['running', 'connection', 'easily'],
    'french': ['connexions', 'facilement', 'courant'],
    'spanish': ['corriendo', 'conexión', 'fácilmente']
}

stemmed_results = process_multilingual_text(texts)
for lang, words in stemmed_results.items():
    print(f"{lang}: {words}")

Search Index Preparation

import snowballstemmer
import re

class SearchIndexer:
    def __init__(self, language='english'):
        self.stemmer = snowballstemmer.stemmer(language)
        self.word_pattern = re.compile(r'\b\w+\b')
    
    def index_document(self, text):
        """Extract and stem words from document text."""
        words = self.word_pattern.findall(text.lower())
        return self.stemmer.stemWords(words)
    
    def normalize_query(self, query):
        """Normalize search query for matching."""
        words = self.word_pattern.findall(query.lower())
        return self.stemmer.stemWords(words)

# Example usage
indexer = SearchIndexer('english')
document = "The quick brown foxes are running through the connected fields"
query = "quick brown fox running connections"

doc_terms = indexer.index_document(document)
query_terms = indexer.normalize_query(query)

print(f"Document terms: {doc_terms}")
print(f"Query terms: {query_terms}")
# 'foxes'/'fox' and 'connected'/'connections' reduce to shared stems, so the query terms match the document terms
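
Continuing the example above, matching can then be done on the stemmed forms, for instance with a simple set intersection (a real index would also handle weighting and term positions):

# Match query terms against document terms on their stemmed forms.
matches = sorted(set(query_terms) & set(doc_terms))
print(f"Matching stemmed terms: {matches}")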