tessl/pypi-snowballstemmer

Comprehensive Python stemming library providing 32 stemming algorithms for 30 languages, generated from the Snowball algorithm definitions.

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/snowballstemmer@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-snowballstemmer@3.0.0


Snowballstemmer

A comprehensive Python stemming library providing 32 stemming algorithms for 30 languages, generated from the Snowball algorithm definitions. It lets text processing applications reduce words to their base forms for improved search and analysis, and supports multilingual text processing systems, search engines, and data analysis pipelines.

Package Information

  • Package Name: snowballstemmer
  • Language: Python
  • Installation: pip install snowballstemmer
  • License: BSD-3-Clause
  • Version: 3.0.1

Core Imports

import snowballstemmer

The two module-level API functions can also be imported directly:

from snowballstemmer import algorithms, stemmer

Basic Usage

import snowballstemmer

# Get list of available languages
languages = snowballstemmer.algorithms()
print(f"Available languages: {len(languages)} total")

# Create a stemmer for English
stemmer = snowballstemmer.stemmer('english')

# Stem individual words
stemmed = stemmer.stemWord('running')
print(f"running -> {stemmed}")  # prints: running -> run

# Stem multiple words at once
words = ['running', 'connected', 'connections', 'easily']
stemmed_words = stemmer.stemWords(words)
print(f"Original: {words}")
print(f"Stemmed: {stemmed_words}")

# Use different languages
french_stemmer = snowballstemmer.stemmer('french')
spanish_stemmer = snowballstemmer.stemmer('spanish')

print(f"French: 'connexions' -> {french_stemmer.stemWord('connexions')}")
print(f"Spanish: 'corriendo' -> {spanish_stemmer.stemWord('corriendo')}")

Capabilities

Language Discovery

Retrieve available stemming algorithms and supported languages.

def algorithms():
    """
    Get list of available stemming algorithm names.
    
    Returns:
        list: List of strings representing available language codes
        
    Note:
        Automatically returns C extension algorithms if available, otherwise pure Python algorithms.
        This function checks for the Stemmer C extension and falls back gracefully.
    """

Stemmer Factory

Create stemmer instances for specific languages with automatic fallback between C extension and pure Python implementations.

def stemmer(lang):
    """
    Create a stemmer instance for the specified language.
    
    Parameters:
        lang (str): Language code for desired stemming algorithm.
                   Supports multiple formats: 'english', 'en', 'eng'
    
    Returns:
        Stemmer: A stemmer instance with stemWord() and stemWords() methods
        
    Raises:
        KeyError: If stemming algorithm for language not found
        
    Note:
        Automatically uses C extension (Stemmer.Stemmer) if available, 
        otherwise falls back to pure Python implementation.
    """

Word Stemming

Core stemming functionality for reducing words to their base forms. The stemmer instances returned by stemmer() provide these methods:

# Stemmer instance methods (available on returned stemmer objects)
def stemWord(word):
    """
    Stem a single word to its base form.
    
    Parameters:
        word (str): Word to stem
        
    Returns:
        str: Stemmed word
    """

def stemWords(words):
    """
    Stem multiple words to their base forms.
    
    Parameters:
        words (list): List of words to stem
        
    Returns:
        list: List of stemmed words in same order
    """

Supported Languages

snowballstemmer provides 32 stemming algorithms covering 30 languages, with multiple aliases for each.

Primary Languages (30 total)

  • arabic (ar, ara): Arabic stemming
  • armenian (hy, hye, arm): Armenian stemming
  • basque (eu, eus, baq): Basque stemming
  • catalan (ca, cat): Catalan stemming
  • danish (da, dan): Danish stemming
  • dutch (nl, dut, nld, kraaij_pohlmann): Dutch stemming (Kraaij-Pohlmann algorithm)
  • english (en, eng): English stemming
  • esperanto (eo, epo): Esperanto stemming
  • estonian (et, est): Estonian stemming
  • finnish (fi, fin): Finnish stemming
  • french (fr, fre, fra): French stemming
  • german (de, ger, deu): German stemming
  • greek (el, gre, ell): Greek stemming
  • hindi (hi, hin): Hindi stemming
  • hungarian (hu, hun): Hungarian stemming
  • indonesian (id, ind): Indonesian stemming
  • irish (ga, gle): Irish stemming
  • italian (it, ita): Italian stemming
  • lithuanian (lt, lit): Lithuanian stemming
  • nepali (ne, nep): Nepali stemming
  • norwegian (no, nor): Norwegian stemming
  • portuguese (pt, por): Portuguese stemming
  • romanian (ro, rum, ron): Romanian stemming
  • russian (ru, rus): Russian stemming
  • serbian (sr, srp): Serbian stemming
  • spanish (es, esl, spa): Spanish stemming
  • swedish (sv, swe): Swedish stemming
  • tamil (ta, tam): Tamil stemming
  • turkish (tr, tur): Turkish stemming
  • yiddish (yi, yid): Yiddish stemming

Algorithm Variants (2 additional)

  • porter: Traditional Porter algorithm for English (variant of english)
  • dutch_porter: Martin Porter's Dutch stemmer (variant of dutch)
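
A quick comparison sketch for the two English algorithms (english and the porter variant); the printed stems are not asserted here because they depend on the algorithm definitions shipped with the installed version:

import snowballstemmer

english = snowballstemmer.stemmer('english')  # Snowball English ("Porter2") algorithm
porter = snowballstemmer.stemmer('porter')    # original Porter algorithm

for word in ['generously', 'conditionally', 'fairly']:
    print(f"{word}: english={english.stemWord(word)} porter={porter.stemWord(word)}")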

Character Encoding Support

  • UTF-8: All 32 languages/variants
  • ISO-8859-1: basque, catalan, danish, dutch, english, finnish, french, german, indonesian, irish, italian, norwegian, portuguese, spanish, swedish, porter, dutch_porter
  • ISO-8859-2: hungarian
  • KOI8-R: russian
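
The encoding support above describes the underlying Snowball algorithm builds; from Python, stemmers accept ordinary Unicode str values directly, so non-ASCII input needs no special handling in application code. A small illustrative sketch (outputs not asserted):

import snowballstemmer

# Non-ASCII words are passed as ordinary Unicode strings.
german = snowballstemmer.stemmer('german')
russian = snowballstemmer.stemmer('russian')

print(german.stemWord('häuser'))
print(russian.stemWord('бегущий'))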

Performance Optimization

The library automatically uses C extensions when available for significant performance improvements. The algorithms() and stemmer() functions transparently choose the best available implementation:

import snowballstemmer

# Automatically uses C extension if available, pure Python otherwise
stemmer = snowballstemmer.stemmer('english')

# Both implementations provide identical API
# C extension: faster performance
# Pure Python: broader compatibility
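
One way to see which implementation will be used is to check whether the Stemmer module (provided by the optional PyStemmer package) is importable; this check is a sketch and not part of the snowballstemmer API:

import importlib.util

import snowballstemmer

# The accelerated path relies on the optional "Stemmer" C extension module (PyStemmer).
if importlib.util.find_spec('Stemmer') is not None:
    print('C extension available: snowballstemmer will use it')
else:
    print('C extension not installed: falling back to pure Python stemmers')

# The concrete class of the returned stemmer differs between the two implementations.
stemmer = snowballstemmer.stemmer('english')
print(type(stemmer))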

Error Handling

try:
    # Invalid language code
    stemmer = snowballstemmer.stemmer('klingon')
except KeyError as e:
    print(f"Language not supported: {e}")

# Safe language checking
available_langs = snowballstemmer.algorithms()
if 'german' in available_langs:
    german_stemmer = snowballstemmer.stemmer('german')
else:
    print("German stemming not available")

Advanced Usage Examples

Batch Processing

import snowballstemmer

def process_multilingual_text(text_dict):
    """Process text in multiple languages."""
    results = {}
    
    for lang, words in text_dict.items():
        try:
            stemmer = snowballstemmer.stemmer(lang)
            results[lang] = stemmer.stemWords(words)
        except KeyError:
            print(f"Warning: Language '{lang}' not supported")
            results[lang] = words  # Return original words
    
    return results

# Example usage
texts = {
    'english': ['running', 'connection', 'easily'],
    'french': ['connexions', 'facilement', 'courant'],
    'spanish': ['corriendo', 'conexión', 'fácilmente']
}

stemmed_results = process_multilingual_text(texts)
for lang, words in stemmed_results.items():
    print(f"{lang}: {words}")

Search Index Preparation

import snowballstemmer
import re

class SearchIndexer:
    def __init__(self, language='english'):
        self.stemmer = snowballstemmer.stemmer(language)
        self.word_pattern = re.compile(r'\b\w+\b')
    
    def index_document(self, text):
        """Extract and stem words from document text."""
        words = self.word_pattern.findall(text.lower())
        return self.stemmer.stemWords(words)
    
    def normalize_query(self, query):
        """Normalize search query for matching."""
        words = self.word_pattern.findall(query.lower())
        return self.stemmer.stemWords(words)

# Example usage
indexer = SearchIndexer('english')
document = "The quick brown foxes are running through the connected fields"
query = "quick brown fox running connections"

doc_terms = indexer.index_document(document)
query_terms = indexer.normalize_query(query)

print(f"Document terms: {doc_terms}")
print(f"Query terms: {query_terms}")
# 'foxes'/'fox' and 'connected'/'connections' reduce to shared stems, so the query terms match the document terms
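
Continuing the example above, matching can then be done on the stemmed forms, for instance with a simple set intersection (a real index would also handle weighting and term positions):

# Match query terms against document terms on their stemmed forms.
matches = sorted(set(query_terms) & set(doc_terms))
print(f"Matching stemmed terms: {matches}")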