tessl/pypi-pystemmer

High-performance Python interface to Snowball stemming algorithms for information retrieval and text processing.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/pystemmer@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-pystemmer@3.0.0

PyStemmer

PyStemmer provides access to efficient algorithms for calculating a "stemmed" form of a word by wrapping the libstemmer library from the Snowball project in a Python module. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing "cycling" given the query "cycles".

Package Information

Package Name: PyStemmer
Package Type: pypi
Language: Python
Installation: pip install pystemmer
License: MIT, BSD

Core Imports

import Stemmer

Basic Usage

import Stemmer

# Get list of available algorithms
algorithms = Stemmer.algorithms()
print(algorithms)  # ['arabic', 'armenian', 'basque', 'catalan', ...]

# Create a stemmer instance for English
stemmer = Stemmer.Stemmer('english')

# Stem a single word
stemmed = stemmer.stemWord('cycling')
print(stemmed)  # 'cycl'

# Stem multiple words
stems = stemmer.stemWords(['cycling', 'cyclist', 'cycles'])
print(stems)  # ['cycl', 'cyclist', 'cycl']

# Configure cache size (default: 10000)
stemmer.maxCacheSize = 5000

# Disable cache entirely
stemmer.maxCacheSize = 0
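
Building on the calls above, the search use case from the introduction can be sketched as a small stem-based inverted index. This is a minimal illustration, not part of PyStemmer: `build_index`, `search`, and `demo` are hypothetical helper names, and the whitespace tokenizer is an assumption made for brevity.

```python
from collections import defaultdict

def build_index(docs, stem_words):
    """Map each stem to the set of document ids that contain it.

    `docs` is a dict of id -> text. `stem_words` is any callable that turns a
    list of words into a list of stems, such as a Stemmer instance's stemWords.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for stem in stem_words(text.lower().split()):
            index[stem].add(doc_id)
    return index

def search(index, query, stem_words):
    """Return ids of documents that share at least one stem with the query."""
    hits = set()
    for stem in stem_words(query.lower().split()):
        hits |= index.get(stem, set())
    return hits

def demo():
    # PyStemmer is imported here so the helpers above stay library-agnostic.
    import Stemmer
    stem_words = Stemmer.Stemmer("english").stemWords
    docs = {1: "cycling in the rain", 2: "a quiet walk"}
    index = build_index(docs, stem_words)
    return search(index, "cycles", stem_words)
```

Given the stems shown above ('cycling' and 'cycles' both reduce to 'cycl'), calling `demo()` should match document 1 only.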

Architecture

PyStemmer wraps the libstemmer_c library through Cython extensions for high performance. The Stemmer class maintains internal state, including a cache that speeds up stemming of repeated words. Each stemmer instance is tied to a single language algorithm and is not thread-safe, so create a separate instance per thread for concurrent use.

Capabilities

Algorithm Discovery

Query available stemming algorithms and get version information.

def algorithms(aliases=False):
    """
    Get a list of the names of the available stemming algorithms.
    
    Args:
        aliases (bool, optional): If False (default), returns only canonical 
            algorithm names; if True, includes aliases
    
    Returns:
        list: List of strings containing algorithm names
    """

def version():
    """
    Get the version string of the stemming module.
    
    Note: This returns the internal libstemmer version (currently '2.0.1'),
    which may differ from the PyStemmer package version.
    
    Returns:
        str: Version string for the internal stemmer module
    """

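One way to use algorithm discovery is to validate a user-supplied language name before constructing a stemmer. The sketch below assumes a hypothetical `pick_algorithm` helper and an arbitrary fallback of 'english'; only `Stemmer.algorithms`, `Stemmer.version`, and `Stemmer.Stemmer` come from the API described here.

```python
def pick_algorithm(requested, available, fallback="english"):
    """Return the normalized `requested` name if it is available, else `fallback`."""
    name = requested.strip().lower()
    return name if name in available else fallback

def demo(requested="Porter"):
    import Stemmer  # PyStemmer
    # Include aliases so that alternative names for an algorithm are accepted too.
    names = set(Stemmer.algorithms(aliases=True))
    stemmer = Stemmer.Stemmer(pick_algorithm(requested, names))
    return stemmer, Stemmer.version()
```

Silently falling back to a default is a design choice made for this illustration; raising an error for unknown names may suit stricter applications better.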
Stemmer Class

Core stemming functionality with caching support for high performance.

class Stemmer:
    def __init__(self, algorithm, maxCacheSize=10000):
        """
        Initialize a stemmer for the specified algorithm.
        
        Args:
            algorithm (str): Name of stemming algorithm to use (from algorithms() list)
            maxCacheSize (int, optional): Maximum cache size, default 10000, 
                set to 0 to disable cache
        
        Raises:
            KeyError: If algorithm not found
        """
    
    @property
    def maxCacheSize(self):
        """
        Maximum number of entries to allow in the cache.
        
        This may be set to zero to disable the cache entirely.
        The maximum cache size may be set at any point. Setting a smaller
        maximum size will trigger cache purging using an LRU-style algorithm
        that removes less recently used entries.
        
        Returns:
            int: Current maximum cache size
        """
    
    @maxCacheSize.setter  
    def maxCacheSize(self, size):
        """
        Set maximum cache size.
        
        Args:
            size (int): New maximum size (0 disables cache completely)
        """
        
    def stemWord(self, word):
        """
        Stem a single word.
        
        Args:
            word (str or unicode): Word to stem, UTF-8 encoded string or unicode object
        
        Returns:
            str or unicode: Stemmed word (same type as input)
        """
    
    def stemWords(self, words):
        """
        Stem a sequence of words.
        
        Args:
            words (sequence): Sequence, iterator, or generator of words to stem
        
        Returns:
            list: List of stemmed words (preserves individual word encoding types)
        """

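Because stemWords accepts iterators and generators, tokens can be streamed into it without building an intermediate list. The following is a sketch under that assumption; `iter_tokens`, `stem_text`, and the regular-expression tokenizer are hypothetical and deliberately simple.

```python
import re

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only; a simplistic tokenizer

def iter_tokens(text):
    """Yield lowercase word tokens from `text` one at a time."""
    for match in WORD_RE.finditer(text):
        yield match.group().lower()

def stem_text(text, stemmer):
    """Stem every token in `text`, passing a generator straight to stemWords."""
    return stemmer.stemWords(iter_tokens(text))

def demo():
    import Stemmer  # PyStemmer
    return stem_text("Cycling cyclists love cycles", Stemmer.Stemmer("english"))
```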
Supported Languages

PyStemmer supports 25+ languages through the Snowball algorithms:

European: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Romanian, Hungarian, Greek, Catalan, Basque, Irish, Lithuanian
Middle Eastern: Arabic, Turkish
Asian: Hindi, Indonesian, Nepali, Tamil
Other: Yiddish, Serbian, Armenian, Esperanto

Special algorithms:

porter: Classic Porter stemming algorithm for English (for research/compatibility)
english: Improved Snowball English algorithm (recommended for most users)
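
To see how porter and english differ on a given vocabulary, the two can be run side by side. This sketch uses a hypothetical `differing_stems` helper that works with any objects exposing stemWord; no specific outputs are claimed here, since they depend on the word list.

```python
def differing_stems(words, stemmers):
    """Return {word: {name: stem}} for the words on which the stemmers disagree."""
    result = {}
    for word in words:
        stems = {name: s.stemWord(word) for name, s in stemmers.items()}
        if len(set(stems.values())) > 1:
            result[word] = stems
    return result

def demo(words):
    import Stemmer  # PyStemmer
    stemmers = {name: Stemmer.Stemmer(name) for name in ("porter", "english")}
    return differing_stems(words, stemmers)
```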

Thread Safety

Stemmer instances are not thread-safe. For concurrent processing:

Recommended: Create separate Stemmer instances for each thread
Alternative: Use threading locks to protect shared stemmer access (may reduce performance)

The stemmer code itself is re-entrant, so multiple instances can run concurrently without issues.
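
The recommended one-instance-per-thread approach can be sketched with `threading.local`. The names `PerThread`, `make_english_stemmer`, and `stem_batches` are hypothetical; this is one possible arrangement under the constraints above, not a pattern prescribed by PyStemmer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class PerThread:
    """Lazily create one object per thread by calling `factory`."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Each thread sees its own attributes on a threading.local object.
        if not hasattr(self._local, "obj"):
            self._local.obj = self._factory()
        return self._local.obj

def make_english_stemmer():
    import Stemmer  # PyStemmer
    return Stemmer.Stemmer("english")

def stem_batches(batches, per_thread, workers=4):
    """Stem each batch on a thread pool; every worker uses its own stemmer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda words: per_thread.get().stemWords(words), batches))
```

A call such as `stem_batches(batches, PerThread(make_english_stemmer))` then never shares a Stemmer instance between threads, so no locking is needed.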

Performance

Built on high-performance C extensions via Cython
Internal caching significantly improves performance for repeated words
Default cache size (10000) is a reasonable starting point for typical text processing
Cache can be tuned or disabled based on usage patterns
Reuse stemmer instances rather than creating new ones for each operation
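
Whether the cache helps depends on how often words repeat, so it is worth measuring on representative data. The sketch below uses hypothetical `time_stemming` and `compare_cache_settings` helpers; it relies only on stemWords and the maxCacheSize attribute described earlier, and makes no claim about the numbers it will produce.

```python
import time

def time_stemming(stemmer, words, repeats=5):
    """Return the best wall-clock time in seconds for stemming `words`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stemmer.stemWords(words)
        best = min(best, time.perf_counter() - start)
    return best

def compare_cache_settings(words):
    import Stemmer  # PyStemmer
    cached = Stemmer.Stemmer("english")    # default cache size
    uncached = Stemmer.Stemmer("english")
    uncached.maxCacheSize = 0              # disable the cache entirely
    return {
        "cached": time_stemming(cached, words),
        "uncached": time_stemming(uncached, words),
    }
```

Because the best of several runs is reported, the cached figure reflects a warm cache; the first pass over new vocabulary will be slower.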

Cache Behavior

The internal cache uses an LRU-style purging strategy:

When the number of cached entries exceeds maxCacheSize, the least recently used entries are purged
Purging keeps the most recently used entries, shrinking the cache to approximately 80% of maxCacheSize
Each word access updates its usage counter for LRU tracking
Cache lookups and updates happen automatically during stemming operations
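
The purging behavior described above can be modeled in a few lines of pure Python. This is a toy model for intuition only, not PyStemmer's actual implementation: the `StemCache` class, its `keep_ratio` parameter, and the tick-based usage counter are all assumptions.

```python
class StemCache:
    """Toy LRU-style cache with bulk purging (an illustration, not PyStemmer's code)."""

    def __init__(self, stem_fn, max_size=10000, keep_ratio=0.8):
        self.stem_fn = stem_fn
        self.max_size = max_size
        self.keep_ratio = keep_ratio
        self._entries = {}  # word -> [stem, tick of last use]
        self._tick = 0

    def stem(self, word):
        if self.max_size <= 0:  # cache disabled
            return self.stem_fn(word)
        self._tick += 1
        entry = self._entries.get(word)
        if entry is None:
            entry = self._entries[word] = [self.stem_fn(word), self._tick]
            if len(self._entries) > self.max_size:
                self._purge()
        else:
            entry[1] = self._tick  # refresh the usage counter on a hit
        return entry[0]

    def _purge(self):
        # Keep only the most recently used entries, about keep_ratio of max_size.
        keep = int(self.max_size * self.keep_ratio)
        newest = sorted(self._entries.items(), key=lambda kv: kv[1][1], reverse=True)
        self._entries = dict(newest[:keep])
```

Purging in bulk rather than evicting one entry at a time means the sort cost is paid only occasionally, at the price of temporarily dropping more entries than strictly necessary.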

Error Handling

KeyError: Raised when creating stemmer with unknown algorithm name
Use Stemmer.algorithms() to get list of valid algorithm names
Input encoding is handled automatically (UTF-8 strings and unicode objects supported)
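No exceptions are raised for empty strings, which are processed normally; None is not a documented input type for stemWord, so filter it out before stemming rather than relying on how it is handled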
Cache operations are transparent and do not raise exceptions under normal conditions
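
The KeyError case can be wrapped to give callers a more helpful message. In this sketch `create_stemmer` is a hypothetical helper; it takes the stemmer class and the algorithm-listing function as arguments so that it depends only on the documented behavior (KeyError for unknown names).

```python
def create_stemmer(algorithm, stemmer_cls, list_algorithms):
    """Build a stemmer, converting KeyError into a ValueError that lists valid names."""
    try:
        return stemmer_cls(algorithm)
    except KeyError:
        options = ", ".join(sorted(list_algorithms()))
        raise ValueError(
            f"Unknown stemming algorithm {algorithm!r}; choose one of: {options}"
        ) from None

def demo(name="english"):
    import Stemmer  # PyStemmer
    return create_stemmer(name, Stemmer.Stemmer, Stemmer.algorithms)
```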