High-performance Python interface to Snowball stemming algorithms for information retrieval and text processing.
npx @tessl/cli install tessl/pypi-pystemmer@3.0.0

PyStemmer provides access to efficient algorithms for calculating a "stemmed" form of a word by wrapping the libstemmer library from the Snowball project in a Python module. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing "cycling" given the query "cycles".
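The search scenario above can be sketched as a tiny inverted index keyed by stems. This is a minimal illustration, not part of PyStemmer: `build_index` and `search` are hypothetical helper names, and the `stem` argument is assumed to be any callable that maps a word to its stem (for instance the `stemWord` method of a PyStemmer instance).

```python
from collections import defaultdict


def build_index(docs, stem):
    """Map each stem to the set of document ids that contain it.

    `docs` is a dict of {doc_id: text}; `stem` is any word -> stem callable.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem):
    """Return the ids of documents containing every stemmed query term."""
    terms = [stem(word) for word in query.lower().split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], ()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# With PyStemmer installed, the helpers could be wired up like this:
#   import Stemmer
#   stem = Stemmer.Stemmer('english').stemWord
#   index = build_index({1: "cycling in the city"}, stem)
#   search(index, "cycles", stem)
```

Because both the indexed words and the query terms pass through the same stem function, "cycles" and "cycling" end up under the same key whenever the stemmer reduces them to the same stem.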
pip install pystemmer

import Stemmer
# Get list of available algorithms
algorithms = Stemmer.algorithms()
print(algorithms) # ['arabic', 'armenian', 'basque', 'catalan', ...]
# Create a stemmer instance for English
stemmer = Stemmer.Stemmer('english')
# Stem a single word
stemmed = stemmer.stemWord('cycling')
print(stemmed) # 'cycl'
# Stem multiple words
stems = stemmer.stemWords(['cycling', 'cyclist', 'cycles'])
print(stems) # ['cycl', 'cyclist', 'cycl']
# Configure cache size (default: 10000)
stemmer.maxCacheSize = 5000
# Disable cache entirely
stemmer.maxCacheSize = 0

PyStemmer wraps the libstemmer_c library through Cython extensions for high performance. The Stemmer class maintains internal state, including a cache that speeds up stemming of repeated words. Each stemmer instance is tied to a specific language algorithm and is not thread-safe; create separate instances for concurrent use.
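To make the idea of a bounded cache for repeated words concrete, here is a small sketch of a least-recently-used cache wrapped around a stem function. It is an illustration of the general technique only: `CachedStemFn` is a hypothetical class, and PyStemmer's own internal cache may be implemented and purged differently.

```python
from collections import OrderedDict


class CachedStemFn:
    """Wrap a word -> stem callable with a bounded LRU cache.

    Illustrative only; this is not PyStemmer's internal implementation.
    """

    def __init__(self, stem, max_size=10000):
        self._stem = stem
        self._cache = OrderedDict()
        self.max_size = max_size
        self.misses = 0  # number of calls that reached the wrapped function

    def __call__(self, word):
        if self.max_size <= 0:
            # A size of zero disables caching entirely.
            self.misses += 1
            return self._stem(word)
        if word in self._cache:
            self._cache.move_to_end(word)  # mark as most recently used
            return self._cache[word]
        self.misses += 1
        result = self._stem(word)
        self._cache[word] = result
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used
        return result
```

Repeated words are served from the dictionary without calling the wrapped function again, which is where the speed-up for typical text (with many repeated words) comes from.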
Query available stemming algorithms and get version information.
def algorithms(aliases=False):
    """
    Get a list of the names of the available stemming algorithms.

    Args:
        aliases (bool, optional): If False (default), returns only canonical
            algorithm names; if True, includes aliases

    Returns:
        list: List of strings containing algorithm names
    """

def version():
    """
    Get the version string of the stemming module.

    Note: This returns the internal libstemmer version (currently '2.0.1'),
    which may differ from the PyStemmer package version.

    Returns:
        str: Version string for the internal stemmer module
    """

Core stemming functionality with caching support for high performance.
class Stemmer:
    def __init__(self, algorithm, maxCacheSize=10000):
        """
        Initialize a stemmer for the specified algorithm.

        Args:
            algorithm (str): Name of stemming algorithm to use (from algorithms() list)
            maxCacheSize (int, optional): Maximum cache size, default 10000,
                set to 0 to disable cache

        Raises:
            KeyError: If algorithm not found
        """

    @property
    def maxCacheSize(self):
        """
        Maximum number of entries to allow in the cache.

        This may be set to zero to disable the cache entirely.
        The maximum cache size may be set at any point. Setting a smaller
        maximum size will trigger cache purging using an LRU-style algorithm
        that removes less recently used entries.

        Returns:
            int: Current maximum cache size
        """

    @maxCacheSize.setter
    def maxCacheSize(self, size):
        """
        Set maximum cache size.

        Args:
            size (int): New maximum size (0 disables cache completely)
        """

    def stemWord(self, word):
        """
        Stem a single word.

        Args:
            word (str or unicode): Word to stem, UTF-8 encoded string or unicode object

        Returns:
            str or unicode: Stemmed word (same type as input)
        """

    def stemWords(self, words):
        """
        Stem a sequence of words.

        Args:
            words (sequence): Sequence, iterator, or generator of words to stem

        Returns:
            list: List of stemmed words (preserves individual word encoding types)
        """

PyStemmer supports 25+ languages through the Snowball algorithms; Stemmer.algorithms() returns the full list of names.
Special algorithms:
Stemmer instances are not thread-safe. For concurrent processing:
- Create separate Stemmer instances for each thread

The stemmer code itself is re-entrant, so multiple instances can run concurrently without issues.
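One way to follow the one-instance-per-thread advice is to keep the stemmer in thread-local storage. The sketch below uses the standard library's `threading.local`; `PerThreadStemmer` is a hypothetical helper, and `factory` is assumed to be any zero-argument callable that returns a fresh stemmer (for example `lambda: Stemmer.Stemmer('english')`).

```python
import threading


class PerThreadStemmer:
    """Hand each thread its own stemmer, created lazily on first use."""

    def __init__(self, factory):
        self._factory = factory          # zero-argument callable
        self._local = threading.local()  # per-thread attribute storage

    def get(self):
        """Return the calling thread's stemmer, creating it if needed."""
        stemmer = getattr(self._local, "stemmer", None)
        if stemmer is None:
            stemmer = self._factory()
            self._local.stemmer = stemmer
        return stemmer


# With PyStemmer installed, a worker function might use it like this:
#   import Stemmer
#   stemmers = PerThreadStemmer(lambda: Stemmer.Stemmer('english'))
#   def worker(words):
#       return stemmers.get().stemWords(words)
```

Each thread pays the construction cost once and then reuses its own instance, so no stemmer object (or its cache) is ever shared between threads.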
The internal cache uses an LRU-style purging strategy:
- When the cache exceeds maxCacheSize, older entries are purged

KeyError is raised when creating a stemmer with an unknown algorithm name. Use Stemmer.algorithms() to get the list of valid algorithm names.
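Checking a requested name against the list of valid algorithm names before constructing a stemmer avoids the KeyError in the first place. The following is a minimal sketch: `pick_algorithm` is a hypothetical helper, the `available` argument is assumed to be the list returned by `Stemmer.algorithms()`, and falling back to a default language is a design choice made here for illustration, not something PyStemmer does itself.

```python
def pick_algorithm(requested, available, default="english"):
    """Return a valid algorithm name, falling back to `default`.

    `available` is the collection of valid names, e.g. the result of
    Stemmer.algorithms(). Raises KeyError if neither the requested name
    nor the default is available.
    """
    name = requested.strip().lower()  # normalise user input (our choice)
    if name in available:
        return name
    if default in available:
        return default
    raise KeyError(f"no usable stemming algorithm for {requested!r}")


# With PyStemmer installed, usage might look like this:
#   import Stemmer
#   name = pick_algorithm(user_language, Stemmer.algorithms(aliases=True))
#   stemmer = Stemmer.Stemmer(name)
```

If silently falling back is not desirable, the same check can simply report the valid names to the user instead of substituting a default.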