High-performance Python interface to Snowball stemming algorithms for information retrieval and text processing.
npx @tessl/cli install tessl/pypi-pystemmer@3.0.00
# PyStemmer

PyStemmer provides access to efficient algorithms for calculating a "stemmed" form of a word by wrapping the libstemmer library from the Snowball project in a Python module. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing "cycling" given the query "cycles".

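To make the search example concrete, here is a minimal sketch of a stemmed inverted index. The helper names `build_index` and `search` are hypothetical and not part of PyStemmer, and the whitespace tokenizer is a simplifying assumption. The helpers accept any callable that maps a list of words to a list of stems, so they work with the `stemWords` method of a `Stemmer.Stemmer` instance.

```python
from collections import defaultdict


def build_index(documents, stem_words):
    """Map each stem to the set of document ids whose text contains it.

    `documents` is a dict of {doc_id: text}. `stem_words` is any callable
    that turns a list of words into a list of stems (hypothetical helper
    signature, chosen for this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for stem in stem_words(text.lower().split()):
            index[stem].add(doc_id)
    return index


def search(index, query, stem_words):
    """Return ids of documents that share at least one stem with the query."""
    hits = set()
    for stem in stem_words(query.lower().split()):
        hits |= index.get(stem, set())
    return hits


# Usage with PyStemmer installed:
#     import Stemmer
#     stem_words = Stemmer.Stemmer('english').stemWords
#     index = build_index({1: "I enjoy cycling"}, stem_words)
#     search(index, "cycles", stem_words)
```

Because both the documents and the query pass through the same stemmer, "cycling" and "cycles" meet at a shared stem and the document is found.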
## Package Information

- **Package Name**: PyStemmer
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install pystemmer`
- **License**: MIT, BSD

## Core Imports

```python
import Stemmer
```

## Basic Usage

```python
import Stemmer

# Get list of available algorithms
algorithms = Stemmer.algorithms()
print(algorithms)  # ['arabic', 'armenian', 'basque', 'catalan', ...]

# Create a stemmer instance for English
stemmer = Stemmer.Stemmer('english')

# Stem a single word
stemmed = stemmer.stemWord('cycling')
print(stemmed)  # 'cycl'

# Stem multiple words
stems = stemmer.stemWords(['cycling', 'cyclist', 'cycles'])
print(stems)  # ['cycl', 'cyclist', 'cycl']

# Configure cache size (default: 10000)
stemmer.maxCacheSize = 5000

# Disable cache entirely
stemmer.maxCacheSize = 0
```

## Architecture

PyStemmer wraps the libstemmer_c library through Cython extensions for high performance. The `Stemmer` class maintains internal state, including a cache that speeds up repeated words. Each stemmer instance is tied to a single language algorithm and is not thread-safe; create a separate instance per thread for concurrent use.

## Capabilities

### Algorithm Discovery

Query available stemming algorithms and get version information.

```python { .api }
def algorithms(aliases=False):
    """
    Get a list of the names of the available stemming algorithms.

    Args:
        aliases (bool, optional): If False (default), returns only canonical
            algorithm names; if True, includes aliases

    Returns:
        list: List of strings containing algorithm names
    """

def version():
    """
    Get the version string of the stemming module.

    Note: This returns the internal libstemmer version (currently '2.0.1'),
    which may differ from the PyStemmer package version.

    Returns:
        str: Version string for the internal stemmer module
    """
```

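One way to use the discovery function is to validate a user-supplied language name before constructing a stemmer. The sketch below is illustrative only: `resolve_algorithm` is a hypothetical helper, and the normalization (trimming and lowercasing) plus the `"english"` default are assumptions rather than PyStemmer behavior.

```python
def resolve_algorithm(requested, available, default="english"):
    """Pick an algorithm name that is present in `available`.

    `requested` is trimmed and lowercased before the lookup. If it is still
    unknown, `default` is returned when it is available; otherwise a
    ValueError is raised. (Hypothetical helper, not part of PyStemmer.)
    """
    name = requested.strip().lower()
    if name in available:
        return name
    if default in available:
        return default
    raise ValueError(f"no stemming algorithm for {requested!r}")


# Usage with PyStemmer installed:
#     import Stemmer
#     names = Stemmer.algorithms(aliases=True)
#     stemmer = Stemmer.Stemmer(resolve_algorithm(" French ", names))
```

Passing `aliases=True` widens the set of accepted names, which is convenient when the name comes from user input.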
### Stemmer Class

Core stemming functionality with caching support for high performance.

```python { .api }
class Stemmer:
    def __init__(self, algorithm, maxCacheSize=10000):
        """
        Initialize a stemmer for the specified algorithm.

        Args:
            algorithm (str): Name of stemming algorithm to use (from algorithms() list)
            maxCacheSize (int, optional): Maximum cache size, default 10000,
                set to 0 to disable cache

        Raises:
            KeyError: If algorithm not found
        """

    @property
    def maxCacheSize(self):
        """
        Maximum number of entries to allow in the cache.

        This may be set to zero to disable the cache entirely.
        The maximum cache size may be set at any point. Setting a smaller
        maximum size will trigger cache purging using an LRU-style algorithm
        that removes less recently used entries.

        Returns:
            int: Current maximum cache size
        """

    @maxCacheSize.setter
    def maxCacheSize(self, size):
        """
        Set maximum cache size.

        Args:
            size (int): New maximum size (0 disables cache completely)
        """

    def stemWord(self, word):
        """
        Stem a single word.

        Args:
            word (str or bytes): Word to stem, either a Unicode string or a
                UTF-8 encoded byte string

        Returns:
            str or bytes: Stemmed word (same type as input)
        """

    def stemWords(self, words):
        """
        Stem a sequence of words.

        Args:
            words (sequence): Sequence, iterator, or generator of words to stem

        Returns:
            list: List of stemmed words (each stem has the same type as its input word)
        """
```

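Since `stemWords` accepts iterators and generators, tokens can be streamed into it without building an intermediate list. The following sketch counts stem frequencies over lines of text. The names `iter_tokens` and `stem_frequencies` and the letters-only tokenizer are assumptions made for illustration; `stemmer` can be any object exposing a `stemWords` method.

```python
import re
from collections import Counter

# Runs of letters only; a deliberately simple tokenizer for this sketch.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def iter_tokens(lines):
    """Yield lowercase word tokens from an iterable of text lines."""
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            yield match.group().lower()


def stem_frequencies(lines, stemmer):
    """Count how often each stem occurs across the given lines.

    `stemmer` is any object with a `stemWords` method, such as a
    `Stemmer.Stemmer` instance. The token generator is passed to it
    directly, so the full token list is never materialized here.
    """
    return Counter(stemmer.stemWords(iter_tokens(lines)))


# Usage with PyStemmer installed:
#     import Stemmer
#     with open("corpus.txt", encoding="utf-8") as handle:
#         counts = stem_frequencies(handle, Stemmer.Stemmer('english'))
```

Reusing a single stemmer for the whole corpus lets its cache pay off on frequently repeated words.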
## Supported Languages

PyStemmer supports 25+ languages through the Snowball algorithms:

- **European**: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Romanian, Hungarian, Greek, Catalan, Basque, Irish, Lithuanian
- **Middle Eastern**: Arabic, Turkish
- **Asian**: Hindi, Indonesian, Nepali, Tamil
- **Other**: Yiddish, Serbian, Armenian, Esperanto

Special algorithms:
- **porter**: Classic Porter stemming algorithm for English (for research/compatibility)
- **english**: Improved Snowball English algorithm (recommended for most users)

## Thread Safety

Stemmer instances are **not thread-safe**. For concurrent processing:

1. **Recommended**: Create separate `Stemmer` instances for each thread
2. **Alternative**: Use threading locks to protect shared stemmer access (may reduce performance)

The stemmer code itself is re-entrant, so multiple instances can run concurrently without issues.

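The recommended per-thread approach can be sketched with `threading.local`, which gives each worker thread its own lazily created stemmer. The helper names (`get_stemmer`, `stem_batches`) and the `factory` parameter are hypothetical; `factory` is assumed to be a zero-argument callable that returns a new stemmer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One slot per thread. This sketch assumes a single factory per process.
_local = threading.local()


def get_stemmer(factory):
    """Return the calling thread's stemmer, creating it on first use."""
    stemmer = getattr(_local, "stemmer", None)
    if stemmer is None:
        stemmer = _local.stemmer = factory()
    return stemmer


def stem_batches(batches, factory, workers=4):
    """Stem each batch of words in a thread pool, preserving batch order."""
    def work(batch):
        # Each worker thread only ever touches its own stemmer instance.
        return get_stemmer(factory).stemWords(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, batches))


# Usage with PyStemmer installed:
#     import Stemmer
#     stem_batches([["cycling"], ["cycles"]],
#                  lambda: Stemmer.Stemmer('english'))
```

Each thread keeps its own cache under this design, so memory use grows with the number of workers; a shared instance behind a lock trades that memory for contention.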
## Performance

- Built on high-performance C extensions via Cython
- Internal caching significantly improves performance for repeated words
- Default cache size (10000) is optimized for typical text processing
- Cache can be tuned or disabled based on usage patterns
- Reuse stemmer instances rather than creating new ones for each operation

### Cache Behavior

The internal cache uses an LRU-style purging strategy:
- When cache size exceeds `maxCacheSize`, older entries are purged
- Purging retains approximately 80% of the most recently used entries
- Each word access updates its usage counter for LRU tracking
- Cache lookups and updates happen automatically during stemming operations

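The purging strategy described above can be modeled in a few lines of pure Python. This is a toy model written from the bullet points, not PyStemmer's actual implementation: the class name `CountingCache`, its methods, and the exact `8 // 10` retention arithmetic are assumptions chosen to match the "approximately 80%" description.

```python
class CountingCache:
    """Toy model of a counter-based, LRU-style cache (not PyStemmer's code)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._counter = 0   # increases on every access
        self._entries = {}  # word -> [stem, counter value at last access]

    def get(self, word):
        """Return the cached stem, refreshing its usage counter, or None."""
        entry = self._entries.get(word)
        if entry is None:
            return None
        self._counter += 1
        entry[1] = self._counter
        return entry[0]

    def put(self, word, stem):
        """Store a stem; purge older entries if the cache grows too large."""
        if self.max_size == 0:
            return  # a size of zero disables caching
        self._counter += 1
        self._entries[word] = [stem, self._counter]
        if len(self._entries) > self.max_size:
            self._purge()

    def _purge(self):
        """Keep roughly the 80% most recently used entries."""
        keep = self.max_size * 8 // 10
        newest_first = sorted(
            self._entries.items(), key=lambda item: item[1][1], reverse=True
        )
        self._entries = dict(newest_first[:keep])

    def __len__(self):
        return len(self._entries)
```

Purging down to about 80% rather than evicting one entry at a time means the sort runs only occasionally, which keeps the common lookup path cheap.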
## Error Handling

- `KeyError`: Raised when creating a stemmer with an unknown algorithm name
- Use `Stemmer.algorithms()` to get the list of valid algorithm names
- Input encoding is handled automatically (both Unicode strings and UTF-8 encoded byte strings are supported)
- Empty strings are stemmed without raising an exception; `None` and other non-string inputs are outside the documented interface, so validate them before stemming
- Cache operations are transparent and do not raise exceptions under normal conditions
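
A common way to handle the `KeyError` case is to degrade gracefully to unstemmed words when no algorithm exists for a language. The sketch below assumes that policy; `stemmer_or_none` and `stem_if_possible` are hypothetical helpers, and `constructor` stands in for `Stemmer.Stemmer` so that the logic can be exercised without the package installed.

```python
def stemmer_or_none(algorithm, constructor):
    """Build a stemmer, or return None if the algorithm name is unknown.

    `constructor` is expected to raise KeyError for unknown names, as
    `Stemmer.Stemmer` is documented to do.
    """
    try:
        return constructor(algorithm)
    except KeyError:
        return None


def stem_if_possible(words, stemmer):
    """Stem `words` when a stemmer is available; otherwise return them as-is."""
    words = list(words)
    if stemmer is None:
        return words
    return stemmer.stemWords(words)


# Usage with PyStemmer installed:
#     import Stemmer
#     stemmer = stemmer_or_none("klingon", Stemmer.Stemmer)  # None
#     stem_if_possible(["cycling"], stemmer)                 # ["cycling"]
```

Returning the words unchanged keeps indexing and querying consistent for unsupported languages, at the cost of losing stem-based matching there.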