# Snowballstemmer

A Python stemming library providing 32 stemmers for 30 languages, generated from the Snowball algorithms. It reduces words to their stems for improved search and analysis, and suits multilingual text processing systems, search engines, and data analysis pipelines.

## Package Information

- **Package Name**: snowballstemmer
- **Language**: Python
- **Installation**: `pip install snowballstemmer`
- **License**: BSD-3-Clause
- **Version**: 3.0.1

## Core Imports

```python
import snowballstemmer
```

Or import the public API functions directly:

```python
from snowballstemmer import algorithms, stemmer
```

## Basic Usage

```python
import snowballstemmer

# Get list of available languages
languages = snowballstemmer.algorithms()
print(f"Available languages: {len(languages)} total")

# Create a stemmer for English
stemmer = snowballstemmer.stemmer('english')

# Stem individual words
stemmed = stemmer.stemWord('running')
print(f"running -> {stemmed}")  # prints: running -> run

# Stem multiple words at once
words = ['running', 'connected', 'connections', 'easily']
stemmed_words = stemmer.stemWords(words)
print(f"Original: {words}")
print(f"Stemmed: {stemmed_words}")

# Use different languages
french_stemmer = snowballstemmer.stemmer('french')
spanish_stemmer = snowballstemmer.stemmer('spanish')

print(f"French: 'connexions' -> {french_stemmer.stemWord('connexions')}")
print(f"Spanish: 'corriendo' -> {spanish_stemmer.stemWord('corriendo')}")
```

## Capabilities

### Language Discovery

Retrieve the names of the available stemming algorithms.

```python { .api }
def algorithms():
    """
    Get list of available stemming algorithm names.

    Returns:
        list: Algorithm names as strings (for example 'english')

    Note:
        Returns the C extension's algorithms if the Stemmer C extension
        is available, otherwise the pure Python algorithms.
    """
```

### Stemmer Factory

Create stemmer instances for specific languages with automatic fallback between C extension and pure Python implementations.

```python { .api }
def stemmer(lang):
    """
    Create a stemmer instance for the specified language.

    Parameters:
        lang (str): Language code for desired stemming algorithm.
            Supports multiple formats: 'english', 'en', 'eng'

    Returns:
        Stemmer: A stemmer instance with stemWord() and stemWords() methods

    Raises:
        KeyError: If stemming algorithm for language not found

    Note:
        Automatically uses C extension (Stemmer.Stemmer) if available,
        otherwise falls back to pure Python implementation.
    """
```
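Code that stems many small batches may prefer to reuse one stemmer per language instead of calling the factory repeatedly. The following is a minimal sketch of that idea, assuming each `stemmer()` call constructs a fresh instance; `make_cached_factory` is a hypothetical helper, not part of the package.

```python
from functools import lru_cache


def make_cached_factory(factory):
    """Wrap a stemmer factory so each language is built at most once.

    `factory` is any callable taking a language name, for example
    snowballstemmer.stemmer. This helper is illustrative only.
    """

    @lru_cache(maxsize=None)
    def _build(name):
        # A KeyError raised by the factory propagates and is not cached.
        return factory(name)

    def get_stemmer(lang):
        # Normalise case so 'English' and 'english' share one cache entry.
        return _build(lang.lower())

    return get_stemmer
```

With the package installed, `get_stemmer = make_cached_factory(snowballstemmer.stemmer)` would give a drop-in replacement for the factory call.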

### Word Stemming

Core stemming functionality for reducing words to their base forms. The stemmer instances returned by `stemmer()` provide these methods:

```python { .api }
# Stemmer instance methods (available on returned stemmer objects)
def stemWord(word):
    """
    Stem a single word to its base form.

    Parameters:
        word (str): Word to stem

    Returns:
        str: Stemmed word
    """

def stemWords(words):
    """
    Stem multiple words to their base forms.

    Parameters:
        words (list): List of words to stem

    Returns:
        list: List of stemmed words in same order
    """
```
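Because `stemWords()` returns its results in input order, the output can be zipped back against the original words. As a rough sketch of that, the hypothetical helper below groups surface forms by their stem; it accepts any callable with the `stemWords` shape, so it does not depend on a particular stemmer.

```python
from collections import defaultdict


def group_by_stem(words, stem_words):
    """Map each stem to the original words that reduce to it.

    `stem_words` is a callable such as stemmer.stemWords that returns
    one stem per input word, in the same order.
    """
    groups = defaultdict(list)
    for word, stem in zip(words, stem_words(words)):
        groups[stem].append(word)
    return dict(groups)
```

For example, `group_by_stem(words, stemmer.stemWords)` would collect all words that share a stem under a single key.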

## Supported Languages

The package provides 32 stemming algorithms covering 30 languages. Most algorithms can also be referred to by short aliases, shown in parentheses below.

### Primary Languages (30 total)

- **arabic** (ar, ara): Arabic stemming
- **armenian** (hy, hye, arm): Armenian stemming
- **basque** (eu, eus, baq): Basque stemming
- **catalan** (ca, cat): Catalan stemming
- **danish** (da, dan): Danish stemming
- **dutch** (nl, dut, nld, kraaij_pohlmann): Dutch stemming (Kraaij-Pohlmann algorithm)
- **english** (en, eng): English stemming
- **esperanto** (eo, epo): Esperanto stemming
- **estonian** (et, est): Estonian stemming
- **finnish** (fi, fin): Finnish stemming
- **french** (fr, fre, fra): French stemming
- **german** (de, ger, deu): German stemming
- **greek** (el, gre, ell): Greek stemming
- **hindi** (hi, hin): Hindi stemming
- **hungarian** (hu, hun): Hungarian stemming
- **indonesian** (id, ind): Indonesian stemming
- **irish** (ga, gle): Irish stemming
- **italian** (it, ita): Italian stemming
- **lithuanian** (lt, lit): Lithuanian stemming
- **nepali** (ne, nep): Nepali stemming
- **norwegian** (no, nor): Norwegian stemming
- **portuguese** (pt, por): Portuguese stemming
- **romanian** (ro, rum, ron): Romanian stemming
- **russian** (ru, rus): Russian stemming
- **serbian** (sr, srp): Serbian stemming
- **spanish** (es, esl, spa): Spanish stemming
- **swedish** (sv, swe): Swedish stemming
- **tamil** (ta, tam): Tamil stemming
- **turkish** (tr, tur): Turkish stemming
- **yiddish** (yi, yid): Yiddish stemming

### Algorithm Variants (2 additional)

- **porter**: Traditional Porter algorithm for English (variant of english)
- **dutch_porter**: Martin Porter's Dutch stemmer (variant of dutch)

## Character Encoding Support

- **UTF-8**: All 32 languages/variants
- **ISO-8859-1**: basque, catalan, danish, dutch, english, finnish, french, german, indonesian, irish, italian, norwegian, portuguese, spanish, swedish, porter, dutch_porter
- **ISO-8859-2**: hungarian
- **KOI8-R**: russian
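In Python code the stemming methods take `str` values, so the encodings above mostly matter when input arrives as raw bytes. A minimal sketch, assuming the caller knows the source encoding: decode first, then stem. `stem_encoded` is a hypothetical helper and takes the stemming callable as a parameter.

```python
def stem_encoded(raw_words, encoding, stem_words):
    """Decode byte strings with a known encoding, then stem them.

    `encoding` is a Python codec name such as 'iso-8859-1' or 'koi8-r';
    `stem_words` is a callable such as stemmer.stemWords.
    """
    decoded = [raw.decode(encoding) for raw in raw_words]
    return stem_words(decoded)
```

For instance, `stem_encoded(raw, 'koi8-r', russian_stemmer.stemWords)` would handle KOI8-R input, where `russian_stemmer` is assumed to come from `snowballstemmer.stemmer('russian')`.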

## Performance Optimization

The library automatically uses the C extension when it is available, which gives a significant performance improvement. The `algorithms()` and `stemmer()` functions transparently choose the best available implementation:

```python
import snowballstemmer

# Automatically uses C extension if available, pure Python otherwise
stemmer = snowballstemmer.stemmer('english')

# Both implementations provide identical API
# C extension: faster performance
# Pure Python: broader compatibility
```
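To find out which implementation would be picked, one option is to check whether the C extension module can be imported. This is a rough sketch using only the standard library; it assumes the extension is exposed as a top-level module named `Stemmer`, as the `Stemmer.Stemmer` reference in the factory notes suggests, and `has_c_extension` is a hypothetical helper.

```python
import importlib.util


def has_c_extension():
    """Return True when a top-level module named 'Stemmer' is importable.

    The module is located but not imported, so the check has no side
    effects on the running program.
    """
    return importlib.util.find_spec("Stemmer") is not None


print("C extension available:", has_c_extension())
```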

## Error Handling

```python
import snowballstemmer

try:
    # Invalid language code
    stemmer = snowballstemmer.stemmer('klingon')
except KeyError as e:
    print(f"Language not supported: {e}")

# Safe language checking
available_langs = snowballstemmer.algorithms()
if 'german' in available_langs:
    german_stemmer = snowballstemmer.stemmer('german')
else:
    print("German stemming not available")
```

## Advanced Usage Examples

### Batch Processing

```python
import snowballstemmer

def process_multilingual_text(text_dict):
    """Process text in multiple languages."""
    results = {}

    for lang, words in text_dict.items():
        try:
            stemmer = snowballstemmer.stemmer(lang)
            results[lang] = stemmer.stemWords(words)
        except KeyError:
            print(f"Warning: Language '{lang}' not supported")
            results[lang] = words  # Return original words

    return results

# Example usage
texts = {
    'english': ['running', 'connection', 'easily'],
    'french': ['connexions', 'facilement', 'courant'],
    'spanish': ['corriendo', 'conexión', 'fácilmente']
}

stemmed_results = process_multilingual_text(texts)
for lang, words in stemmed_results.items():
    print(f"{lang}: {words}")
```

### Search Index Preparation

```python
import re

import snowballstemmer

class SearchIndexer:
    def __init__(self, language='english'):
        self.stemmer = snowballstemmer.stemmer(language)
        self.word_pattern = re.compile(r'\b\w+\b')

    def index_document(self, text):
        """Extract and stem words from document text."""
        words = self.word_pattern.findall(text.lower())
        return self.stemmer.stemWords(words)

    def normalize_query(self, query):
        """Normalize search query for matching."""
        words = self.word_pattern.findall(query.lower())
        return self.stemmer.stemWords(words)

# Example usage
indexer = SearchIndexer('english')
document = "The quick brown foxes are running through the connected fields"
query = "quick brown fox running connections"

doc_terms = indexer.index_document(document)
query_terms = indexer.normalize_query(query)

print(f"Document terms: {doc_terms}")
print(f"Query terms: {query_terms}")
# 'running' matches via the stem 'run'; 'connected' and 'connections'
# share the stem 'connect', so the query term matches the document
```
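Once document and query are reduced to stems, matching becomes a set comparison. The helper below is one possible way to score a query against a document, shown as a sketch only: `query_coverage` is a hypothetical name, and it works on plain lists of stems such as `doc_terms` and `query_terms` from the example above.

```python
def query_coverage(doc_terms, query_terms):
    """Return the fraction of distinct query stems found in the document."""
    doc_set = set(doc_terms)
    unique_query = set(query_terms)
    if not unique_query:
        # An empty query matches nothing; avoid dividing by zero.
        return 0.0
    return len(unique_query & doc_set) / len(unique_query)
```

A score of 1.0 means every query stem occurs in the document. Real search engines usually weight terms as well, which this sketch leaves out.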