Tessl Tile for pypi/snowballstemmer@3.0.0

tessl/pypi-snowballstemmer

Python stemming library providing 32 stemmers for 30 languages generated from Snowball algorithms.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/snowballstemmer@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-snowballstemmer@3.0.0

# Snowballstemmer

A Python stemming library providing 32 stemmers for 30 languages, generated from Snowball algorithms. It reduces words to their base forms for search and text analysis, and suits multilingual text processing systems, search engines, and data analysis pipelines.

## Package Information

- **Package Name**: snowballstemmer
- **Language**: Python
- **Installation**: `pip install snowballstemmer`
- **License**: BSD-3-Clause
- **Version**: 3.0.1

## Core Imports

```python
import snowballstemmer
```

Access the public API functions:

```python
from snowballstemmer import algorithms, stemmer
```

## Basic Usage

```python
import snowballstemmer

# Get list of available languages
languages = snowballstemmer.algorithms()
print(f"Available languages: {len(languages)} total")

# Create a stemmer for English
stemmer = snowballstemmer.stemmer('english')

# Stem individual words
stemmed = stemmer.stemWord('running')
print(f"running -> {stemmed}")  # prints: running -> run

# Stem multiple words at once
words = ['running', 'connected', 'connections', 'easily']
stemmed_words = stemmer.stemWords(words)
print(f"Original: {words}")
print(f"Stemmed: {stemmed_words}")

# Use different languages
french_stemmer = snowballstemmer.stemmer('french')
spanish_stemmer = snowballstemmer.stemmer('spanish')

print(f"French: 'connexions' -> {french_stemmer.stemWord('connexions')}")
print(f"Spanish: 'corriendo' -> {spanish_stemmer.stemWord('corriendo')}")
```

## Capabilities

### Language Discovery

Retrieve the names of the available stemming algorithms.

```python { .api }
def algorithms():
    """
    Get list of available stemming algorithm names.

    Returns:
        list: List of strings naming the available algorithms (for example 'english')

    Note:
        Returns the C extension's algorithms if the Stemmer C extension is available,
        otherwise the pure Python algorithms.
    """
```

### Stemmer Factory

Create stemmer instances for specific languages with automatic fallback between C extension and pure Python implementations.

```python { .api }
def stemmer(lang):
    """
    Create a stemmer instance for the specified language.

    Parameters:
        lang (str): Language code for desired stemming algorithm.
                   Supports multiple formats: 'english', 'en', 'eng'

    Returns:
        Stemmer: A stemmer instance with stemWord() and stemWords() methods

    Raises:
        KeyError: If stemming algorithm for language not found

    Note:
        Automatically uses C extension (Stemmer.Stemmer) if available,
        otherwise falls back to pure Python implementation.
    """
```
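
Because `stemmer()` raises `KeyError` for an unknown algorithm, one option is to fall back to a default language. The sketch below is illustrative only: `stemmer_or_fallback` and its `factory` parameter are hypothetical names, and `factory` is assumed to be `snowballstemmer.stemmer` or any callable that behaves the same way.

```python
def stemmer_or_fallback(factory, lang, fallback="english"):
    """Create a stemmer for `lang`, or for `fallback` if `lang` is unknown.

    `factory` is assumed to behave like snowballstemmer.stemmer:
    it returns a stemmer object or raises KeyError.
    Returns a (stemmer, algorithm_name) pair.
    """
    try:
        return factory(lang), lang
    except KeyError:
        # Unknown algorithm name: retry with the fallback language.
        # A KeyError for the fallback itself is left to propagate.
        return factory(fallback), fallback
```

With the package installed, `stemmer_or_fallback(snowballstemmer.stemmer, 'klingon')` would return an English stemmer together with the name `'english'`, so callers can log which algorithm was actually used.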

### Word Stemming

Core stemming functionality for reducing words to their base forms. The stemmer instances returned by `stemmer()` provide these methods:

```python { .api }
# Stemmer instance methods (available on returned stemmer objects)
def stemWord(word):
    """
    Stem a single word to its base form.

    Parameters:
        word (str): Word to stem

    Returns:
        str: Stemmed word
    """

def stemWords(words):
    """
    Stem multiple words to their base forms.

    Parameters:
        words (list): List of words to stem

    Returns:
        list: List of stemmed words in same order
    """
```
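
Real text repeats the same words often, so caching stems can avoid repeated work. The following is a minimal sketch, assuming a stemmer object that exposes the `stemWord()` method described above; `make_cached_stem` and its `maxsize` default are hypothetical and not part of snowballstemmer.

```python
from functools import lru_cache


def make_cached_stem(stemmer, maxsize=10000):
    """Return a function that stems one word and remembers earlier results.

    `stemmer` is assumed to expose stemWord(word) -> str.
    """
    @lru_cache(maxsize=maxsize)
    def cached_stem(word):
        return stemmer.stemWord(word)

    return cached_stem
```

With the package installed, `stem = make_cached_stem(snowballstemmer.stemmer('english'))` gives a callable for which repeated `stem('running')` calls reach the stemmer only once; `stem.cache_info()` reports hits and misses.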

## Supported Languages

Snowball stemmer provides 32 stemming algorithms across 30 languages: one primary algorithm per language plus 2 variants. Most primary algorithms also have short aliases.

### Primary Languages (30 total)

- **arabic** (ar, ara): Arabic stemming
- **armenian** (hy, hye, arm): Armenian stemming
- **basque** (eu, eus, baq): Basque stemming
- **catalan** (ca, cat): Catalan stemming
- **danish** (da, dan): Danish stemming
- **dutch** (nl, dut, nld, kraaij_pohlmann): Dutch stemming (Kraaij-Pohlmann algorithm)
- **english** (en, eng): English stemming
- **esperanto** (eo, epo): Esperanto stemming
- **estonian** (et, est): Estonian stemming
- **finnish** (fi, fin): Finnish stemming
- **french** (fr, fre, fra): French stemming
- **german** (de, ger, deu): German stemming
- **greek** (el, gre, ell): Greek stemming
- **hindi** (hi, hin): Hindi stemming
- **hungarian** (hu, hun): Hungarian stemming
- **indonesian** (id, ind): Indonesian stemming
- **irish** (ga, gle): Irish stemming
- **italian** (it, ita): Italian stemming
- **lithuanian** (lt, lit): Lithuanian stemming
- **nepali** (ne, nep): Nepali stemming
- **norwegian** (no, nor): Norwegian stemming
- **portuguese** (pt, por): Portuguese stemming
- **romanian** (ro, rum, ron): Romanian stemming
- **russian** (ru, rus): Russian stemming
- **serbian** (sr, srp): Serbian stemming
- **spanish** (es, esl, spa): Spanish stemming
- **swedish** (sv, swe): Swedish stemming
- **tamil** (ta, tam): Tamil stemming
- **turkish** (tr, tur): Turkish stemming
- **yiddish** (yi, yid): Yiddish stemming

### Algorithm Variants (2 additional)

- **porter**: Traditional Porter algorithm for English (variant of english)
- **dutch_porter**: Martin Porter's Dutch stemmer (variant of dutch)

## Character Encoding Support

- **UTF-8**: All 32 languages/variants
- **ISO-8859-1**: basque, catalan, danish, dutch, english, finnish, french, german, indonesian, irish, italian, norwegian, portuguese, spanish, swedish, porter, dutch_porter
- **ISO-8859-2**: hungarian
- **KOI8-R**: russian
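
The method signatures above take `str` values, so text that arrives as encoded bytes has to be decoded before stemming. The sketch below shows one way to do that; `stem_encoded` is a hypothetical helper, and it assumes a stemmer object whose `stemWords()` accepts a list of `str`.

```python
def stem_encoded(stemmer, raw_words, encoding="iso-8859-1"):
    """Decode any bytes items with `encoding`, then stem the whole list.

    Items that are already str are passed through unchanged.
    `stemmer` is assumed to expose stemWords(list_of_str) -> list_of_str.
    """
    decoded = [
        word.decode(encoding) if isinstance(word, bytes) else word
        for word in raw_words
    ]
    return stemmer.stemWords(decoded)
```

For Russian text stored as KOI8-R, the same helper could be called with `encoding="koi8-r"`; the result is always a list of `str`.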

## Performance Optimization

The library automatically uses the C extension (the `Stemmer` module) when it is available, which is faster than the pure Python implementation. The `algorithms()` and `stemmer()` functions transparently choose the best available implementation:

```python
import snowballstemmer

# Automatically uses C extension if available, pure Python otherwise
stemmer = snowballstemmer.stemmer('english')

# Both implementations provide identical API
# C extension: faster performance
# Pure Python: broader compatibility
```
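
To see which implementation is likely to be picked in a given environment, one option is to check whether the C extension can be found. This sketch uses only the standard library; `c_extension_available` is a hypothetical helper, and it assumes the extension is importable under the module name `Stemmer`, as the `Stemmer.Stemmer` reference above suggests.

```python
import importlib.util


def c_extension_available(module_name="Stemmer"):
    """Return True if `module_name` can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


backend = "C extension" if c_extension_available() else "pure Python"
print(f"Expected backend: {backend}")
```

This only reports whether the module is installed; it does not measure how much faster either implementation is.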

## Error Handling

```python
import snowballstemmer

try:
    # Invalid language code
    stemmer = snowballstemmer.stemmer('klingon')
except KeyError as e:
    print(f"Language not supported: {e}")

# Safe language checking
available_langs = snowballstemmer.algorithms()
if 'german' in available_langs:
    german_stemmer = snowballstemmer.stemmer('german')
else:
    print("German stemming not available")
```

## Advanced Usage Examples

### Batch Processing

```python
import snowballstemmer

def process_multilingual_text(text_dict):
    """Process text in multiple languages."""
    results = {}

    for lang, words in text_dict.items():
        try:
            stemmer = snowballstemmer.stemmer(lang)
            results[lang] = stemmer.stemWords(words)
        except KeyError:
            print(f"Warning: Language '{lang}' not supported")
            results[lang] = words  # Return original words

    return results

# Example usage
texts = {
    'english': ['running', 'connection', 'easily'],
    'french': ['connexions', 'facilement', 'courant'],
    'spanish': ['corriendo', 'conexión', 'fácilmente']
}

stemmed_results = process_multilingual_text(texts)
for lang, words in stemmed_results.items():
    print(f"{lang}: {words}")
```

### Search Index Preparation

```python
import snowballstemmer
import re

class SearchIndexer:
    def __init__(self, language='english'):
        self.stemmer = snowballstemmer.stemmer(language)
        self.word_pattern = re.compile(r'\b\w+\b')

    def index_document(self, text):
        """Extract and stem words from document text."""
        words = self.word_pattern.findall(text.lower())
        return self.stemmer.stemWords(words)

    def normalize_query(self, query):
        """Normalize search query for matching."""
        words = self.word_pattern.findall(query.lower())
        return self.stemmer.stemWords(words)

# Example usage
indexer = SearchIndexer('english')
document = "The quick brown foxes are running through the connected fields"
query = "quick brown fox running connections"

doc_terms = indexer.index_document(document)
query_terms = indexer.normalize_query(query)

print(f"Document terms: {doc_terms}")
print(f"Query terms: {query_terms}")
# 'running' stems to 'run', and 'connected'/'connections' both stem to 'connect',
# so these query terms match the document terms
```
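
Once the document and the query have been stemmed the same way, matching comes down to comparing two term lists. The sketch below shows one simple way to do that; `matched_terms` and `match_ratio` are hypothetical helpers, and the sample lists are hand-written stand-ins for `stemWords()` output rather than captured results.

```python
def matched_terms(doc_terms, query_terms):
    """Return the query terms that also occur in the document, in query order."""
    doc_set = set(doc_terms)  # set lookup keeps each membership test cheap
    return [term for term in query_terms if term in doc_set]


def match_ratio(doc_terms, query_terms):
    """Fraction of query terms found in the document (0.0 for an empty query)."""
    if not query_terms:
        return 0.0
    return len(matched_terms(doc_terms, query_terms)) / len(query_terms)


# Hand-written stand-ins for stemmed terms, for illustration only
doc_terms = ["quick", "brown", "run", "connect", "field"]
query_terms = ["quick", "run", "connect", "river"]

print(matched_terms(doc_terms, query_terms))  # ['quick', 'run', 'connect']
print(match_ratio(doc_terms, query_terms))    # 0.75
```

A ratio like this is only a rough relevance signal; a real index would usually weight terms by frequency as well.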