tessl/pypi-snowballstemmer

Comprehensive Python stemming library providing 32 stemmers for 30 languages generated from Snowball algorithms.

- Workspace: tessl
- Visibility: Public
- Describes: pypipkg:pypi/snowballstemmer@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-snowballstemmer@3.0.0

# Snowballstemmer

A Python stemming library providing 32 stemmers for 30 languages, generated from Snowball algorithms. It reduces words to their base forms for improved search and analysis, supporting multilingual text processing systems, search engines, and data analysis pipelines.

## Package Information

- **Package Name**: snowballstemmer
- **Language**: Python
- **Installation**: `pip install snowballstemmer`
- **License**: BSD-3-Clause
- **Version**: 3.0.1

## Core Imports

```python
import snowballstemmer
```

Access the public API functions:

```python
from snowballstemmer import algorithms, stemmer
```

## Basic Usage

```python
import snowballstemmer

# Get list of available languages
languages = snowballstemmer.algorithms()
print(f"Available languages: {len(languages)} total")

# Create a stemmer for English
stemmer = snowballstemmer.stemmer('english')

# Stem individual words
stemmed = stemmer.stemWord('running')
print(f"running -> {stemmed}")  # prints: running -> run

# Stem multiple words at once
words = ['running', 'connected', 'connections', 'easily']
stemmed_words = stemmer.stemWords(words)
print(f"Original: {words}")
print(f"Stemmed: {stemmed_words}")

# Use different languages
french_stemmer = snowballstemmer.stemmer('french')
spanish_stemmer = snowballstemmer.stemmer('spanish')

print(f"French: 'connexions' -> {french_stemmer.stemWord('connexions')}")
print(f"Spanish: 'corriendo' -> {spanish_stemmer.stemWord('corriendo')}")
```

## Capabilities

### Language Discovery

Retrieve the names of the available stemming algorithms.

```python { .api }
def algorithms():
    """
    Get list of available stemming algorithm names.

    Returns:
        list: List of strings naming the available algorithms (for example, 'english')

    Note:
        Automatically returns C extension algorithms if available, otherwise pure Python algorithms.
        This function checks for the Stemmer C extension and falls back gracefully.
    """
```
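
As a small illustration of language discovery, the sketch below lists the reported names and checks a requested name before a stemmer is built. The helper `is_supported` is hypothetical and not part of the library; it only wraps a membership test on the result of `algorithms()`.

```python
import snowballstemmer


def is_supported(name):
    """Return True if `name` is among the names reported by algorithms().

    Hypothetical helper for illustration; not part of snowballstemmer.
    """
    return name in snowballstemmer.algorithms()


names = sorted(snowballstemmer.algorithms())
print(f"{len(names)} algorithms, starting with {names[:3]}")
print(is_supported("english"))
print(is_supported("klingon"))
```

Checking membership first avoids relying on an exception for control flow when the language name comes from user input.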

### Stemmer Factory

Create stemmer instances for specific languages with automatic fallback between C extension and pure Python implementations.

```python { .api }
def stemmer(lang):
    """
    Create a stemmer instance for the specified language.

    Parameters:
        lang (str): Language code for desired stemming algorithm.
            Supports multiple formats: 'english', 'en', 'eng'

    Returns:
        Stemmer: A stemmer instance with stemWord() and stemWords() methods

    Raises:
        KeyError: If stemming algorithm for language not found

    Note:
        Automatically uses C extension (Stemmer.Stemmer) if available,
        otherwise falls back to pure Python implementation.
    """
```
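
Because a stemmer instance can be reused for any number of words, one possible pattern is to build each language's stemmer once and cache it. The sketch below assumes that reuse is acceptable for the application (for example, single-threaded use); `get_stemmer` is a hypothetical helper, not a library function.

```python
from functools import lru_cache

import snowballstemmer


@lru_cache(maxsize=None)
def get_stemmer(lang):
    """Build a stemmer for `lang` once and return the same object afterwards.

    Hypothetical helper; a KeyError from stemmer() is not cached and
    propagates to the caller.
    """
    return snowballstemmer.stemmer(lang)


first = get_stemmer("english")
second = get_stemmer("english")
print(first is second)  # True: the cached instance is reused
print(first.stemWord("running"))
```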

### Word Stemming

Core stemming functionality for reducing words to their base forms. The stemmer instances returned by `stemmer()` provide these methods:

```python { .api }
# Stemmer instance methods (available on returned stemmer objects)
def stemWord(word):
    """
    Stem a single word to its base form.

    Parameters:
        word (str): Word to stem

    Returns:
        str: Stemmed word
    """

def stemWords(words):
    """
    Stem multiple words to their base forms.

    Parameters:
        words (list): List of words to stem

    Returns:
        list: List of stemmed words in same order
    """
```
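
Since `stemWords()` returns one stem per input word in the same order, inputs and outputs can be paired with `zip`. The following sketch groups words by their stem; `group_by_stem` is a hypothetical helper introduced only for this example.

```python
from collections import defaultdict

import snowballstemmer


def group_by_stem(words, lang="english"):
    """Map each stem to the original words that reduce to it.

    Hypothetical helper; relies on stemWords() preserving input order.
    """
    stemmer = snowballstemmer.stemmer(lang)
    groups = defaultdict(list)
    for word, stem in zip(words, stemmer.stemWords(words)):
        groups[stem].append(word)
    return dict(groups)


groups = group_by_stem(["connected", "connections", "running", "runs"])
for stem, originals in groups.items():
    print(f"{stem}: {originals}")
```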

## Supported Languages

Snowball stemmer supports 32 stemming algorithms covering 30 languages, most with one or more short aliases.

### Primary Languages (30 total)

- **arabic** (ar, ara): Arabic stemming
- **armenian** (hy, hye, arm): Armenian stemming
- **basque** (eu, eus, baq): Basque stemming
- **catalan** (ca, cat): Catalan stemming
- **danish** (da, dan): Danish stemming
- **dutch** (nl, dut, nld, kraaij_pohlmann): Dutch stemming (Kraaij-Pohlmann algorithm)
- **english** (en, eng): English stemming
- **esperanto** (eo, epo): Esperanto stemming
- **estonian** (et, est): Estonian stemming
- **finnish** (fi, fin): Finnish stemming
- **french** (fr, fre, fra): French stemming
- **german** (de, ger, deu): German stemming
- **greek** (el, gre, ell): Greek stemming
- **hindi** (hi, hin): Hindi stemming
- **hungarian** (hu, hun): Hungarian stemming
- **indonesian** (id, ind): Indonesian stemming
- **irish** (ga, gle): Irish stemming
- **italian** (it, ita): Italian stemming
- **lithuanian** (lt, lit): Lithuanian stemming
- **nepali** (ne, nep): Nepali stemming
- **norwegian** (no, nor): Norwegian stemming
- **portuguese** (pt, por): Portuguese stemming
- **romanian** (ro, rum, ron): Romanian stemming
- **russian** (ru, rus): Russian stemming
- **serbian** (sr, srp): Serbian stemming
- **spanish** (es, esl, spa): Spanish stemming
- **swedish** (sv, swe): Swedish stemming
- **tamil** (ta, tam): Tamil stemming
- **turkish** (tr, tur): Turkish stemming
- **yiddish** (yi, yid): Yiddish stemming

### Algorithm Variants (2 additional)

- **porter**: Traditional Porter algorithm for English (variant of english)
- **dutch_porter**: Martin Porter's Dutch stemmer (variant of dutch)
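
To see how a variant differs from the primary algorithm for the same language, the two can be run side by side. This is a minimal sketch using the `english` and `porter` names listed above; the word list is arbitrary, and the output is printed rather than asserted because the two algorithms agree on many words.

```python
import snowballstemmer

words = ["generously", "fairly", "running", "skies"]

english = snowballstemmer.stemmer("english")
porter = snowballstemmer.stemmer("porter")

# Print both stems for each word so any differences are easy to spot
for word in words:
    print(f"{word}: english={english.stemWord(word)} porter={porter.stemWord(word)}")
```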

## Character Encoding Support

- **UTF-8**: All 32 languages/variants
- **ISO-8859-1**: basque, catalan, danish, dutch, english, finnish, french, german, indonesian, irish, italian, norwegian, portuguese, spanish, swedish, porter, dutch_porter
- **ISO-8859-2**: hungarian
- **KOI8-R**: russian
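
Every example in this document passes Python `str` values to the stemmer. Assuming that is how the stemmers are used, text stored in one of the legacy encodings above would be decoded to `str` before stemming. The sketch below simulates this with bytes encoded as ISO-8859-1; the sample text is made up for illustration.

```python
import snowballstemmer

# Simulate bytes read from an ISO-8859-1 encoded file
raw = "conexión fácilmente".encode("iso-8859-1")

# Decode to str first; the stemmer then receives ordinary Unicode text
text = raw.decode("iso-8859-1")

stemmer = snowballstemmer.stemmer("spanish")
stems = stemmer.stemWords(text.split())
print(stems)
```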

## Performance Optimization

The library automatically uses C extensions when available for significant performance improvements. The `algorithms()` and `stemmer()` functions transparently choose the best available implementation:

```python
import snowballstemmer

# Automatically uses C extension if available, pure Python otherwise
stemmer = snowballstemmer.stemmer('english')

# Both implementations provide identical API
# C extension: faster performance
# Pure Python: broader compatibility
```
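
One way to find out what a given environment offers is to check whether the `Stemmer` module named in the API notes is importable, and to time a batch of words. This is a rough sketch: the presence of the module is only a hint about which implementation is in use, and the timing depends on the machine.

```python
import importlib.util
import timeit

import snowballstemmer

# The C extension is exposed as a module named "Stemmer"; its presence
# is treated here only as a hint, not as proof of which path is taken.
has_c_extension = importlib.util.find_spec("Stemmer") is not None
print(f"Stemmer C extension importable: {has_c_extension}")

stemmer = snowballstemmer.stemmer("english")
words = ["running", "connected", "connections", "easily"] * 1000

# Time five passes over the whole list with a single reused stemmer
seconds = timeit.timeit(lambda: stemmer.stemWords(words), number=5)
print(f"5 passes over {len(words)} words took {seconds:.3f}s")
```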

## Error Handling

```python
import snowballstemmer

try:
    # Invalid language code
    stemmer = snowballstemmer.stemmer('klingon')
except KeyError as e:
    print(f"Language not supported: {e}")

# Safe language checking
available_langs = snowballstemmer.algorithms()
if 'german' in available_langs:
    german_stemmer = snowballstemmer.stemmer('german')
else:
    print("German stemming not available")
```
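
When an unknown language should not stop processing, the `KeyError` can be turned into a fallback. The sketch below falls back to a default language; `stemmer_with_fallback` and its `default` parameter are hypothetical names chosen for this example, and whether a fallback is appropriate depends on the application.

```python
import snowballstemmer


def stemmer_with_fallback(lang, default="english"):
    """Return a stemmer for `lang`, or for `default` if `lang` is unknown.

    Hypothetical helper for illustration; not part of snowballstemmer.
    """
    try:
        return snowballstemmer.stemmer(lang)
    except KeyError:
        return snowballstemmer.stemmer(default)


stemmer = stemmer_with_fallback("klingon")
print(stemmer.stemWord("running"))  # stemmed with the English fallback
```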

## Advanced Usage Examples

### Batch Processing

```python
import snowballstemmer

def process_multilingual_text(text_dict):
    """Process text in multiple languages."""
    results = {}

    for lang, words in text_dict.items():
        try:
            stemmer = snowballstemmer.stemmer(lang)
            results[lang] = stemmer.stemWords(words)
        except KeyError:
            print(f"Warning: Language '{lang}' not supported")
            results[lang] = words  # Return original words

    return results

# Example usage
texts = {
    'english': ['running', 'connection', 'easily'],
    'french': ['connexions', 'facilement', 'courant'],
    'spanish': ['corriendo', 'conexión', 'fácilmente']
}

stemmed_results = process_multilingual_text(texts)
for lang, words in stemmed_results.items():
    print(f"{lang}: {words}")
```

### Search Index Preparation

```python
import snowballstemmer
import re

class SearchIndexer:
    def __init__(self, language='english'):
        self.stemmer = snowballstemmer.stemmer(language)
        self.word_pattern = re.compile(r'\b\w+\b')

    def index_document(self, text):
        """Extract and stem words from document text."""
        words = self.word_pattern.findall(text.lower())
        return self.stemmer.stemWords(words)

    def normalize_query(self, query):
        """Normalize search query for matching."""
        words = self.word_pattern.findall(query.lower())
        return self.stemmer.stemWords(words)

# Example usage
indexer = SearchIndexer('english')
document = "The quick brown foxes are running through the connected fields"
query = "quick brown fox running connections"

doc_terms = indexer.index_document(document)
query_terms = indexer.normalize_query(query)

print(f"Document terms: {doc_terms}")
print(f"Query terms: {query_terms}")
# Both 'running' and 'connected'/'connections' will match their stemmed forms
```
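
Building on the same tokenize-and-stem step, stemmed terms can feed a small inverted index so that a query matches documents containing different forms of the same word. This is a minimal sketch, not a full search engine: the `documents` dictionary, the `stems` and `search` helpers, and the "match every query stem" rule are all assumptions made for the example.

```python
import re
from collections import defaultdict

import snowballstemmer

stemmer = snowballstemmer.stemmer("english")
word_pattern = re.compile(r"\b\w+\b")


def stems(text):
    """Lowercase, tokenize, and stem `text`; returns a set of stems."""
    return set(stemmer.stemWords(word_pattern.findall(text.lower())))


documents = {
    "doc1": "The quick brown foxes are running through the connected fields",
    "doc2": "Slow green turtles rest near the pond",
}

# Inverted index: stem -> ids of the documents containing it
index = defaultdict(set)
for doc_id, text in documents.items():
    for stem in stems(text):
        index[stem].add(doc_id)


def search(query):
    """Return ids of documents that contain every stem in the query."""
    query_stems = stems(query)
    if not query_stems:
        return set()
    return set.intersection(*(index.get(s, set()) for s in query_stems))


print(search("fox running connections"))  # {'doc1'}
```

Stemming both the documents and the query with the same algorithm is what makes "fox" match "foxes" and "connections" match "connected" here.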