# PyStemmer


PyStemmer provides access to efficient algorithms for calculating a "stemmed" form of a word by wrapping the libstemmer library from the Snowball project in a Python module. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing "cycling" given the query "cycles".
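The search scenario above can be sketched as a small inverted index keyed on stems rather than surface words. This is an illustrative sketch, not part of PyStemmer: `build_stem_index`, `search`, and `toy_stem` are hypothetical names, and `toy_stem` is a crude stand-in so the snippet runs without the library. With PyStemmer installed, you would pass `Stemmer.Stemmer('english').stemWord` as the `stem` argument instead.

```python
from collections import defaultdict


def build_stem_index(docs, stem):
    """Map each stem to the set of document ids whose text contains it.

    `docs` is a dict of id -> text; `stem` is any callable that maps a word
    to its stem (with PyStemmer: Stemmer.Stemmer('english').stemWord).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem):
    """Return ids of documents sharing at least one stem with the query."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(stem(word), set())
    return hits


def toy_stem(word):
    # Stand-in for illustration only: strips a few English suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


docs = {1: "cycling in the rain", 2: "baking bread"}
index = build_stem_index(docs, toy_stem)
print(search(index, "cycles", toy_stem))  # {1}
```

Because both the indexed words and the query words pass through the same stemming function, "cycles" and "cycling" land on the same index key.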


## Package Information

- **Package Name**: PyStemmer
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install pystemmer`
- **License**: MIT, BSD

## Core Imports

```python
import Stemmer
```

## Basic Usage

```python
import Stemmer

# Get list of available algorithms
algorithms = Stemmer.algorithms()
print(algorithms)  # ['arabic', 'armenian', 'basque', 'catalan', ...]

# Create a stemmer instance for English
stemmer = Stemmer.Stemmer('english')

# Stem a single word
stemmed = stemmer.stemWord('cycling')
print(stemmed)  # 'cycl'

# Stem multiple words
stems = stemmer.stemWords(['cycling', 'cyclist', 'cycles'])
print(stems)  # ['cycl', 'cyclist', 'cycl']

# Configure cache size (default: 10000)
stemmer.maxCacheSize = 5000

# Disable cache entirely
stemmer.maxCacheSize = 0
```

## Architecture

PyStemmer wraps the libstemmer_c library through a Cython extension for high performance. The `Stemmer` class maintains internal state, including a cache that speeds up stemming of repeated words. Each stemmer instance is tied to a specific language algorithm and is not thread-safe; create separate instances for concurrent use.

## Capabilities

### Algorithm Discovery

Query available stemming algorithms and get version information.

```python { .api }
def algorithms(aliases=False):
    """
    Get a list of the names of the available stemming algorithms.

    Args:
        aliases (bool, optional): If False (default), returns only canonical
            algorithm names; if True, includes aliases

    Returns:
        list: List of strings containing algorithm names
    """

def version():
    """
    Get the version string of the stemming module.

    Note: This returns the internal libstemmer version (currently '2.0.1'),
    which may differ from the PyStemmer package version.

    Returns:
        str: Version string for the internal stemmer module
    """
```

### Stemmer Class

Core stemming functionality with caching support for high performance.

```python { .api }
class Stemmer:
    def __init__(self, algorithm, maxCacheSize=10000):
        """
        Initialize a stemmer for the specified algorithm.

        Args:
            algorithm (str): Name of stemming algorithm to use (from algorithms() list)
            maxCacheSize (int, optional): Maximum cache size, default 10000,
                set to 0 to disable cache

        Raises:
            KeyError: If algorithm not found
        """

    @property
    def maxCacheSize(self):
        """
        Maximum number of entries to allow in the cache.

        This may be set to zero to disable the cache entirely.
        The maximum cache size may be set at any point. Setting a smaller
        maximum size will trigger cache purging using an LRU-style algorithm
        that removes less recently used entries.

        Returns:
            int: Current maximum cache size
        """

    @maxCacheSize.setter
    def maxCacheSize(self, size):
        """
        Set maximum cache size.

        Args:
            size (int): New maximum size (0 disables cache completely)
        """

    def stemWord(self, word):
        """
        Stem a single word.

        Args:
            word (str or unicode): Word to stem, UTF-8 encoded string or unicode object

        Returns:
            str or unicode: Stemmed word (same type as input)
        """

    def stemWords(self, words):
        """
        Stem a sequence of words.

        Args:
            words (sequence): Sequence, iterator, or generator of words to stem

        Returns:
            list: List of stemmed words (preserves individual word encoding types)
        """

```
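Since `stemWords` is documented above as accepting a sequence, iterator, or generator, tokenized text can be fed to it lazily. The helper below is a minimal sketch under that assumption: `stem_counts` is a hypothetical name, the regular-expression tokenizer is deliberately simplistic (ASCII letters only), and `stemmer` is any object exposing `stemWords`, such as `Stemmer.Stemmer('english')`.

```python
import re
from collections import Counter


def stem_counts(text, stemmer):
    """Count how often each stem occurs in `text`.

    `stemmer` is assumed to expose stemWords(), as documented above,
    e.g. Stemmer.Stemmer('english') when PyStemmer is installed.
    """
    # Lazily yield lowercase ASCII words; stemWords consumes the generator.
    tokens = (m.group(0) for m in re.finditer(r"[a-z]+", text.lower()))
    return Counter(stemmer.stemWords(tokens))


# Usage with PyStemmer (not executed here):
#   import Stemmer
#   counts = stem_counts("Cycling cyclists cycle", Stemmer.Stemmer('english'))
```

Passing the whole token stream to one `stemWords` call keeps the loop inside the library rather than calling `stemWord` once per token from Python.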

## Supported Languages

PyStemmer supports 25+ languages through the Snowball algorithms:

- **European**: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Romanian, Hungarian, Greek, Catalan, Basque, Irish, Lithuanian
- **Middle Eastern**: Arabic, Turkish
- **Asian**: Hindi, Indonesian, Nepali, Tamil
- **Other**: Yiddish, Serbian, Armenian, Esperanto

Special algorithms:

- **porter**: Classic Porter stemming algorithm for English (for research/compatibility)
- **english**: Improved Snowball English algorithm (recommended for most users)

## Thread Safety

Stemmer instances are **not thread-safe**. For concurrent processing:

1. **Recommended**: Create separate `Stemmer` instances for each thread
2. **Alternative**: Use threading locks to protect shared stemmer access (may reduce performance)

The stemmer code itself is re-entrant, so multiple instances can run concurrently without issues.
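One way to follow the recommended option is to keep a stemmer in thread-local storage so each thread lazily builds its own instance. The sketch below assumes a hypothetical wrapper class, `PerThreadStemmer`, that takes a zero-argument factory; it is not part of PyStemmer.

```python
import threading


class PerThreadStemmer:
    """Hand each thread its own stemmer, created lazily on first use."""

    def __init__(self, factory):
        # `factory` is a zero-argument callable that builds a new stemmer,
        # e.g. lambda: Stemmer.Stemmer('english') when PyStemmer is installed.
        self._factory = factory
        self._local = threading.local()

    def get(self):
        """Return the calling thread's stemmer, creating it if needed."""
        stemmer = getattr(self._local, "stemmer", None)
        if stemmer is None:
            stemmer = self._local.stemmer = self._factory()
        return stemmer


# Usage with PyStemmer (not executed here):
#   import Stemmer
#   stemmers = PerThreadStemmer(lambda: Stemmer.Stemmer('english'))
#   stem = stemmers.get().stemWord('cycling')  # call from any worker thread
```

Each thread also ends up with its own cache, which trades some memory for avoiding lock contention.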


## Performance

- Built on high-performance C extensions via Cython
- Internal caching significantly improves performance for repeated words
- Default cache size (10000) is optimized for typical text processing
- Cache can be tuned or disabled based on usage patterns
- Reuse stemmer instances rather than creating new ones for each operation

### Cache Behavior

The internal cache uses an LRU-style purging strategy:

- When cache size exceeds `maxCacheSize`, older entries are purged
- Purging retains approximately 80% of the most recently used entries
- Each word access updates its usage counter for LRU tracking

- Cache lookups and updates happen automatically during stemming operations
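The behavior described in this list can be modeled in pure Python. The class below is only an illustration of the idea, not PyStemmer's actual implementation: `StemCacheSketch` and its attribute names are hypothetical, and the choice to keep 80% of `max_size` entries after a purge is an assumption based on the "approximately 80%" figure above.

```python
class StemCacheSketch:
    """Toy cache: word -> (stem, last_used), purged LRU-style when too large."""

    def __init__(self, stem, max_size=10000, keep_ratio=0.8):
        self.stem = stem              # callable that computes a stem on a miss
        self.max_size = max_size
        self.keep_ratio = keep_ratio  # assumed share of max_size kept on purge
        self.counter = 0              # incremented on every lookup
        self.entries = {}

    def lookup(self, word):
        self.counter += 1
        if self.max_size <= 0:        # a size of 0 disables caching
            return self.stem(word)
        if word in self.entries:
            result = self.entries[word][0]
        else:
            result = self.stem(word)
        # Record (or refresh) the usage counter for this word.
        self.entries[word] = (result, self.counter)
        if len(self.entries) > self.max_size:
            self._purge()
        return result

    def _purge(self):
        # Keep only the most recently used entries.
        keep = int(self.max_size * self.keep_ratio)
        newest = sorted(self.entries.items(), key=lambda kv: kv[1][1], reverse=True)
        self.entries = dict(newest[:keep])


cache = StemCacheSketch(str.upper, max_size=5)  # str.upper stands in for a stemmer
for w in ["a", "b", "c", "d", "e", "a", "f"]:
    cache.lookup(w)
print(sorted(cache.entries))  # ['a', 'd', 'e', 'f']
```

In the usage above, inserting "f" pushes the cache past five entries, so the purge keeps the four most recently used words; "a" survives because it was looked up again just before.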


## Error Handling

- `KeyError`: Raised when creating stemmer with unknown algorithm name
- Use `Stemmer.algorithms()` to get list of valid algorithm names
- Input encoding is handled automatically (UTF-8 strings and unicode objects supported)

- Empty strings are processed normally without raising an exception; filter out `None` and other non-string inputs before stemming, since their handling is not guaranteed


- Cache operations are transparent and do not raise exceptions under normal conditions
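
The `KeyError` case can be handled at construction time. The helper below is a hedged sketch: `create_stemmer` is a hypothetical name, and `factory` stands for the stemmer constructor (`Stemmer.Stemmer` with PyStemmer), which per the notes above raises `KeyError` for an unknown algorithm name.

```python
def create_stemmer(factory, name, fallback="english"):
    """Build a stemmer for `name`, falling back when the name is unknown.

    `factory` is the stemmer constructor (Stemmer.Stemmer with PyStemmer);
    it is expected to raise KeyError for an unknown algorithm name.
    """
    try:
        return factory(name)
    except KeyError:
        # Unknown algorithm: retry with the fallback language.
        return factory(fallback)


# Usage with PyStemmer (not executed here):
#   import Stemmer
#   stemmer = create_stemmer(Stemmer.Stemmer, user_supplied_language)
```

Falling back silently suits best-effort pipelines; where a wrong language would be a bug, let the `KeyError` propagate or validate the name against `Stemmer.algorithms()` first.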