Tessl Tile for pypi/pystemmer@3.0.0

# PyStemmer

PyStemmer provides access to efficient algorithms for calculating a "stemmed" form of a word by wrapping the libstemmer library from the Snowball project in a Python module. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing "cycling" given the query "cycles".

## Package Information

- **Package Name**: PyStemmer
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install pystemmer`
- **License**: MIT, BSD

## Core Imports

```python
import Stemmer
```

## Basic Usage

```python
import Stemmer

# Get list of available algorithms
algorithms = Stemmer.algorithms()
print(algorithms)  # ['arabic', 'armenian', 'basque', 'catalan', ...]

# Create a stemmer instance for English
stemmer = Stemmer.Stemmer('english')

# Stem a single word
stemmed = stemmer.stemWord('cycling')
print(stemmed)  # 'cycl'

# Stem multiple words
stems = stemmer.stemWords(['cycling', 'cyclist', 'cycles'])
print(stems)  # ['cycl', 'cyclist', 'cycl']

# Configure cache size (default: 10000)
stemmer.maxCacheSize = 5000

# Disable cache entirely
stemmer.maxCacheSize = 0
```
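
The search use case from the overview can be sketched as a small stem-based inverted index. This is a minimal illustration, not part of PyStemmer: `build_index` and `search` are hypothetical helpers, tokenisation is a naive whitespace split, and `stemmer` is assumed to be any object with a `stemWords` method, such as `Stemmer.Stemmer('english')`.

```python
from collections import defaultdict


def build_index(docs, stemmer):
    """Map each stem to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenisation; a real system needs a proper tokenizer.
        for stem in stemmer.stemWords(text.lower().split()):
            index[stem].add(doc_id)
    return index


def search(index, query, stemmer):
    """Return ids of documents that contain every stemmed query term."""
    stems = stemmer.stemWords(query.lower().split())
    if not stems:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(stems[0], set()))
    for stem in stems[1:]:
        result &= index.get(stem, set())
    return result
```

Given the stems shown in the basic usage example, passing `Stemmer.Stemmer('english')` as `stemmer` should let the query `cycles` match a document containing `cycling`, since both reduce to `cycl`.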

## Architecture

PyStemmer wraps the libstemmer_c library through Cython extensions for high performance. The `Stemmer` class maintains internal state, including a cache that speeds up stemming of repeated words. Each stemmer instance is tied to a specific language algorithm and is not thread-safe; create separate instances for concurrent use.

## Capabilities

### Algorithm Discovery

Query available stemming algorithms and get version information.

```python { .api }
def algorithms(aliases=False):
    """
    Get a list of the names of the available stemming algorithms.

    Args:
        aliases (bool, optional): If False (default), returns only canonical
            algorithm names; if True, includes aliases

    Returns:
        list: List of strings containing algorithm names
    """

def version():
    """
    Get the version string of the stemming module.

    Note: This returns the internal libstemmer version (currently '2.0.1'),
    which may differ from the PyStemmer package version.

    Returns:
        str: Version string for the internal stemmer module
    """
```
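
To see which names are aliases rather than canonical algorithm names, the two forms of `algorithms()` can be compared. The helper below is a hypothetical sketch that works on plain lists, so it would be called as `split_aliases(Stemmer.algorithms(), Stemmer.algorithms(aliases=True))`.

```python
def split_aliases(canonical, with_aliases):
    """Return (canonical_names, alias_only_names), both sorted.

    `canonical` is assumed to be the result of algorithms() and
    `with_aliases` the result of algorithms(aliases=True).
    """
    canonical_set = set(canonical)
    # Anything that only shows up when aliases are requested is an alias.
    alias_only = set(with_aliases) - canonical_set
    return sorted(canonical_set), sorted(alias_only)
```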

### Stemmer Class

Core stemming functionality with caching support for high performance.

```python { .api }
class Stemmer:
    def __init__(self, algorithm, maxCacheSize=10000):
        """
        Initialize a stemmer for the specified algorithm.

        Args:
            algorithm (str): Name of stemming algorithm to use (from algorithms() list)
            maxCacheSize (int, optional): Maximum cache size, default 10000,
                set to 0 to disable cache

        Raises:
            KeyError: If algorithm not found
        """

    @property
    def maxCacheSize(self):
        """
        Maximum number of entries to allow in the cache.

        This may be set to zero to disable the cache entirely.
        The maximum cache size may be set at any point. Setting a smaller
        maximum size will trigger cache purging using an LRU-style algorithm
        that removes less recently used entries.

        Returns:
            int: Current maximum cache size
        """

    @maxCacheSize.setter
    def maxCacheSize(self, size):
        """
        Set maximum cache size.

        Args:
            size (int): New maximum size (0 disables cache completely)
        """

    def stemWord(self, word):
        """
        Stem a single word.

        Args:
            word (str or bytes): Word to stem; a unicode string (str) or a
                UTF-8 encoded byte string

        Returns:
            str or bytes: Stemmed word (same type as input)
        """

    def stemWords(self, words):
        """
        Stem a sequence of words.

        Args:
            words (sequence): Sequence, iterator, or generator of words to stem

        Returns:
            list: List of stemmed words (each stem keeps the type of its input word)
        """
```
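
Because `stemWords` returns a list, stemming a very large stream in a single call holds every stem in memory at once. One way to bound memory is to feed the stemmer fixed-size batches. The generator below is a sketch under that assumption; `stem_in_batches` and `batch_size` are illustrative names, and `stemmer` is any object exposing `stemWords`, for example a `Stemmer.Stemmer` instance.

```python
from itertools import islice


def stem_in_batches(words, stemmer, batch_size=1000):
    """Lazily yield stems for an arbitrarily long iterable of words."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(words)
    while True:
        # Pull at most batch_size words; an empty batch means we are done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield from stemmer.stemWords(batch)
```

Reusing one stemmer across all batches also keeps its cache warm, which matches the advice to reuse instances rather than create new ones per operation.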

## Supported Languages

PyStemmer supports 25+ languages through the Snowball algorithms:

- **European**: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Romanian, Hungarian, Greek, Catalan, Basque, Irish, Lithuanian
- **Middle Eastern**: Arabic, Turkish
- **Asian**: Hindi, Indonesian, Nepali, Tamil
- **Other**: Yiddish, Serbian, Armenian, Esperanto

Special algorithms:
- **porter**: Classic Porter stemming algorithm for English (for research/compatibility)
- **english**: Improved Snowball English algorithm (recommended for most users)
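
When choosing between `porter` and `english`, it can help to compare their output on a sample of your own vocabulary. The helper below is a hypothetical sketch; `stemmers` is assumed to be a dict such as `{"porter": Stemmer.Stemmer("porter"), "english": Stemmer.Stemmer("english")}`, and no particular output is claimed here.

```python
def compare_stems(words, stemmers):
    """Return {word: {algorithm_name: stem}} for side-by-side comparison."""
    return {
        word: {name: stemmer.stemWord(word) for name, stemmer in stemmers.items()}
        for word in words
    }


def differing_words(comparison):
    """Return the words on which the algorithms disagree, in sorted order."""
    return sorted(
        word for word, stems in comparison.items() if len(set(stems.values())) > 1
    )
```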

## Thread Safety

Stemmer instances are **not thread-safe**. For concurrent processing:

1. **Recommended**: Create separate `Stemmer` instances for each thread
2. **Alternative**: Use threading locks to protect shared stemmer access (may reduce performance)

The stemmer code itself is re-entrant, so multiple instances can run concurrently without issues.
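
The recommended per-thread approach can be sketched with `threading.local`. The helper name `get_stemmer` and its `factory` parameter are assumptions made for illustration; by default the factory is `Stemmer.Stemmer`, imported lazily so the helper can also be exercised with a stand-in.

```python
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_stemmer(algorithm="english", factory=None):
    """Return a stemmer private to the calling thread, creating it on first use."""
    stemmers = getattr(_local, "stemmers", None)
    if stemmers is None:
        stemmers = _local.stemmers = {}
    if algorithm not in stemmers:
        if factory is None:
            import Stemmer  # PyStemmer, imported lazily (assumed to be installed)
            factory = Stemmer.Stemmer
        stemmers[algorithm] = factory(algorithm)
    return stemmers[algorithm]
```

Each worker thread then calls `get_stemmer()` instead of sharing a global instance, so no locking is needed and every thread keeps its own cache.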

## Performance

- Built on high-performance C extensions via Cython
- Internal caching significantly improves performance for repeated words
- The default cache size (10000 entries) is intended to suit typical text processing
- Cache can be tuned or disabled based on usage patterns
- Reuse stemmer instances rather than creating new ones for each operation

### Cache Behavior

The internal cache uses an LRU-style purging strategy:
- When the number of cached entries exceeds `maxCacheSize`, less recently used entries are purged
- Purging retains approximately 80% of the most recently used entries
- Each word access updates its usage counter for LRU tracking
- Cache lookups and updates happen automatically during stemming operations
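
The purge behaviour listed above can be modelled in a few lines of pure Python. This is a toy model for intuition only, not PyStemmer's actual implementation: the class name `PurgingCache`, the exact purge trigger, and the integer 80% computation are all assumptions.

```python
class PurgingCache:
    """Toy counter-based cache that purges down to about 80% when it overflows."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = {}  # word -> [stem, counter value at last use]
        self._counter = 0

    def get(self, word, compute):
        """Return the cached stem for `word`, calling `compute(word)` on a miss."""
        if self.max_size == 0:
            return compute(word)  # cache disabled
        self._counter += 1
        entry = self._entries.get(word)
        if entry is None:
            entry = self._entries[word] = [compute(word), self._counter]
            if len(self._entries) > self.max_size:
                self._purge()
        else:
            entry[1] = self._counter  # refresh recency on a hit
        return entry[0]

    def _purge(self):
        # Keep the most recently used 80% (rounded down) of max_size.
        keep = (self.max_size * 8) // 10
        by_recency = sorted(
            self._entries.items(), key=lambda item: item[1][1], reverse=True
        )
        self._entries = dict(by_recency[:keep])

    def __contains__(self, word):
        return word in self._entries

    def __len__(self):
        return len(self._entries)
```

Purging in bulk rather than evicting one entry per insertion means the sort cost is paid only occasionally, which is one plausible reason for retaining a fraction of the cache instead of trimming to the exact limit.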

## Error Handling

- `KeyError`: Raised when creating a stemmer with an unknown algorithm name
- Use `Stemmer.algorithms()` to get the list of valid algorithm names
- Input encoding is handled automatically (unicode strings and UTF-8 encoded byte strings are supported)
- No exceptions raised for empty strings or None inputs; they are processed normally
- Cache operations are transparent and do not raise exceptions under normal conditions
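
The first two points can be combined into a wrapper that turns the bare `KeyError` into a message listing valid names. This is a sketch with hypothetical names: `create_stemmer` is not part of PyStemmer, `factory` is expected to be `Stemmer.Stemmer`, and `available` the result of `Stemmer.algorithms()`.

```python
def create_stemmer(algorithm, factory, available):
    """Create a stemmer, replacing an unknown-algorithm KeyError with a clearer error.

    `factory` and `available` are passed in (rather than imported here) so the
    sketch does not depend on PyStemmer being installed.
    """
    try:
        return factory(algorithm)
    except KeyError:
        choices = ", ".join(sorted(available))
        raise ValueError(
            f"Unknown stemming algorithm {algorithm!r}; available: {choices}"
        ) from None
```

A call would look like `create_stemmer(name, Stemmer.Stemmer, Stemmer.algorithms())`.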
