Tessl Tile for pypi/rapidfuzz@3.14.0

# String Preprocessing

Utilities for normalizing strings before comparison. Preprocessing can significantly improve matching accuracy by neutralizing differences in case, punctuation, and surrounding whitespace.

## Capabilities

### Default Processor

Standard preprocessing function that normalizes a string for comparison by replacing non-alphanumeric characters with spaces, trimming leading and trailing whitespace, and converting to lowercase.

```python { .api }
def default_process(sentence: str) -> str
```

**Parameters:**
- `sentence`: Input string to preprocess

**Returns:** Normalized string containing only lowercase alphanumeric characters and spaces

**Processing Steps:**
1. Replaces each non-alphanumeric character with a space (letters and digits, including accented letters, are kept)
2. Trims leading and trailing whitespace
3. Converts to lowercase

Runs of whitespace inside the string are not collapsed, and accents are not stripped; use a custom processor if you need either.

**Usage Example:**
```python
from rapidfuzz import utils

# Basic preprocessing: each punctuation mark becomes a space
text = "Hello, World! 123"
processed = utils.default_process(text)
print(processed)  # "hello  world  123"

# Handling punctuation and case
text = "Don't worry, BE HAPPY!!!"
processed = utils.default_process(text)
print(processed)  # "don t worry  be happy"

# Leading and trailing whitespace is trimmed; inner runs are kept
text = "  multiple    spaces   "
processed = utils.default_process(text)
print(processed)  # "multiple    spaces"

# Accented letters are kept; symbols become spaces
text = "Café & Restaurant — $50"
processed = utils.default_process(text)
print(processed)  # "café   restaurant    50"
```
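
To reason about what a processed string will look like, the behavior can be approximated with the standard library alone. This is a sketch under stated assumptions, not the library's implementation: `approx_default_process` is a hypothetical name, and treating the underscore as non-alphanumeric is an assumption that may not match the library in every edge case.

```python
import re

# Hypothetical stand-in for utils.default_process, for illustration only.
# Assumption: every character that is not a letter or digit (including "_")
# is replaced by exactly one space.
_NON_ALNUM = re.compile(r"[\W_]")

def approx_default_process(sentence: str) -> str:
    spaced = _NON_ALNUM.sub(" ", sentence)  # one space per replaced character
    return spaced.strip().lower()           # trim the ends, then lowercase

print(repr(approx_default_process("Hello, World! 123")))       # 'hello  world  123'
print(repr(approx_default_process("  multiple    spaces   ")))  # 'multiple    spaces'
```

Because each replaced character leaves a space behind, the output can contain runs of spaces; wrap the result in `" ".join(result.split())` if single spaces are required.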

## Usage Patterns

### With Fuzz Functions

All fuzz functions accept an optional `processor` parameter:

```python
from rapidfuzz import fuzz, utils

s1 = "New York Jets"
s2 = "new york jets!!!"

# Without preprocessing
score1 = fuzz.ratio(s1, s2)
print(f"Raw: {score1:.1f}")  # Lower score due to case/punctuation differences

# With preprocessing
score2 = fuzz.ratio(s1, s2, processor=utils.default_process)
print(f"Processed: {score2:.1f}")  # 100.0 (perfect match after normalization)

# Works with all fuzz functions
score3 = fuzz.WRatio(s1, s2, processor=utils.default_process)
score4 = fuzz.token_sort_ratio(s1, s2, processor=utils.default_process)
score5 = fuzz.partial_ratio(s1, s2, processor=utils.default_process)
```
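
### With Process Functions

Process functions also support the `processor` parameter. The processor is applied to the query and to each choice for scoring, while the returned match is the original, unprocessed choice:
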
```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
query = "NEW YORK jets"

# Without preprocessing
match1 = process.extractOne(query, choices)
print(match1)  # Lower score due to case differences

# With preprocessing
match2 = process.extractOne(query, choices, processor=utils.default_process)
print(match2)  # Perfect match: ('New York Jets', 100.0, 1)

# Batch processing with preprocessing
matches = process.extract(query, choices,
                          processor=utils.default_process,
                          limit=3)
print(matches)  # All matches benefit from normalization
```

### With Distance Metrics

Distance metrics accept the same `processor` parameter:

```python
from rapidfuzz.distance import Levenshtein
from rapidfuzz import utils

s1 = "Testing"
s2 = "TESTING!!!"

# Raw distance
dist1 = Levenshtein.distance(s1, s2)
print(f"Raw distance: {dist1}")  # Higher due to case/punctuation

# With preprocessing
dist2 = Levenshtein.distance(s1, s2, processor=utils.default_process)
print(f"Processed distance: {dist2}")  # 0 (identical after preprocessing)
```

### Custom Preprocessing

Any function that takes a string and returns a string can be used as a processor, so you can write custom preprocessing for specific needs:

```python
import re
import unicodedata

from rapidfuzz import fuzz

def custom_preprocess(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation; \w keeps letters, digits, and underscores
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of whitespace and trim the ends
    text = ' '.join(text.split())
    return text

def phone_preprocess(phone):
    """Preprocessing for phone numbers"""
    # Remove all non-digits
    return re.sub(r'\D', '', phone)

# Use custom preprocessor
score = fuzz.ratio("Call (555) 123-4567", "555.123.4567",
                   processor=phone_preprocess)
print(score)  # 100.0 (identical after removing formatting)

def name_preprocess(name):
    """Preprocessing for person names"""
    # Decompose accented characters, then drop the combining marks
    name = unicodedata.normalize('NFKD', name)
    name = ''.join(c for c in name if not unicodedata.combining(c))
    # Convert to lowercase and remove punctuation
    name = re.sub(r'[^\w\s]', '', name.lower())
    # Drop generational suffixes as whole words only, so names that merely
    # contain "sr" or "ii" (e.g. "Srinivasan") are left intact
    name = re.sub(r'\b(?:jr|sr|ii|iii)\b', '', name)
    return ' '.join(name.split())

# Better name matching
score = fuzz.ratio("José Martinez Jr.", "jose martinez",
                   processor=name_preprocess)
print(score)  # High score despite accent and suffix differences
```
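
Since a processor is just a single-argument function, small steps can be chained into one callable rather than rewritten for each use case. The following is a minimal sketch using only the standard library; `compose`, `strip_accents`, `drop_punctuation`, and `collapse_whitespace` are hypothetical helper names, not part of rapidfuzz.

```python
import re
import unicodedata
from functools import reduce
from typing import Callable

Processor = Callable[[str], str]

def compose(*steps: Processor) -> Processor:
    """Return a processor that applies the given steps from left to right."""
    def pipeline(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return pipeline

def strip_accents(text: str) -> str:
    # Decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def drop_punctuation(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text)

def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())

name_processor = compose(strip_accents, str.lower, drop_punctuation, collapse_whitespace)
print(name_processor("  José   MARTÍNEZ, Jr. "))  # "jose martinez jr"
```

The resulting `name_processor` can be passed as `processor=` in the same way as the hand-written functions above.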

### When to Use Preprocessing

**Use preprocessing when:**
- Case differences are not meaningful ("Hello" vs "HELLO")
- Punctuation doesn't affect meaning ("don't" vs "dont")
- Whitespace variations occur ("New York" vs "New  York")
- Special characters are formatting-only ("$100" vs "100")
- Unicode normalization is needed ("café" vs "cafe"); this requires a custom processor, since `default_process` keeps accents

**Don't use preprocessing when:**
- Case is semantically important (DNA sequences, product codes)
- Punctuation changes meaning ("can't" vs "cant", URLs)
- Exact character matching is required (passwords, checksums)
- Performance is critical and the strings are already normalized
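
The risk behind the second list is that normalization can make distinct inputs identical. The sketch below illustrates this with a hypothetical `normalize` helper that mimics default-style processing (non-alphanumerics to spaces, trim, lowercase) using only the standard library; it is an approximation for illustration, not the library function.

```python
import re

def normalize(text: str) -> str:
    """Hypothetical stand-in: non-alphanumerics -> spaces, trim, lowercase."""
    return re.sub(r"[\W_]", " ", text).strip().lower()

pairs = [("C++", "C#"), ("AB-12", "ab 12"), ("can't", "cant")]
for a, b in pairs:
    same = normalize(a) == normalize(b)
    print(f"{a!r} vs {b!r}: identical after normalization = {same}")
# 'C++' vs 'C#': True, 'AB-12' vs 'ab 12': True, "can't" vs 'cant': False
```

If pairs like the first two must stay distinguishable in your data, skip the processor or write one that preserves the characters that matter.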

### Performance Considerations

```python
from rapidfuzz import process, utils

choices = ["..."] * 10000  # Large choice list

# Passing a processor preprocesses the query and every choice on each call
query = "test query"
match = process.extractOne(query, choices, processor=utils.default_process)

# For repeated queries, preprocess choices once
processed_choices = [utils.default_process(choice) for choice in choices]
processed_query = utils.default_process(query)

# Now use without processor (already processed).
# The returned match is the processed string; use the returned index
# to look up the original entry in `choices`.
match = process.extractOne(processed_query, processed_choices)

# Or use a custom processor that caches results
preprocessing_cache = {}

def cached_preprocess(text):
    if text not in preprocessing_cache:
        preprocessing_cache[text] = utils.default_process(text)
    return preprocessing_cache[text]

match = process.extractOne(query, choices, processor=cached_preprocess)
```
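
As an alternative to a hand-rolled dictionary, the standard library's `functools.lru_cache` gives a bounded cache with hit statistics. The sketch below uses a hypothetical `normalize` function as a stand-in for an expensive processor; the `maxsize` value is an arbitrary assumption to tune for your data.

```python
from functools import lru_cache

def normalize(text: str) -> str:
    """Hypothetical stand-in for an expensive processor."""
    return " ".join(text.lower().split())

@lru_cache(maxsize=100_000)  # bounded, unlike a plain dict
def cached_normalize(text: str) -> str:
    return normalize(text)

for s in ["New York", "new  york", "New York"]:
    cached_normalize(s)

info = cached_normalize.cache_info()
print(info.hits, info.misses)  # 1 2
```

The cache is keyed on the raw input string, so it only pays off when the same strings are processed repeatedly, for example when many queries are run against one fixed choice list.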

## Advanced Preprocessing Patterns

### Domain-Specific Preprocessing

```python
import re
from rapidfuzz import fuzz

def address_preprocess(address):
    """Preprocessing for street addresses"""
    address = address.lower()
    # Standardize common abbreviations (an optional trailing period is dropped)
    address = re.sub(r'\b(?:street|st)\b\.?', 'st', address)
    address = re.sub(r'\b(?:avenue|ave)\b\.?', 'ave', address)
    address = re.sub(r'\b(?:boulevard|blvd)\b\.?', 'blvd', address)
    address = re.sub(r'\b(?:drive|dr)\b\.?', 'dr', address)
    # Remove apartment/suite numbers; the \b after the keyword keeps
    # words such as "united" or "steel" from being truncated
    address = re.sub(r'\b(?:apt|apartment|suite|ste|unit)\b\.?\s*\w+', '', address)
    # Drop remaining punctuation (e.g. the comma before "Apt")
    address = re.sub(r'[^\w\s]', ' ', address)
    return ' '.join(address.split())

def product_code_preprocess(code):
    """Preprocessing for product codes"""
    # Uppercase and remove common separators, keeping the character order
    code = code.upper()
    code = re.sub(r'[-_\s]', '', code)
    return code

# Usage examples
addr1 = "123 Main Street, Apt 4B"
addr2 = "123 main st"
score = fuzz.ratio(addr1, addr2, processor=address_preprocess)
print(f"Address match: {score}")  # 100.0 (both normalize to "123 main st")

code1 = "ABC-123-XYZ"
code2 = "abc123xyz"
score = fuzz.ratio(code1, code2, processor=product_code_preprocess)
print(f"Product code match: {score}")  # 100.0 (perfect match)
```
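
When the list of abbreviations grows, one substitution per rule becomes hard to maintain. A table-driven variant is sketched below, assuming a hypothetical `ABBREVIATIONS` mapping and `abbreviate` helper; the entries are illustrative and would need to be extended for real address data.

```python
import re

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "drive": "dr",
}

# One alternation over all long forms, matched as whole words only.
_LONG_FORMS = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
)

def abbreviate(address: str) -> str:
    lowered = address.lower()
    return _LONG_FORMS.sub(lambda m: ABBREVIATIONS[m.group(1)], lowered)

print(abbreviate("12 Ocean Drive"))      # "12 ocean dr"
print(abbreviate("5 Streetcar Avenue"))  # "5 streetcar ave"
```

The word boundaries keep "Streetcar" intact, and a single compiled pattern scans each address once regardless of how many entries the table holds.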

### Handling Non-ASCII Text

```python
import unicodedata
from rapidfuzz import fuzz, utils

def unicode_normalize_preprocess(text):
    """Handle accented characters and unicode normalization"""
    # Decompose characters into base letters plus combining marks
    text = unicodedata.normalize('NFKD', text)
    # Remove combining characters (accents)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Apply standard preprocessing
    return utils.default_process(text)

# International text matching
text1 = "Café résumé naïve"
text2 = "cafe resume naive"
score = fuzz.ratio(text1, text2, processor=unicode_normalize_preprocess)
print(f"Unicode normalized: {score}")  # Perfect match
```
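
Accent stripping does not cover every case-related difference: some characters, such as the German "ß", have no decomposition and are not equated by lowercasing. The sketch below combines the same NFKD step with the standard library's `str.casefold()`; `fold_text` is a hypothetical helper name, and whether such aggressive folding is appropriate depends on the language of your data.

```python
import unicodedata

def fold_text(text: str) -> str:
    # Decompose, drop combining marks, then apply full case folding
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

print("Straße".lower() == "STRASSE".lower())        # False
print(fold_text("Straße") == fold_text("STRASSE"))  # True
print(fold_text("Ångström"))                        # "angstrom"
```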
