# String Preprocessing

Utilities for normalizing strings before comparison. Preprocessing can significantly improve matching accuracy by removing differences in case, punctuation, and surrounding whitespace that carry no meaning.

## Capabilities

### Default Processor

Standard preprocessing function that normalizes a string for comparison by replacing non-alphanumeric characters with spaces, trimming leading and trailing whitespace, and converting to lowercase.

```python { .api }
def default_process(sentence: str) -> str
```

**Parameters:**
- `sentence`: Input string to preprocess

**Returns:** Normalized string containing only lowercase alphanumeric characters and spaces

**Processing Steps:**
1. Replaces every non-alphanumeric character with a space (letters and digits are kept, including accented letters)
2. Trims leading and trailing whitespace
3. Converts to lowercase

Runs of internal whitespace are not collapsed, and accents are not stripped. Use a custom processor (see Custom Preprocessing) when you need either.

**Usage Example:**
```python
from rapidfuzz import utils

# Basic preprocessing
text = "Hello, World! 123"
processed = utils.default_process(text)
print(processed)  # "hello  world  123" (each punctuation mark becomes a space)

# Handling punctuation and case
text = "Don't worry, BE HAPPY!!!"
processed = utils.default_process(text)
print(processed)  # "don t worry  be happy"

# Leading and trailing whitespace is trimmed; internal runs are kept
text = "  multiple   spaces  "
processed = utils.default_process(text)
print(processed)  # "multiple   spaces"

# Unicode letters are kept; symbols become spaces
text = "Café & Restaurant — $50"
processed = utils.default_process(text)
print(processed)  # "café   restaurant    50"
```
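
The processing steps can be approximated in pure Python. The sketch below is an illustration only, not the library's implementation; `sketch_default_process` and `collapse_process` are hypothetical names, and `str.isalnum()` is assumed to be a close enough stand-in for the library's notion of "alphanumeric". The second function adds whitespace collapsing for cases where single spaces are wanted.

```python
def sketch_default_process(sentence: str) -> str:
    """Approximation: non-alphanumerics become spaces, then trim and lowercase."""
    # str.isalnum() is Unicode-aware, so accented letters such as "é" are kept
    replaced = "".join(ch if ch.isalnum() else " " for ch in sentence)
    return replaced.strip().lower()


def collapse_process(sentence: str) -> str:
    """Hypothetical variant that also collapses runs of whitespace."""
    return " ".join(sketch_default_process(sentence).split())


print(repr(sketch_default_process("Hello, World! 123")))  # 'hello  world  123'
print(repr(collapse_process("Hello, World! 123")))        # 'hello world 123'
```

A function like `collapse_process` can be passed anywhere a `processor` callable is accepted.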

## Usage Patterns

### With Fuzz Functions

All fuzz functions accept an optional `processor` parameter:

```python
from rapidfuzz import fuzz, utils

s1 = "New York Jets"
s2 = "new york jets!!!"

# Without preprocessing
score1 = fuzz.ratio(s1, s2)
print(f"Raw: {score1:.1f}")  # Lower score due to case/punctuation differences

# With preprocessing
score2 = fuzz.ratio(s1, s2, processor=utils.default_process)
print(f"Processed: {score2:.1f}")  # 100.0 (perfect match after normalization)

# Works with all fuzz functions
score3 = fuzz.WRatio(s1, s2, processor=utils.default_process)
score4 = fuzz.token_sort_ratio(s1, s2, processor=utils.default_process)
score5 = fuzz.partial_ratio(s1, s2, processor=utils.default_process)
```
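
Conceptually, a `processor` is applied to both inputs before they are scored. The sketch below models that idea with the standard library only. It is a mental model, not RapidFuzz's implementation: `stand_in_ratio` uses `difflib` rather than a RapidFuzz scorer, and `score_with_processor` and `lower_alnum` are hypothetical helpers.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def stand_in_ratio(s1: str, s2: str) -> float:
    """Stand-in scorer on a 0-100 scale (difflib, not a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def score_with_processor(
    s1: str,
    s2: str,
    processor: Optional[Callable[[str], str]] = None,
) -> float:
    """Apply the processor to both inputs, then score the results."""
    if processor is not None:
        s1, s2 = processor(s1), processor(s2)
    return stand_in_ratio(s1, s2)


def lower_alnum(text: str) -> str:
    """Hypothetical processor: non-alphanumerics to spaces, trim, lowercase."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text)
    return kept.strip().lower()


raw = score_with_processor("New York Jets", "new york jets!!!")
processed = score_with_processor("New York Jets", "new york jets!!!", lower_alnum)
print(raw < processed, processed)  # True 100.0
```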

### With Process Functions

Process functions also support the `processor` parameter:

```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
query = "NEW YORK jets"

# Without preprocessing
match1 = process.extractOne(query, choices)
print(match1)  # Lower score due to case differences

# With preprocessing
match2 = process.extractOne(query, choices, processor=utils.default_process)
print(match2)  # Perfect match: ('New York Jets', 100.0, 1)

# Batch processing with preprocessing
matches = process.extract(query, choices,
                          processor=utils.default_process,
                          limit=3)
print(matches)  # All matches benefit from normalization
```

### With Distance Metrics

Distance metrics also support preprocessing:

```python
from rapidfuzz.distance import Levenshtein
from rapidfuzz import utils

s1 = "Testing"
s2 = "TESTING!!!"

# Raw distance
dist1 = Levenshtein.distance(s1, s2)
print(f"Raw distance: {dist1}")  # 9 (six case substitutions plus three insertions)

# With preprocessing
dist2 = Levenshtein.distance(s1, s2, processor=utils.default_process)
print(f"Processed distance: {dist2}")  # 0 (identical after preprocessing)
```

### Custom Preprocessing

You can create custom preprocessing functions for specific needs:

```python
import re
import unicodedata

from rapidfuzz import fuzz

def custom_preprocess(text):
    """Custom preprocessing for specific use case"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation; keep letters, digits, underscores, and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

def phone_preprocess(phone):
    """Preprocessing for phone numbers"""
    # Remove all non-digits
    return re.sub(r'[^\d]', '', phone)

# Use custom preprocessor
score = fuzz.ratio("Call (555) 123-4567", "555.123.4567",
                   processor=phone_preprocess)
print(score)  # 100.0 (identical after removing formatting)

def name_preprocess(name):
    """Preprocessing for person names"""
    # Normalize unicode (handle accents)
    name = unicodedata.normalize('NFKD', name)
    name = ''.join(c for c in name if not unicodedata.combining(c))
    # Convert to lowercase and remove punctuation
    name = re.sub(r'[^\w\s]', '', name.lower())
    # Drop common suffixes, matching whole words only so that
    # names such as "Israel" (which contains "sr") stay intact
    name = re.sub(r'\b(?:jr|sr|ii|iii)\b', '', name)
    return ' '.join(name.split())

# Better name matching
score = fuzz.ratio("José Martinez Jr.", "jose martinez",
                   processor=name_preprocess)
print(score)  # 100.0 (accent and suffix removed)
```
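
Custom processors often repeat the same small steps. One possible way to reuse them is to compose single-purpose functions into one callable, as sketched below. The `compose` helper and the step functions are hypothetical names introduced for illustration; they are not part of RapidFuzz.

```python
import re
import unicodedata
from functools import reduce
from typing import Callable

Processor = Callable[[str], str]


def compose(*steps: Processor) -> Processor:
    """Return a processor that applies the given steps from left to right."""
    def composed(text: str) -> str:
        return reduce(lambda value, step: step(value), steps, text)
    return composed


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def drop_punctuation(text: str) -> str:
    """Remove everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def collapse_whitespace(text: str) -> str:
    """Trim the ends and reduce internal whitespace runs to one space."""
    return " ".join(text.split())


clean_name = compose(strip_accents, str.lower, drop_punctuation, collapse_whitespace)
print(clean_name("  José   Martínez, Jr. "))  # jose martinez jr
```

The composed function takes one string and returns one string, so it can be passed as `processor=clean_name`.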

### When to Use Preprocessing

**Use preprocessing when:**
- Case differences are not meaningful ("Hello" vs "HELLO")
- Punctuation doesn't affect meaning ("don't" vs "dont")
- Whitespace variations occur ("New  York" vs "New York")
- Special characters are formatting-only ("$100" vs "100")
- Unicode normalization is needed ("café" vs "cafe")

`utils.default_process` handles case, punctuation, and surrounding whitespace. Collapsing internal whitespace, deleting punctuation outright, and stripping accents require a custom processor.

**Don't use preprocessing when:**
- Case is semantically important (DNA sequences, case-sensitive product codes)
- Punctuation changes meaning ("can't" vs "cant", URLs)
- Exact character matching is required (passwords, checksums)
- Performance is critical and the strings are already normalized
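
To illustrate why punctuation-sensitive data should skip preprocessing, the sketch below shows distinct inputs collapsing to the same string. `lower_alnum` is a hypothetical stand-in that approximates a default-style processor; it is not the library code.

```python
def lower_alnum(text: str) -> str:
    """Stand-in processor: non-alphanumerics to spaces, trim, lowercase."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text)
    return kept.strip().lower()


pairs = [("C++", "C#"), ("a-b", "a b"), ("v1.0", "v10")]
for left, right in pairs:
    same = lower_alnum(left) == lower_alnum(right)
    print(f"{left!r} vs {right!r}: identical after processing = {same}")
# 'C++' vs 'C#': identical after processing = True
# 'a-b' vs 'a b': identical after processing = True
# 'v1.0' vs 'v10': identical after processing = False
```

The first pair shows the risk: two different language names become indistinguishable once the symbols are gone.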

### Performance Considerations

```python
from rapidfuzz import process, utils

choices = ["..."] * 10000  # Large choice list

# With a processor, every choice is preprocessed again on each call
query = "test query"
match = process.extractOne(query, choices, processor=utils.default_process)

# For repeated queries, preprocess choices once
processed_choices = [utils.default_process(choice) for choice in choices]
processed_query = utils.default_process(query)

# Now use without processor (already processed)
match = process.extractOne(processed_query, processed_choices)
# The result holds the processed string; its index (match[2]) points
# back to the original entry in `choices`

# Or use a custom processor that caches results
preprocessing_cache = {}

def cached_preprocess(text):
    if text not in preprocessing_cache:
        preprocessing_cache[text] = utils.default_process(text)
    return preprocessing_cache[text]

match = process.extractOne(query, choices, processor=cached_preprocess)
```
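
A bounded alternative to a hand-written dictionary cache is `functools.lru_cache` from the standard library, which evicts old entries once `maxsize` is reached. In the sketch below, `base_process` is a hypothetical stand-in for whichever processor you actually use, and the `maxsize` value is an arbitrary assumption.

```python
from functools import lru_cache


def base_process(text: str) -> str:
    """Stand-in for the real processor (an approximation for illustration)."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text)
    return kept.strip().lower()


@lru_cache(maxsize=100_000)
def cached_process(text: str) -> str:
    """Memoized processor; arguments must be hashable, which str is."""
    return base_process(text)


choices = ["New York Jets", "Dallas Cowboys"] * 3
processed = [cached_process(choice) for choice in choices]
info = cached_process.cache_info()
print(info.hits, info.misses)  # 4 2
```

Caching only pays off when the same strings are processed repeatedly, for example when one choice list is queried many times.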

## Advanced Preprocessing Patterns

### Domain-Specific Preprocessing

```python
import re
from rapidfuzz import fuzz

def address_preprocess(address):
    """Preprocessing for street addresses"""
    address = address.lower()
    # Replace punctuation with spaces so "St." and "St" compare equal
    address = re.sub(r'[^\w\s]', ' ', address)
    # Standardize common street-type words to their abbreviations
    address = re.sub(r'\bstreet\b', 'st', address)
    address = re.sub(r'\bavenue\b', 'ave', address)
    address = re.sub(r'\bboulevard\b', 'blvd', address)
    address = re.sub(r'\bdrive\b', 'dr', address)
    # Remove apartment/suite designators and the identifier that follows;
    # requiring whitespace avoids matching words such as "stevens" or "united"
    address = re.sub(r'\b(?:apartment|apt|suite|ste|unit)\s+\w+', '', address)
    return ' '.join(address.split())

def product_code_preprocess(code):
    """Preprocessing for product codes"""
    # Remove common separators but keep structure
    code = code.upper()
    code = re.sub(r'[-_\s]', '', code)
    return code

# Usage examples
addr1 = "123 Main Street, Apt 4B"
addr2 = "123 main st"
score = fuzz.ratio(addr1, addr2, processor=address_preprocess)
print(f"Address match: {score}")  # 100.0 (identical after normalization)

code1 = "ABC-123-XYZ"
code2 = "abc123xyz"
score = fuzz.ratio(code1, code2, processor=product_code_preprocess)
print(f"Product code match: {score}")  # 100.0 (perfect match)
```
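
When the list of abbreviations grows, a lookup table can be easier to maintain than one regular expression per word. The following is a minimal sketch under stated assumptions: the `ABBREVIATIONS` table and `abbreviate_address` are hypothetical, and the tokenizer only recognizes ASCII letters and digits.

```python
import re

# Hypothetical abbreviation table; extend it for your own data
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "drive": "dr",
    "road": "rd",
}

# ASCII-only tokens; punctuation and whitespace act as separators
_WORD = re.compile(r"[a-z0-9]+")


def abbreviate_address(address: str) -> str:
    """Lowercase, split into tokens, and map each token through the table."""
    tokens = _WORD.findall(address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(abbreviate_address("42 Sunset Boulevard"))  # 42 sunset blvd
print(abbreviate_address("42 sunset blvd."))      # 42 sunset blvd
```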

### Handling Non-ASCII Text

```python
import unicodedata
from rapidfuzz import fuzz, utils

def unicode_normalize_preprocess(text):
    """Handle accented characters and unicode normalization"""
    # Normalize unicode representation
    text = unicodedata.normalize('NFKD', text)
    # Remove combining characters (accents)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Apply standard preprocessing
    return utils.default_process(text)

# International text matching
text1 = "Café résumé naïve"
text2 = "cafe resume naive"
score = fuzz.ratio(text1, text2, processor=unicode_normalize_preprocess)
print(f"Unicode normalized: {score}")  # 100.0 (perfect match)
```
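
Accent stripping does not cover every case. Some letters change length when case-folded (German "ß"), and others have no decomposition at all ("ø"). The sketch below uses only the standard library to show both; `fold_text` is a hypothetical helper, and it could be passed as a `processor` like the function above.

```python
import unicodedata


def fold_text(text: str) -> str:
    """Build a case-insensitive, accent-insensitive comparison key."""
    # casefold() is a more aggressive lower(): "ß" becomes "ss"
    folded = text.casefold()
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.split())


print(fold_text("Straße"))    # strasse
print(fold_text("STRASSE"))   # strasse
print(fold_text("Ångström"))  # angstrom
print(fold_text("Søren"))     # søren ("ø" has no decomposition, so it is kept)
```

Letters such as "ø" need an explicit replacement table if they should match their ASCII look-alikes.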