# Utilities and Helpers

String preprocessing, validation functions, and utility classes for handling edge cases and optimizing fuzzy string matching operations.

## Capabilities

### String Processing and Normalization

Core string preprocessing function that normalizes strings for consistent fuzzy matching.

```python { .api }
def full_process(s: str, force_ascii: bool = False) -> str:
    """
    Process string by removing non-alphanumeric characters, trimming, and lowercasing.

    Processing steps:
    1. Convert to ASCII if force_ascii=True
    2. Replace non-letters and non-numbers with whitespace
    3. Convert to lowercase
    4. Strip leading and trailing whitespace

    Parameters:
        s: String to process
        force_ascii: Force conversion to ASCII, removing non-ASCII characters

    Returns:
        str: Processed string ready for fuzzy matching
    """
```

**Usage Example:**

```python
from fuzzywuzzy import utils

# Standard processing: each punctuation character becomes a space,
# so consecutive spaces can remain inside the result
processed = utils.full_process(" Hello, World! 123 ")
print(processed)  # "hello  world  123"

# With ASCII forcing: non-ASCII characters are dropped, not transliterated
processed = utils.full_process("Café naïve résumé", force_ascii=True)
print(processed)  # "caf nave rsum"

# Remove punctuation and normalize
processed = utils.full_process("user@example.com")
print(processed)  # "user example com"
```
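
The four processing steps can be sketched with the standard library alone. This is a minimal illustration and not fuzzywuzzy's own code: `normalize_for_matching` is a hypothetical name, and the choice of the `\W` pattern and of `encode("ascii", "ignore")` for the ASCII step are assumptions made for the sketch.

```python
import re

# Hypothetical stand-in for illustration; not part of fuzzywuzzy.
_NON_WORD = re.compile(r"\W")


def normalize_for_matching(s: str, force_ascii: bool = False) -> str:
    """Apply the documented steps: ASCII (optional), replace, lowercase, strip."""
    if force_ascii:
        # Drop characters that have no ASCII encoding (no transliteration).
        s = s.encode("ascii", "ignore").decode("ascii")
    # Every non-word character is replaced by exactly one space.
    s = _NON_WORD.sub(" ", s)
    return s.lower().strip()


print(normalize_for_matching("  Hello, World!  "))  # hello  world
print(normalize_for_matching("Caf\u00e9", force_ascii=True))  # caf
```

Note that `\W` treats the underscore as a word character, so under this assumption underscores would pass through unchanged.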

### String Validation

Validate that strings are suitable for fuzzy matching operations.

```python { .api }
def validate_string(s) -> bool:
    """
    Check that the input has a length and that the length is greater than 0.

    Parameters:
        s: Input to validate

    Returns:
        bool: True if len(s) > 0; False otherwise, including when len(s) raises TypeError
    """
```

**Usage Example:**

```python
from fuzzywuzzy import utils

print(utils.validate_string("hello"))  # True
print(utils.validate_string(""))       # False
print(utils.validate_string(None))     # False
print(utils.validate_string(123))      # False (TypeError)
```
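
The documented contract (True only when `len(s) > 0`, False when `len` raises `TypeError`) can be sketched as follows. `has_length` and `usable_candidates` are hypothetical names used for illustration, not fuzzywuzzy functions; the filter shows one way such a check might be applied before matching.

```python
def has_length(s) -> bool:
    """Return True only for objects with a length greater than zero."""
    try:
        return len(s) > 0
    except TypeError:
        # Objects without a length (None, numbers) are treated as invalid.
        return False


def usable_candidates(candidates):
    """Keep only the candidates that pass the length check."""
    return [c for c in candidates if has_length(c)]


print(usable_candidates(["apple", "", None, 42, "pear"]))  # ['apple', 'pear']
```

A check written this way accepts any non-empty sized object, not only strings, so callers that need strings specifically would have to test the type separately.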
73
74
### Type Consistency
75
76
Ensure both strings are the same type (str or unicode) for consistent comparison.
77
78
```python { .api }
79
def make_type_consistent(s1, s2):
80
"""
81
If both objects aren't either both string or unicode instances, force them to unicode.
82
83
Parameters:
84
s1: First string
85
s2: Second string
86
87
Returns:
88
tuple: (s1, s2) with consistent types
89
"""
90
```
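
On Python 3, where `str` and `unicode` refer to the same type, the rule described in the docstring reduces to "leave two strings alone, otherwise convert both with `str()`". The sketch below illustrates that reading; `coerce_to_common_type` is a hypothetical name and the Python 3 simplification is an assumption, not the library's code.

```python
def coerce_to_common_type(s1, s2):
    """Return both values unchanged if both are str, else convert both to str."""
    if isinstance(s1, str) and isinstance(s2, str):
        return s1, s2
    # Mixed or non-string inputs: fall back to their text representation.
    return str(s1), str(s2)


print(coerce_to_common_type("abc", "abd"))  # ('abc', 'abd')
print(coerce_to_common_type("abc", 123))    # ('abc', '123')
```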

### ASCII Conversion Functions

Functions for handling ASCII conversion and character filtering.

```python { .api }
def asciidammit(s) -> str:
    """
    Force string to ASCII by removing or converting non-ASCII characters.

    Parameters:
        s: String to convert

    Returns:
        str: ASCII-only version of input string
    """

def asciionly(s) -> str:
    """
    Remove non-ASCII characters from string.

    Parameters:
        s: String to filter

    Returns:
        str: String with non-ASCII characters removed
    """
```

**Usage Example:**

```python
from fuzzywuzzy import utils

# Force ASCII conversion: accented characters are dropped, not transliterated
ascii_str = utils.asciidammit("Café naïve résumé")
print(ascii_str)  # "Caf nave rsum"

# Remove characters in the 128-255 code point range
filtered = utils.asciionly("Señor Müller")
print(filtered)  # "Seor Mller"
```
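
Where strictly ASCII output is needed regardless of which code points appear in the input, a standard-library stand-in can be used. `strip_non_ascii` below is a hypothetical helper for illustration, not a fuzzywuzzy function; it drops every character that has no ASCII encoding and does not attempt transliteration.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


# Escapes are used so the input is unambiguous: U+00E9 is "e with acute".
print(strip_non_ascii("Caf\u00e9"))           # Caf
print(strip_non_ascii("Hello \u4e16\u754c"))  # "Hello " (trailing space kept)
```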

### Mathematical Utilities

Helper functions for numerical operations in fuzzy matching.

```python { .api }
def intr(n) -> int:
    """
    Return a correctly rounded integer.

    Parameters:
        n: Number to round

    Returns:
        int: Rounded integer value
    """
```

**Usage Example:**

```python
from fuzzywuzzy import utils

print(utils.intr(97.6))  # 98
print(utils.intr(97.4))  # 97
print(utils.intr(97.5))  # 98
```
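
How exact ties such as `x.5` are rounded depends on the rounding rule in use. The sketch below assumes a helper built on Python's built-in `round()` (an assumption; this document does not show how `intr` is implemented). On Python 3, `round()` sends ties to the nearest even integer, so not every `.5` value rounds up. `round_to_int` is a hypothetical name.

```python
def round_to_int(n) -> int:
    """Round to the nearest integer using Python's built-in round()."""
    return int(round(n))


print(round_to_int(97.6))  # 98
print(round_to_int(97.5))  # 98 (tie goes to the even neighbour)
print(round_to_int(96.5))  # 96 (tie goes to the even neighbour)
```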

## StringProcessor Class

Advanced string processing utilities with optimized methods.

```python { .api }
class StringProcessor:
    """
    String processing utilities class with efficient methods for
    text normalization and cleaning operations.
    """

    @classmethod
    def replace_non_letters_non_numbers_with_whitespace(cls, a_string: str) -> str:
        """
        Replace any sequence of non-letters and non-numbers with single whitespace.

        Parameters:
            a_string: String to process

        Returns:
            str: String with non-alphanumeric sequences replaced by spaces
        """

    @staticmethod
    def strip(s: str) -> str:
        """Remove leading and trailing whitespace."""

    @staticmethod
    def to_lower_case(s: str) -> str:
        """Convert string to lowercase."""

    @staticmethod
    def to_upper_case(s: str) -> str:
        """Convert string to uppercase."""
```

**Usage Example:**

```python
from fuzzywuzzy.string_processing import StringProcessor

# Advanced string processing: in released versions each non-alphanumeric
# character is replaced by its own space, so runs are not collapsed
text = "Hello!!! @#$ World??? 123"
processed = StringProcessor.replace_non_letters_non_numbers_with_whitespace(text)
print(processed)  # "Hello        World    123"

# Standard operations
lower_text = StringProcessor.to_lower_case("HELLO WORLD")
print(lower_text)  # "hello world"

stripped = StringProcessor.strip(" hello world ")
print(stripped)  # "hello world"
```
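
When the replaced text is shown to users or compared character by character, it can help to collapse runs of whitespace after the replacement step. The sketch below is a standard-library illustration of that idea; `replace_and_collapse` is a hypothetical helper, not a `StringProcessor` method, and the `\W` pattern is an assumption chosen for the sketch.

```python
import re


def replace_and_collapse(text: str) -> str:
    """Replace non-word characters with spaces, then collapse whitespace runs."""
    spaced = re.sub(r"\W", " ", text)
    # split() with no arguments discards empty fields, so runs collapse to one space.
    return " ".join(spaced.split())


print(replace_and_collapse("Hello!!! @#$ World??? 123"))  # Hello World 123
```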

## StringMatcher Class (Optional)

High-performance string matching class available when python-Levenshtein is installed.

```python { .api }
class StringMatcher:
    """
    A SequenceMatcher-like class built on top of Levenshtein distance calculations.
    Provides significant performance improvements when python-Levenshtein is available.

    This class provides a SequenceMatcher-compatible interface while using the
    highly optimized Levenshtein library for calculations.
    """

    def __init__(self, isjunk=None, seq1: str = '', seq2: str = ''):
        """
        Initialize StringMatcher with two sequences.

        Parameters:
            isjunk: Junk function (ignored, not implemented - will show warning)
            seq1: First string to compare (default: '')
            seq2: Second string to compare (default: '')
        """

    def set_seqs(self, seq1: str, seq2: str):
        """
        Set both sequences for comparison and reset cache.

        Parameters:
            seq1: First string to compare
            seq2: Second string to compare
        """

    def set_seq1(self, seq1: str):
        """
        Set first sequence and reset cache.

        Parameters:
            seq1: First string to compare
        """

    def set_seq2(self, seq2: str):
        """
        Set second sequence and reset cache.

        Parameters:
            seq2: Second string to compare
        """

    def ratio(self) -> float:
        """
        Get similarity ratio between sequences using Levenshtein calculation.

        Returns:
            float: Similarity ratio between 0.0 and 1.0
        """

    def quick_ratio(self) -> float:
        """
        Get quick similarity ratio (same as ratio() in this implementation).

        Returns:
            float: Similarity ratio between 0.0 and 1.0
        """

    def real_quick_ratio(self) -> float:
        """
        Get a very quick similarity estimate based on string lengths.

        Returns:
            float: Quick similarity estimate between 0.0 and 1.0
        """

    def distance(self) -> int:
        """
        Get Levenshtein distance between sequences.

        Returns:
            int: Edit distance (number of operations to transform seq1 to seq2)
        """

    def get_opcodes(self):
        """
        Get operation codes for sequence comparison.

        Returns:
            List of operation codes compatible with difflib.SequenceMatcher
        """

    def get_editops(self):
        """
        Get edit operations for transforming one sequence to another.

        Returns:
            List of edit operations (insertions, deletions, substitutions)
        """

    def get_matching_blocks(self):
        """
        Get matching blocks between sequences.

        Returns:
            List of matching blocks compatible with difflib.SequenceMatcher
        """
```

**Usage Example:**

```python
# Only available if python-Levenshtein is installed
try:
    from fuzzywuzzy.StringMatcher import StringMatcher

    matcher = StringMatcher(seq1="hello world", seq2="hallo world")
    ratio = matcher.ratio()
    print(f"Similarity: {ratio}")  # High-performance ratio calculation

    distance = matcher.distance()
    print(f"Edit distance: {distance}")  # Levenshtein distance

except ImportError:
    print("python-Levenshtein not installed, using standard algorithms")
```
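
When python-Levenshtein is not installed, the standard library's `difflib.SequenceMatcher` offers the ratio-style methods that the class above mirrors. The sketch below shows that fallback path using only `difflib`; `similarity_ratio` and `length_estimate` are hypothetical helper names for illustration. The values `difflib` returns are not guaranteed to equal those of a Levenshtein-based matcher.

```python
from difflib import SequenceMatcher


def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0.0, 1.0] computed by difflib (no junk heuristic)."""
    return SequenceMatcher(None, a, b).ratio()


def length_estimate(a: str, b: str) -> float:
    """Cheap upper bound on the ratio, based only on the two lengths."""
    return SequenceMatcher(None, a, b).real_quick_ratio()


print(similarity_ratio("hello world", "hallo world"))
print(length_estimate("abc", "abcdef"))
```

Because the length-based estimate never falls below the full ratio, it can serve as an inexpensive pre-filter before the more costly comparison.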

## Constants and Configuration

Internal constants used by fuzzywuzzy for compatibility and character handling.

```python { .api }
PY3: bool  # True if running Python 3, False for Python 2
bad_chars: str  # String of the 128 characters with code points 128-255, used for filtering
translation_table: dict  # Translation table for removing the bad_chars characters (Python 3 only)
unicode: type  # str type in Python 3, unicode type in Python 2
```

**Usage Example:**

```python
from fuzzywuzzy import utils

# Check Python version
if utils.PY3:
    print("Running on Python 3")
else:
    print("Running on Python 2")

# Access character filtering components
print(f"Bad chars string length: {len(utils.bad_chars)}")  # 128 characters
```
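
A filter string and translation table of the kind described above can be reconstructed with the standard library to see what `str.translate` removes. This is a sketch under the assumption that the table maps each code point in the 128-255 range to `None`; `high_chars` and `drop_table` are hypothetical names, not the library's constants.

```python
# Hypothetical reconstruction for illustration; names are not fuzzywuzzy's.
high_chars = "".join(chr(i) for i in range(128, 256))
drop_table = {ord(c): None for c in high_chars}

print(len(high_chars))  # 128

# U+00E9 (e with acute) is inside the range, so it is removed.
print("Caf\u00e9".translate(drop_table))  # Caf

# U+4E16 and U+754C lie above 255, so this table leaves them untouched.
print("\u4e16\u754c".translate(drop_table) == "\u4e16\u754c")  # True
```

A table limited to this range only filters Latin-1 supplement characters; text containing higher code points would need a different approach if pure ASCII output is required.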

## Decorator Functions (Internal)

These decorators are used internally by fuzzywuzzy but can be useful for custom scoring functions. They handle common edge cases in string comparison.

```python { .api }
def check_for_equivalence(func):
    """
    Decorator that returns 100 if both input strings are identical.

    This decorator checks if args[0] == args[1] and returns 100 (perfect match)
    if they are equal, otherwise calls the decorated function.

    Parameters:
        func: Function to decorate that takes two string arguments

    Returns:
        function: Decorated function that handles string equivalence
    """

def check_for_none(func):
    """
    Decorator that returns 0 if either input string is None.

    This decorator checks if args[0] or args[1] is None and returns 0
    (no match) if either is None, otherwise calls the decorated function.

    Parameters:
        func: Function to decorate that takes two string arguments

    Returns:
        function: Decorated function that handles None inputs
    """

def check_empty_string(func):
    """
    Decorator that returns 0 if either input string is empty.

    This decorator checks if len(args[0]) == 0 or len(args[1]) == 0 and
    returns 0 (no match) if either is empty, otherwise calls the decorated function.

    Parameters:
        func: Function to decorate that takes two string arguments

    Returns:
        function: Decorated function that handles empty string inputs
    """
```

**Usage Example:**

```python
from fuzzywuzzy import utils

@utils.check_for_none
@utils.check_for_equivalence
def custom_scorer(s1, s2):
    # Your custom scoring logic here
    return 50  # Example score

# Decorators handle edge cases automatically
print(custom_scorer("hello", "hello"))  # 100 (equivalence)
print(custom_scorer("hello", None))     # 0 (none check)
print(custom_scorer("hello", "world"))  # 50 (custom logic)
```
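
Guards of this kind can be written as ordinary decorators with `functools.wraps`. The sketch below is an illustration only: `return_zero_if_none`, `return_zero_if_empty`, and `length_similarity` are hypothetical names, not fuzzywuzzy's implementations. It also shows why stacking order matters: the None guard sits outermost so that `len()` is never called on `None`.

```python
import functools


def return_zero_if_none(func):
    """Return 0 when either of the first two positional arguments is None."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if args[0] is None or args[1] is None:
            return 0
        return func(*args, **kwargs)
    return wrapper


def return_zero_if_empty(func):
    """Return 0 when either of the first two positional arguments is empty."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if len(args[0]) == 0 or len(args[1]) == 0:
            return 0
        return func(*args, **kwargs)
    return wrapper


@return_zero_if_none      # outermost: runs first, protects the len() calls below
@return_zero_if_empty
def length_similarity(s1, s2):
    """Toy scorer: ratio of the shorter length to the longer, scaled to 0-100."""
    return round(100 * min(len(s1), len(s2)) / max(len(s1), len(s2)))


print(length_similarity(None, "abc"))  # 0
print(length_similarity("", "abc"))    # 0
print(length_similarity("ab", "abcd")) # 50
```

Guards written this way read their inputs from `args`, so the two strings must be passed positionally.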