# String Utilities

Utility functions for string preprocessing and normalization. These functions prepare strings for fuzzy matching by cleaning and standardizing their format.

## Capabilities

### Full String Processing

Preprocesses a string so that differences in case, punctuation, and surrounding whitespace do not affect fuzzy-match scores.

```python { .api }
def full_process(s: str, force_ascii: bool = False) -> str:
    """
    Process string for fuzzy matching by normalizing format.

    Processing steps:
    1. Convert to string if not already
    2. Optionally remove non-ASCII characters (code points 128-255)
    3. Replace all non-alphanumeric characters with spaces
    4. Trim leading/trailing whitespace
    5. Convert to lowercase

    Internal whitespace is not collapsed, so replaced punctuation
    can leave consecutive spaces inside the result.

    Args:
        s: String to process
        force_ascii: If True, remove characters with code points 128-255
            (they are dropped, not transliterated)

    Returns:
        str: Processed and normalized string
    """
```
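
The processing steps can be approximated with the standard library alone. The function below, `full_process_sketch`, is a hypothetical stand-in written for illustration and is not the library's own code. It assumes that "non-alphanumeric" means anything matched by the regex `\W`, and it does not collapse internal whitespace.

```python
import re

# Hypothetical helpers for illustration only; not part of thefuzz.
_NON_ALNUM = re.compile(r"\W")  # on str, \W is Unicode-aware and does not match "_"
_HIGH_LATIN1 = {code_point: None for code_point in range(128, 256)}


def full_process_sketch(s, force_ascii=False):
    """Approximate the documented steps: to str, optional strip, replace, trim, lowercase."""
    s = str(s)
    if force_ascii:
        # Drop code points 128-255; nothing is transliterated.
        s = s.translate(_HIGH_LATIN1)
    # Each non-alphanumeric character becomes exactly one space.
    s = _NON_ALNUM.sub(" ", s)
    return s.strip().lower()


print(full_process_sketch(" Hello, World! "))           # hello  world (two spaces)
print(full_process_sketch("user@email.com"))            # user email com
print(full_process_sketch("Café", force_ascii=True))    # caf
```

If single spaces are needed, `" ".join(result.split())` collapses the remaining runs.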

### ASCII Conversion

Strips characters with code points 128-255 from a string, which removes the accented letters common in Western European text. Characters are dropped rather than replaced with ASCII equivalents.

```python { .api }
def ascii_only(s: str) -> str:
    """
    Strip characters with code points 128-255 from a string.

    This removes Latin-1 accented characters such as "é" and "ü".
    Characters above code point 255 (emoji, for example) fall outside
    the translation table's range and are left in place.

    Args:
        s: String to convert

    Returns:
        str: Copy of the string with characters 128-255 removed
    """
```
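
A minimal sketch of how a table that removes code points 128-255 behaves with `str.translate`. The names `table` and `ascii_only_sketch` are hypothetical and only mirror the description above; the point is that characters beyond 255 pass through untouched.

```python
# Hypothetical reconstruction: map code points 128-255 to None so that
# str.translate deletes them. Code points outside the table are kept.
table = {code_point: None for code_point in range(128, 256)}


def ascii_only_sketch(s):
    return s.translate(table)


print(ascii_only_sketch("Café Münchën"))  # Caf Mnchn
# "ó" (243) is removed, but "Ł" (321) and "ź" (378) are above 255 and stay.
print(ascii_only_sketch("Łódź"))          # Łdź
```

For input that may contain characters above 255, `s.encode("ascii", "ignore").decode("ascii")` is one way to guarantee a pure-ASCII result.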

### Module Constants

```python { .api }
# Translation table for ASCII conversion (removes chars 128-255)
translation_table: dict
```

## Usage Examples

### Basic String Processing

```python
from thefuzz import utils

# Standard text normalization
text = " Hello, World! "
processed = utils.full_process(text)
print(processed)  # "hello  world" (the comma becomes a space; runs are not collapsed)

# Handle special characters
text = "New York Mets vs. Atlanta Braves"
processed = utils.full_process(text)
print(processed)  # "new york mets vs  atlanta braves"
```

### ASCII Conversion

```python
from thefuzz import utils

# Strip accented characters (they are removed, not transliterated)
text = "Café Münchën"
ascii_text = utils.ascii_only(text)
print(ascii_text)  # "Caf Mnchn"

# Full processing with ASCII conversion
processed = utils.full_process("Café Münchën", force_ascii=True)
print(processed)  # "caf mnchn"
```
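
Because accented characters are dropped rather than mapped to base letters, "Café" and "Cafe" end up as different strings after ASCII conversion. When base letters should survive, one option outside this module is Unicode decomposition from the standard library. The helper name `to_ascii_equivalents` is hypothetical, and this is a sketch of an alternative, not something thefuzz provides.

```python
import unicodedata


def to_ascii_equivalents(s):
    # NFKD splits "é" into "e" followed by a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", s)
    # Encoding with errors="ignore" drops the combining marks and any
    # character that has no ASCII decomposition (for example "ß").
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_equivalents("Café Münchën"))  # Cafe Munchen
```

The result can then be passed to `utils.full_process` as usual.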

### Integration with Fuzzy Matching

```python
from thefuzz import fuzz, utils

# Manual preprocessing before comparison
s1 = utils.full_process("New York Mets!")
s2 = utils.full_process("new york mets")
score = fuzz.ratio(s1, s2)
print(score)  # 100 (perfect match after processing)

# Compare with and without processing
raw_score = fuzz.ratio("New York Mets!", "new york mets")
processed_score = fuzz.ratio(
    utils.full_process("New York Mets!"),
    utils.full_process("new york mets"),
)
print(f"Raw: {raw_score}, Processed: {processed_score}")
```
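
The effect of preprocessing on a similarity score can also be seen without the library installed. The sketch below uses `difflib.SequenceMatcher` as a stand-in scorer and a hypothetical `normalize` helper in place of `full_process`; its numbers are not guaranteed to equal those of `fuzz.ratio`, only to show the same trend.

```python
import re
from difflib import SequenceMatcher


def normalize(s):
    # Stand-in for full_process: non-alphanumerics to spaces, trim, lowercase.
    return re.sub(r"\W", " ", s).strip().lower()


def similarity(a, b):
    # Stand-in scorer on a 0-100 scale (not the algorithm thefuzz uses).
    return round(100 * SequenceMatcher(None, a, b).ratio())


raw = similarity("New York Mets!", "new york mets")
processed = similarity(normalize("New York Mets!"), normalize("new york mets"))

# Case and the trailing "!" lower the raw score; after normalization
# both inputs are the same string, so the processed score is 100.
print(f"Raw: {raw}, Processed: {processed}")
```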

### Custom Processing Pipeline

```python
from thefuzz import process, utils

def custom_processor(text):
    """Custom processing for specific use case."""
    # First apply standard processing
    processed = utils.full_process(text, force_ascii=True)

    # Add custom logic
    # Remove common stop words, normalize abbreviations, etc.
    replacements = {
        "street": "st",
        "avenue": "ave",
        "boulevard": "blvd",
    }

    # Note: str.replace matches substrings, not whole words
    for old, new in replacements.items():
        processed = processed.replace(old, new)

    return processed

# Use with fuzzy matching
addresses = ["123 Main Street", "456 Oak Avenue", "789 First Boulevard"]
result = process.extractOne("main st", addresses, processor=custom_processor)
```
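
Substring replacement also rewrites longer words that merely contain a key, so "streetcar" would become "stcar". A word-level variant avoids this. The sketch below assumes its input has already been lowercased and stripped of punctuation; `ABBREVIATIONS` and `abbreviate_words` are hypothetical names.

```python
# Hypothetical abbreviation table, matching the replacements used above.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def abbreviate_words(processed):
    # split() with no arguments breaks on any whitespace run, so the
    # rejoined string also has single spaces between words.
    words = processed.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(abbreviate_words("123 main street"))    # 123 main st
print(abbreviate_words("12 streetcar lane"))  # 12 streetcar lane
```

Inside a custom processor, this function could replace the `str.replace` loop.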

### Performance Considerations

```python
from thefuzz import utils

# For batch processing, consider preprocessing once
texts = ["Text 1", "Text 2", "Text 3"]  # ... and any further entries
processed_texts = [utils.full_process(text) for text in texts]

# Then use the processed texts for multiple comparisons
# This avoids repeated preprocessing in fuzzy matching functions
```
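
Once strings are preprocessed ahead of time, a match on a processed string has to be traced back to the original text. One way to do that is an index from processed form to originals. In this sketch `normalize` is a hypothetical stand-in for `utils.full_process`, and `build_index` is an illustrative helper rather than a library function.

```python
import re
from collections import defaultdict


def normalize(s):
    # Hypothetical stand-in for the library's processor.
    return re.sub(r"\W", " ", str(s)).strip().lower()


def build_index(texts, processor=normalize):
    """Map each processed form to the original strings that produce it."""
    index = defaultdict(list)
    for text in texts:
        index[processor(text)].append(text)
    return dict(index)


index = build_index(["Text 1", "TEXT 1!", "Text 2"])
print(index)  # {'text 1': ['Text 1', 'TEXT 1!'], 'text 2': ['Text 2']}
```

The keys of `index` are the deduplicated processed strings to compare against, and each value lists the originals to report.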

## Processing Behavior

### Character Handling

- **Alphanumeric**: Preserved (letters and numbers)
- **Whitespace**: Trimmed at both ends; internal runs are left as they are, so replaced punctuation can produce consecutive spaces
- **Punctuation**: Removed (replaced with spaces)
- **Accented characters**: Kept by default; removed, not transliterated, when `force_ascii=True` (code points 128-255 only)
- **Case**: Converted to lowercase

### Examples of Processing Results

```python
from thefuzz import utils

examples = [
    "Hello, World!",          # → "hello  world"
    " Multiple Spaces ",      # → "multiple spaces"
    "New York Mets vs. ATL",  # → "new york mets vs  atl"
    "Café Münchën",           # → "café münchën" (or "caf mnchn" with force_ascii=True)
    "user@email.com",         # → "user email com"
    "1st & 2nd Avenue",       # → "1st   2nd avenue"
]

for text in examples:
    processed = utils.full_process(text)
    processed_ascii = utils.full_process(text, force_ascii=True)
    print(f"'{text}' → '{processed}' → '{processed_ascii}'")
```