# String Preprocessing

Utilities for normalizing and preprocessing strings before comparison. Proper string preprocessing can significantly improve matching accuracy by handling case differences, punctuation, and whitespace variations.

## Capabilities

### Default Processor

Standard preprocessing function that normalizes strings for comparison by converting to lowercase, replacing non-alphanumeric characters with spaces, and trimming leading and trailing whitespace.

```python { .api }
def default_process(sentence: str) -> str
```

**Parameters:**

- `sentence`: Input string to preprocess

**Returns:** Lowercased string in which every non-alphanumeric character has been replaced by a space and leading and trailing whitespace has been removed

**Processing Steps:**

1. Replaces each non-alphanumeric character with a space (letters and digits are kept, including accented letters such as "é")
2. Converts to lowercase
3. Trims leading and trailing whitespace

Runs of spaces inside the string are not collapsed, so a punctuation mark followed by a space leaves two spaces in the result.
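
As a rough mental model, the behavior can be approximated in pure Python. This is a minimal sketch, not the library's implementation: the name `approx_default_process` is hypothetical, and it assumes that non-alphanumeric characters are replaced by a space rather than deleted and that interior whitespace is left as-is.

```python
def approx_default_process(sentence: str) -> str:
    """Illustrative approximation; not the library's actual implementation."""
    # Replace every character that is not a letter or digit with a space
    spaced = "".join(ch if ch.isalnum() else " " for ch in sentence)
    # Lowercase, then trim leading and trailing whitespace
    return spaced.lower().strip()


print(repr(approx_default_process("Hello, World! 123")))  # 'hello  world  123'
```

The `repr` call makes the doubled spaces visible, which is easy to miss when the result is printed directly.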

**Usage Example:**

```python
from rapidfuzz import utils

# Basic preprocessing
text = "Hello, World! 123"
processed = utils.default_process(text)
print(processed)  # "hello  world  123" (each punctuation mark becomes a space)

# Handling punctuation and case
text = "Don't worry, BE HAPPY!!!"
processed = utils.default_process(text)
print(processed)  # "don t worry  be happy"

# Trimming leading and trailing whitespace
text = " multiple spaces "
processed = utils.default_process(text)
print(processed)  # "multiple spaces"

# Unicode and special characters
text = "Café & Restaurant — $50"
processed = utils.default_process(text)
print(processed)  # "café   restaurant    50" (accented letters are kept; symbols become spaces)
```

## Usage Patterns

### With Fuzz Functions

All fuzz functions accept an optional `processor` parameter:

```python
from rapidfuzz import fuzz, utils

s1 = "New York Jets"
s2 = "new york jets!!!"

# Without preprocessing
score1 = fuzz.ratio(s1, s2)
print(f"Raw: {score1:.1f}")  # Lower score due to case/punctuation differences

# With preprocessing
score2 = fuzz.ratio(s1, s2, processor=utils.default_process)
print(f"Processed: {score2:.1f}")  # 100.0 (perfect match after normalization)

# Works with all fuzz functions
score3 = fuzz.WRatio(s1, s2, processor=utils.default_process)
score4 = fuzz.token_sort_ratio(s1, s2, processor=utils.default_process)
score5 = fuzz.partial_ratio(s1, s2, processor=utils.default_process)
```

### With Process Functions

Process functions also support the `processor` parameter:

```python
from rapidfuzz import process, utils

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
query = "NEW YORK jets"

# Without preprocessing
match1 = process.extractOne(query, choices)
print(match1)  # Lower score due to case differences

# With preprocessing
match2 = process.extractOne(query, choices, processor=utils.default_process)
print(match2)  # Perfect match: ('New York Jets', 100.0, 1)

# Batch processing with preprocessing
matches = process.extract(query, choices,
                          processor=utils.default_process,
                          limit=3)
print(matches)  # All matches benefit from normalization
```

### With Distance Metrics

Distance metrics also support preprocessing:

```python
from rapidfuzz.distance import Levenshtein
from rapidfuzz import utils

s1 = "Testing"
s2 = "TESTING!!!"

# Raw distance
dist1 = Levenshtein.distance(s1, s2)
print(f"Raw distance: {dist1}")  # Higher due to case/punctuation

# With preprocessing
dist2 = Levenshtein.distance(s1, s2, processor=utils.default_process)
print(f"Processed distance: {dist2}")  # 0 (identical after preprocessing)
```

### Custom Preprocessing

You can create custom preprocessing functions for specific needs:

```python
import re
import unicodedata

from rapidfuzz import fuzz

def custom_preprocess(text):
    """Custom preprocessing for specific use case"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation; keep letters, digits, underscores, and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

def phone_preprocess(phone):
    """Preprocessing for phone numbers"""
    # Remove all non-digits
    return re.sub(r'[^\d]', '', phone)

# Use custom preprocessor
score = fuzz.ratio("Call (555) 123-4567", "555.123.4567",
                   processor=phone_preprocess)
print(score)  # 100.0 (identical after removing formatting)

def name_preprocess(name):
    """Preprocessing for person names"""
    # Normalize unicode (handle accents)
    name = unicodedata.normalize('NFKD', name)
    name = ''.join(c for c in name if not unicodedata.combining(c))
    # Convert to lowercase and remove punctuation
    name = re.sub(r'[^\w\s]', '', name.lower())
    # Drop common suffixes as whole words only, so names that merely
    # contain "sr" or "ii" are left intact
    name = re.sub(r'\b(?:jr|sr|iii|ii)\b', '', name)
    return ' '.join(name.split())

# Better name matching
score = fuzz.ratio("José Martinez Jr.", "jose martinez",
                   processor=name_preprocess)
print(score)  # 100.0 (accent and suffix are normalized away)
```
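
One pitfall when stripping name suffixes such as "jr" with plain `str.replace` is that it also deletes those letters inside longer words. The sketch below contrasts substring replacement with a word-boundary regex; both helper names are illustrative and not part of rapidfuzz.

```python
import re


def strip_suffixes_naive(name: str) -> str:
    # Substring replacement: also removes "sr" from inside "sriram"
    return " ".join(name.replace("jr", "").replace("sr", "").split())


def strip_suffixes_whole_word(name: str) -> str:
    # \b anchors restrict the match to standalone tokens
    return " ".join(re.sub(r"\b(?:jr|sr|iii|ii)\b", "", name).split())


print(strip_suffixes_naive("sriram jr"))       # iram
print(strip_suffixes_whole_word("sriram jr"))  # sriram
```

Both functions expect input that is already lowercased and free of punctuation, as in the name processor above.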

### When to Use Preprocessing

**Use preprocessing when:**

- Case differences are not meaningful ("Hello" vs "HELLO")
- Punctuation doesn't affect meaning ("don't" vs "dont")
- Whitespace variations occur ("New York" vs " New York ")
- Special characters are formatting-only ("$100" vs "100")
- Unicode normalization is needed ("café" vs "cafe"); this requires a custom processor, because `default_process` keeps accented letters

**Don't use preprocessing when:**

- Case is semantically important (DNA sequences, product codes)
- Punctuation changes meaning ("can't" vs "cant", URLs)
- Exact character matching is required (passwords, checksums)
- Performance is critical and the strings are already normalized
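
When case or punctuation carries meaning but stray whitespace does not, a lighter processor may be a better fit than skipping preprocessing altogether. The following is a minimal sketch; `whitespace_only_process` is a hypothetical helper, and like any `str -> str` callable it could be passed as the `processor` argument.

```python
def whitespace_only_process(text: str) -> str:
    """Collapse runs of whitespace and trim the ends; leave case and punctuation alone."""
    return " ".join(text.split())


print(whitespace_only_process("  AbC-123   rev.B "))  # AbC-123 rev.B
```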

### Performance Considerations

```python
from rapidfuzz import process, utils

choices = ["..."] * 10000  # Large choice list

# Passing a processor preprocesses every choice on each call
query = "test query"
match = process.extractOne(query, choices, processor=utils.default_process)

# For repeated queries, preprocess choices once
processed_choices = [utils.default_process(choice) for choice in choices]
processed_query = utils.default_process(query)

# Now use without processor (already processed).
# The returned string is the processed choice; the returned index
# still refers to the same position in the original `choices` list.
match = process.extractOne(processed_query, processed_choices)

# Or use a custom processor that caches results
preprocessing_cache = {}

def cached_preprocess(text):
    if text not in preprocessing_cache:
        preprocessing_cache[text] = utils.default_process(text)
    return preprocessing_cache[text]

match = process.extractOne(query, choices, processor=cached_preprocess)
```
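
A plain dictionary cache grows without bound. As an alternative, `functools.lru_cache` from the standard library can wrap any processor with a size-limited cache. This is a sketch under assumptions: `make_cached_processor` is a hypothetical helper, the `maxsize` default is arbitrary, and `str.lower` stands in for `utils.default_process` so the example runs without rapidfuzz installed.

```python
from functools import lru_cache
from typing import Callable


def make_cached_processor(processor: Callable[[str], str],
                          maxsize: int = 10_000) -> Callable[[str], str]:
    """Wrap a str -> str processor in a bounded least-recently-used cache."""
    return lru_cache(maxsize=maxsize)(processor)


# str.lower is a stand-in for utils.default_process
cached_lower = make_cached_processor(str.lower)
cached_lower("New York Jets")
cached_lower("New York Jets")  # served from the cache
print(cached_lower.cache_info().hits)  # 1
```

Caching only pays off when the same strings are processed repeatedly, for example when many queries run against the same choice list.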

## Advanced Preprocessing Patterns

### Domain-Specific Preprocessing

```python
import re

from rapidfuzz import fuzz

def address_preprocess(address):
    """Preprocessing for street addresses"""
    address = address.lower()
    # Replace punctuation with spaces so "st." and "street," normalize the same way
    address = re.sub(r'[^\w\s]', ' ', address)
    # Standardize common abbreviations
    address = re.sub(r'\b(?:street|st)\b', 'st', address)
    address = re.sub(r'\b(?:avenue|ave)\b', 'ave', address)
    address = re.sub(r'\b(?:boulevard|blvd)\b', 'blvd', address)
    address = re.sub(r'\b(?:drive|dr)\b', 'dr', address)
    # Remove apartment/suite numbers; the \b after the keyword keeps
    # words such as "united" intact
    address = re.sub(r'\b(?:apt|apartment|suite|ste|unit)\b\s*\w+', '', address)
    return ' '.join(address.split())

def product_code_preprocess(code):
    """Preprocessing for product codes"""
    # Remove common separators but keep structure
    code = code.upper()
    code = re.sub(r'[-_\s]', '', code)
    return code

# Usage examples
addr1 = "123 Main Street, Apt 4B"
addr2 = "123 main st"
score = fuzz.ratio(addr1, addr2, processor=address_preprocess)
print(f"Address match: {score}")  # 100.0 (identical after normalization)

code1 = "ABC-123-XYZ"
code2 = "abc123xyz"
score = fuzz.ratio(code1, code2, processor=product_code_preprocess)
print(f"Product code match: {score}")  # 100.0 (perfect match)
```
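
Small domain-specific processors like the ones above can also be chained instead of being merged into one large function. The helper below is a hypothetical sketch (`compose_processors` is not part of rapidfuzz); it returns a single callable that applies each step from left to right.

```python
from typing import Callable

Processor = Callable[[str], str]


def compose_processors(*processors: Processor) -> Processor:
    """Return a processor that applies each given processor left to right."""
    def composed(text: str) -> str:
        for processor in processors:
            text = processor(text)
        return text
    return composed


strip_and_lower = compose_processors(str.strip, str.lower)
print(strip_and_lower("  ABC-123  "))  # abc-123
```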

### Handling Non-ASCII Text

```python
import unicodedata

from rapidfuzz import fuzz, utils

def unicode_normalize_preprocess(text):
    """Handle accented characters and unicode normalization"""
    # Normalize unicode representation
    text = unicodedata.normalize('NFKD', text)
    # Remove combining characters (accents)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Apply standard preprocessing
    return utils.default_process(text)

# International text matching
text1 = "Café résumé naïve"
text2 = "cafe resume naive"
score = fuzz.ratio(text1, text2, processor=unicode_normalize_preprocess)
print(f"Unicode normalized: {score}")  # 100.0 (perfect match)
```
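
Lowercasing alone does not equate every case variant: the German "ß", for example, has no single-character uppercase form in common use and is often written "SS". Python's `str.casefold` covers such cases. The sketch below uses only the standard library; `fold_text` is a hypothetical helper name.

```python
import unicodedata


def fold_text(text: str) -> str:
    """Strip accents, then apply Unicode case folding."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


print(fold_text("Straße"))  # strasse
print(fold_text("CAFÉ"))    # cafe
```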