0
# String Distance and Similarity
1
2
Functions for computing various string distance metrics and similarity scores. These functions provide the core string comparison capabilities including Levenshtein distance, normalized similarity ratios, Hamming distance, and Jaro/Jaro-Winkler similarities.
3
4
## Capabilities
5
6
### Levenshtein Distance
7
8
Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into another according to Levenshtein with custom costs for insertion, deletion and substitution.
9
10
```python { .api }
11
def distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None):
12
"""
13
Calculate Levenshtein distance with custom operation weights.
14
15
Parameters:
16
- s1: First string to compare
17
- s2: Second string to compare
18
- weights: Tuple of (insertion, deletion, substitution) weights. Default (1, 1, 1)
19
- processor: Optional callable to preprocess strings before comparison
20
- score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded
21
- score_hint: Expected distance hint for algorithm optimization
22
23
Returns:
24
int: Distance between s1 and s2
25
26
Raises:
27
ValueError: If unsupported weights are provided
28
"""
29
```
30
31
**Usage Examples:**
32
33
```python
34
import Levenshtein
35
36
# Basic distance calculation
37
dist = Levenshtein.distance("kitten", "sitting")
38
print(dist) # 3
39
40
# Custom weights (insertion, deletion, substitution)
41
dist = Levenshtein.distance("kitten", "sitting", weights=(1, 1, 2))
42
print(dist) # 4 (substitutions cost more)
43
44
# With score cutoff for performance
45
dist = Levenshtein.distance("very long string", "another long string", score_cutoff=5)
46
print(dist) # Returns actual distance or 6 if > 5
47
```
48
49
### Similarity Ratio
50
51
Calculates a normalized indel similarity in the range [0, 1]. The indel distance calculates the minimum number of insertions and deletions required to change one sequence into the other.
52
53
```python { .api }
54
def ratio(s1, s2, *, processor=None, score_cutoff=None):
55
"""
56
Calculate normalized indel similarity ratio [0, 1].
57
58
This is calculated as 1 - (distance / (len1 + len2))
59
60
Parameters:
61
- s1: First string to compare
62
- s2: Second string to compare
63
- processor: Optional callable to preprocess strings before comparison
64
- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
65
66
Returns:
67
float: Normalized similarity between s1 and s2 as a float between 0 and 1.0
68
"""
69
```
70
71
**Usage Examples:**
72
73
```python
74
import Levenshtein
75
76
# Basic similarity ratio
77
ratio = Levenshtein.ratio("kitten", "sitting")
78
print(f"{ratio:.3f}") # 0.615
79
80
# With score cutoff
81
ratio = Levenshtein.ratio("hello", "world", score_cutoff=0.5)
82
print(ratio) # 0.0 (similarity is below 0.5)
83
84
# With custom processor
85
def preprocess(s):
86
return s.lower().strip()
87
88
ratio = Levenshtein.ratio(" Hello ", "HELLO", processor=preprocess)
89
print(ratio) # 1.0 (identical after processing)
90
```
91
92
### Hamming Distance
93
94
Calculates the Hamming distance between two strings. The hamming distance is defined as the number of positions where the two strings differ. It describes the minimum amount of substitutions required to transform s1 into s2.
95
96
```python { .api }
97
def hamming(s1, s2, *, pad=True, processor=None, score_cutoff=None):
98
"""
99
Calculate Hamming distance (substitutions only).
100
101
Parameters:
102
- s1: First string to compare
103
- s2: Second string to compare
104
- pad: Should strings be padded if there is a length difference
105
- processor: Optional callable to preprocess strings before comparison
106
- score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded
107
108
Returns:
109
int: Hamming distance between s1 and s2
110
111
Raises:
112
ValueError: If s1 and s2 have different length and pad=False
113
"""
114
```
115
116
**Usage Examples:**
117
118
```python
119
import Levenshtein
120
121
# Same length strings
122
dist = Levenshtein.hamming("karolin", "kathrin")
123
print(dist) # 3
124
125
# Different length strings with padding (default)
126
dist = Levenshtein.hamming("karolin", "kath")
127
print(dist) # 7 (3 substitutions + 4 for length difference)
128
129
# Different length strings without padding raises error
130
try:
131
dist = Levenshtein.hamming("karolin", "kath", pad=False)
132
except ValueError as e:
133
print("Length mismatch error")
134
```
135
136
### Jaro Similarity
137
138
Calculates the Jaro similarity between two strings. Jaro similarity is particularly effective for comparing names and short strings with character transpositions.
139
140
```python { .api }
141
def jaro(s1, s2, *, processor=None, score_cutoff=None):
142
"""
143
Calculate Jaro similarity.
144
145
Parameters:
146
- s1: First string to compare
147
- s2: Second string to compare
148
- processor: Optional callable to preprocess strings before comparison
149
- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
150
151
Returns:
152
float: Jaro similarity between s1 and s2 as a float between 0 and 1.0
153
"""
154
```
155
156
**Usage Examples:**
157
158
```python
159
import Levenshtein
160
161
# Basic Jaro similarity
162
sim = Levenshtein.jaro("martha", "marhta")
163
print(f"{sim:.3f}") # 0.944
164
165
# Names comparison
166
sim = Levenshtein.jaro("DIXON", "DICKSONX")
167
print(f"{sim:.3f}") # 0.767
168
```
169
170
### Jaro-Winkler Similarity
171
172
Calculates the Jaro-Winkler similarity, which gives more favorable ratings to strings with common prefixes. This is particularly useful for comparing names and addresses.
173
174
```python { .api }
175
def jaro_winkler(s1, s2, *, prefix_weight=0.1, processor=None, score_cutoff=None):
176
"""
177
Calculate Jaro-Winkler similarity with prefix weighting.
178
179
Parameters:
180
- s1: First string to compare
181
- s2: Second string to compare
182
- prefix_weight: Weight used for common prefix (0 to 0.25). Default 0.1
183
- processor: Optional callable to preprocess strings before comparison
184
- score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
185
186
Returns:
187
float: Jaro-Winkler similarity between s1 and s2 as a float between 0 and 1.0
188
189
Raises:
190
ValueError: If prefix_weight is not between 0 and 0.25
191
"""
192
```
193
194
**Usage Examples:**
195
196
```python
197
import Levenshtein
198
199
# Basic Jaro-Winkler similarity
200
sim = Levenshtein.jaro_winkler("martha", "marhta")
201
print(f"{sim:.3f}") # 0.961 (higher than Jaro due to common prefix "mar")
202
203
# Custom prefix weight
204
sim = Levenshtein.jaro_winkler("prefix_test", "prefix_demo", prefix_weight=0.2)
205
print(f"{sim:.3f}") # Higher weight for common prefix
206
207
# Names with common prefix
208
sim = Levenshtein.jaro_winkler("JOHNSON", "JOHNSTON")
209
print(f"{sim:.3f}") # 0.957
210
```
211
212
## Common Usage Patterns
213
214
### Fuzzy String Matching
215
216
```python
217
import Levenshtein
218
219
def fuzzy_match(target, candidates, threshold=0.8):
220
"""Find best matches from candidates list."""
221
matches = []
222
for candidate in candidates:
223
ratio = Levenshtein.ratio(target, candidate)
224
if ratio >= threshold:
225
matches.append((candidate, ratio))
226
return sorted(matches, key=lambda x: x[1], reverse=True)
227
228
# Example usage
229
candidates = ["apple", "application", "apply", "april", "ample"]
230
matches = fuzzy_match("appl", candidates, threshold=0.6)
231
print(matches) # [('apply', 0.8), ('apple', 0.8), ('ample', 0.8)]
232
```
233
234
### Spell Checking
235
236
```python
237
import Levenshtein
238
239
def suggest_corrections(word, dictionary, max_distance=2, max_suggestions=5):
240
"""Suggest spelling corrections."""
241
suggestions = []
242
for dict_word in dictionary:
243
dist = Levenshtein.distance(word, dict_word)
244
if dist <= max_distance:
245
ratio = Levenshtein.ratio(word, dict_word)
246
suggestions.append((dict_word, dist, ratio))
247
248
# Sort by distance (ascending) then ratio (descending)
249
suggestions.sort(key=lambda x: (x[1], -x[2]))
250
return [word for word, _, _ in suggestions[:max_suggestions]]
251
```
252
253
### Performance Optimization
254
255
```python
256
import Levenshtein
257
258
# Use score_cutoff for early termination
259
def fast_filter(query, candidates, max_distance=3):
260
"""Quickly filter candidates by maximum distance."""
261
results = []
262
for candidate in candidates:
263
dist = Levenshtein.distance(query, candidate, score_cutoff=max_distance)
264
if dist <= max_distance:
265
results.append((candidate, dist))
266
return results
267
268
# Use score_hint when you have an expected distance
269
def optimized_distance(s1, s2, expected_dist=None):
270
"""Calculate distance with optimization hint."""
271
return Levenshtein.distance(s1, s2, score_hint=expected_dist)
272
```