Tessl Tile for pypi/levenshtein@0.27.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

edit-operations.md index.md string-averaging.md string-distance.md

string-distance.mddocs/

0
# String Distance and Similarity
1

2
Functions for computing various string distance metrics and similarity scores. These functions provide the core string comparison capabilities including Levenshtein distance, normalized similarity ratios, Hamming distance, and Jaro/Jaro-Winkler similarities.
3

4
## Capabilities
5

6
### Levenshtein Distance
7

8
Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into another according to Levenshtein with custom costs for insertion, deletion and substitution.
9

10
```python { .api }
11
def distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None, score_hint=None):
12
    """
13
    Calculate Levenshtein distance with custom operation weights.
14
    
15
    Parameters:
16
    - s1: First string to compare
17
    - s2: Second string to compare  
18
    - weights: Tuple of (insertion, deletion, substitution) weights. Default (1, 1, 1)
19
    - processor: Optional callable to preprocess strings before comparison
20
    - score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded
21
    - score_hint: Expected distance hint for algorithm optimization
22
    
23
    Returns:
24
    int: Distance between s1 and s2
25
    
26
    Raises:
27
    ValueError: If unsupported weights are provided
28
    """
29
```
30

31
**Usage Examples:**
32

33
```python
34
import Levenshtein
35

36
# Basic distance calculation
37
dist = Levenshtein.distance("kitten", "sitting")
38
print(dist)  # 3
39

40
# Custom weights (insertion, deletion, substitution)
41
dist = Levenshtein.distance("kitten", "sitting", weights=(1, 1, 2))  
42
print(dist)  # 4 (substitutions cost more)
43

44
# With score cutoff for performance
45
dist = Levenshtein.distance("very long string", "another long string", score_cutoff=5)
46
print(dist)  # Returns actual distance or 6 if > 5
47
```
48

49
### Similarity Ratio
50

51
Calculates a normalized indel similarity in the range [0, 1]. The indel distance calculates the minimum number of insertions and deletions required to change one sequence into the other.
52

53
```python { .api }
54
def ratio(s1, s2, *, processor=None, score_cutoff=None):
55
    """
56
    Calculate normalized indel similarity ratio [0, 1].
57
    
58
    This is calculated as 1 - (distance / (len1 + len2))
59
    
60
    Parameters:
61
    - s1: First string to compare
62
    - s2: Second string to compare
63
    - processor: Optional callable to preprocess strings before comparison
64
    - score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
65
    
66
    Returns:
67
    float: Normalized similarity between s1 and s2 as a float between 0 and 1.0
68
    """
69
```
70

71
**Usage Examples:**
72

73
```python
74
import Levenshtein
75

76
# Basic similarity ratio
77
ratio = Levenshtein.ratio("kitten", "sitting")
78
print(f"{ratio:.3f}")  # 0.615
79

80
# With score cutoff
81
ratio = Levenshtein.ratio("hello", "world", score_cutoff=0.5)
82
print(ratio)  # 0.0 (similarity is below 0.5)
83

84
# With custom processor
85
def preprocess(s):
86
    return s.lower().strip()
87

88
ratio = Levenshtein.ratio("  Hello  ", "HELLO", processor=preprocess)
89
print(ratio)  # 1.0 (identical after processing)
90
```
91

92
### Hamming Distance
93

94
Calculates the Hamming distance between two strings. The hamming distance is defined as the number of positions where the two strings differ. It describes the minimum amount of substitutions required to transform s1 into s2.
95

96
```python { .api }
97
def hamming(s1, s2, *, pad=True, processor=None, score_cutoff=None):
98
    """
99
    Calculate Hamming distance (substitutions only).
100
    
101
    Parameters:
102
    - s1: First string to compare
103
    - s2: Second string to compare
104
    - pad: Should strings be padded if there is a length difference
105
    - processor: Optional callable to preprocess strings before comparison
106
    - score_cutoff: Maximum distance to consider. Returns cutoff + 1 if exceeded
107
    
108
    Returns:
109
    int: Hamming distance between s1 and s2
110
    
111
    Raises:
112
    ValueError: If s1 and s2 have different length and pad=False
113
    """
114
```
115

116
**Usage Examples:**
117

118
```python
119
import Levenshtein
120

121
# Same length strings
122
dist = Levenshtein.hamming("karolin", "kathrin")
123
print(dist)  # 3
124

125
# Different length strings with padding (default)
126
dist = Levenshtein.hamming("karolin", "kath") 
127
print(dist)  # 7 (3 substitutions + 4 for length difference)
128

129
# Different length strings without padding raises error
130
try:
131
    dist = Levenshtein.hamming("karolin", "kath", pad=False)
132
except ValueError as e:
133
    print("Length mismatch error")
134
```
135

136
### Jaro Similarity
137

138
Calculates the Jaro similarity between two strings. Jaro similarity is particularly effective for comparing names and short strings with character transpositions.
139

140
```python { .api }
141
def jaro(s1, s2, *, processor=None, score_cutoff=None):
142
    """
143
    Calculate Jaro similarity.
144
    
145
    Parameters:
146
    - s1: First string to compare
147
    - s2: Second string to compare
148
    - processor: Optional callable to preprocess strings before comparison
149
    - score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
150
    
151
    Returns:
152
    float: Jaro similarity between s1 and s2 as a float between 0 and 1.0
153
    """
154
```
155

156
**Usage Examples:**
157

158
```python
159
import Levenshtein
160

161
# Basic Jaro similarity
162
sim = Levenshtein.jaro("martha", "marhta")
163
print(f"{sim:.3f}")  # 0.944
164

165
# Names comparison
166
sim = Levenshtein.jaro("DIXON", "DICKSONX")  
167
print(f"{sim:.3f}")  # 0.767
168
```
169

170
### Jaro-Winkler Similarity
171

172
Calculates the Jaro-Winkler similarity, which gives more favorable ratings to strings with common prefixes. This is particularly useful for comparing names and addresses.
173

174
```python { .api }
175
def jaro_winkler(s1, s2, *, prefix_weight=0.1, processor=None, score_cutoff=None):
176
    """
177
    Calculate Jaro-Winkler similarity with prefix weighting.
178
    
179
    Parameters:
180
    - s1: First string to compare
181
    - s2: Second string to compare
182
    - prefix_weight: Weight used for common prefix (0 to 0.25). Default 0.1
183
    - processor: Optional callable to preprocess strings before comparison
184
    - score_cutoff: Minimum similarity threshold. Returns 0 if below threshold
185
    
186
    Returns:
187
    float: Jaro-Winkler similarity between s1 and s2 as a float between 0 and 1.0
188
    
189
    Raises:
190
    ValueError: If prefix_weight is not between 0 and 0.25
191
    """
192
```
193

194
**Usage Examples:**
195

196
```python
197
import Levenshtein
198

199
# Basic Jaro-Winkler similarity
200
sim = Levenshtein.jaro_winkler("martha", "marhta")  
201
print(f"{sim:.3f}")  # 0.961 (higher than Jaro due to common prefix "mar")
202

203
# Custom prefix weight
204
sim = Levenshtein.jaro_winkler("prefix_test", "prefix_demo", prefix_weight=0.2)
205
print(f"{sim:.3f}")  # Higher weight for common prefix
206

207
# Names with common prefix
208
sim = Levenshtein.jaro_winkler("JOHNSON", "JOHNSTON")
209
print(f"{sim:.3f}")  # 0.957
210
```
211

212
## Common Usage Patterns
213

214
### Fuzzy String Matching
215

216
```python
217
import Levenshtein
218

219
def fuzzy_match(target, candidates, threshold=0.8):
220
    """Find best matches from candidates list."""
221
    matches = []
222
    for candidate in candidates:
223
        ratio = Levenshtein.ratio(target, candidate)
224
        if ratio >= threshold:
225
            matches.append((candidate, ratio))
226
    return sorted(matches, key=lambda x: x[1], reverse=True)
227

228
# Example usage
229
candidates = ["apple", "application", "apply", "april", "ample"]
230
matches = fuzzy_match("appl", candidates, threshold=0.6)
231
print(matches)  # [('apply', 0.8), ('apple', 0.8), ('ample', 0.8)]
232
```
233

234
### Spell Checking
235

236
```python
237
import Levenshtein
238

239
def suggest_corrections(word, dictionary, max_distance=2, max_suggestions=5):
240
    """Suggest spelling corrections."""
241
    suggestions = []
242
    for dict_word in dictionary:
243
        dist = Levenshtein.distance(word, dict_word)
244
        if dist <= max_distance:
245
            ratio = Levenshtein.ratio(word, dict_word)
246
            suggestions.append((dict_word, dist, ratio))
247
    
248
    # Sort by distance (ascending) then ratio (descending)
249
    suggestions.sort(key=lambda x: (x[1], -x[2]))
250
    return [word for word, _, _ in suggestions[:max_suggestions]]
251
```
252

253
### Performance Optimization
254

255
```python
256
import Levenshtein
257

258
# Use score_cutoff for early termination
259
def fast_filter(query, candidates, max_distance=3):
260
    """Quickly filter candidates by maximum distance."""
261
    results = []
262
    for candidate in candidates:
263
        dist = Levenshtein.distance(query, candidate, score_cutoff=max_distance)
264
        if dist <= max_distance:
265
            results.append((candidate, dist))
266
    return results
267

268
# Use score_hint when you have an expected distance
269
def optimized_distance(s1, s2, expected_dist=None):
270
    """Calculate distance with optimization hint."""
271
    return Levenshtein.distance(s1, s2, score_hint=expected_dist)
272
```

Version

Tile

Files

string-distance.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

string-distance.mddocs/