0
# Edit Distance Algorithms
1
2
Classic edit distance algorithms that measure the minimum number of character operations (insertions, deletions, substitutions, and sometimes transpositions) needed to transform one string into another. These algorithms are fundamental to many text processing and comparison tasks.
3
4
## Capabilities
5
6
### Levenshtein Distance
7
8
The classic Levenshtein distance algorithm computes the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. This is a metric distance that satisfies the triangle inequality.
9
10
```python { .api }
11
class Levenshtein(MetricStringDistance):
12
def distance(self, s0: str, s1: str) -> float:
13
"""
14
Calculate Levenshtein distance between two strings.
15
16
Args:
17
s0: First string
18
s1: Second string
19
20
Returns:
21
float: Number of edit operations (minimum 0, no maximum limit)
22
23
Raises:
24
TypeError: If either string is None
25
"""
26
```
27
28
**Usage Example:**
29
30
```python
31
from similarity.levenshtein import Levenshtein
32
33
lev = Levenshtein()
34
distance = lev.distance('kitten', 'sitting') # Returns: 3.0
35
distance = lev.distance('hello', 'hello') # Returns: 0.0
36
```
37
38
### Normalized Levenshtein
39
40
A normalized version of Levenshtein distance that returns values in the range [0.0, 1.0] by dividing the edit distance by the length of the longest string. Also provides similarity as 1 - normalized distance.
41
42
```python { .api }
43
class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
44
def __init__(self): ...
45
46
def distance(self, s0: str, s1: str) -> float:
47
"""
48
Calculate normalized Levenshtein distance (0.0-1.0).
49
50
Args:
51
s0: First string
52
s1: Second string
53
54
Returns:
55
float: Normalized distance where 0.0 = identical, 1.0 = completely different
56
57
Raises:
58
TypeError: If either string is None
59
"""
60
61
def similarity(self, s0: str, s1: str) -> float:
62
"""
63
Calculate normalized Levenshtein similarity (0.0-1.0).
64
65
Args:
66
s0: First string
67
s1: Second string
68
69
Returns:
70
float: Normalized similarity where 1.0 = identical, 0.0 = completely different
71
72
Raises:
73
TypeError: If either string is None
74
"""
75
```
76
77
**Usage Example:**
78
79
```python
80
from similarity.normalized_levenshtein import NormalizedLevenshtein
81
82
norm_lev = NormalizedLevenshtein()
83
distance = norm_lev.distance('kitten', 'sitting') # Returns: ~0.428
84
similarity = norm_lev.similarity('kitten', 'sitting') # Returns: ~0.571
85
```
86
87
### Weighted Levenshtein
88
89
An implementation of Levenshtein distance that allows custom weights for different character operations. Commonly used for OCR applications where certain character substitutions are more likely, or keyboard typing auto-correction where adjacent keys have lower substitution costs.
90
91
```python { .api }
92
class WeightedLevenshtein(StringDistance):
93
def __init__(self, character_substitution: CharacterSubstitutionInterface,
94
character_ins_del: CharacterInsDelInterface = None):
95
"""
96
Initialize with custom cost interfaces.
97
98
Args:
99
character_substitution: Interface defining substitution costs
100
character_ins_del: Optional interface for insertion/deletion costs (default: 1.0)
101
"""
102
103
def distance(self, s0: str, s1: str) -> float:
104
"""
105
Calculate weighted Levenshtein distance with custom operation costs.
106
107
Args:
108
s0: First string
109
s1: Second string
110
111
Returns:
112
float: Weighted edit distance
113
114
Raises:
115
TypeError: If either string is None
116
"""
117
118
class CharacterSubstitutionInterface:
119
def cost(self, c0: str, c1: str) -> float:
120
"""
121
Return the cost of substituting character c0 with c1.
122
123
Args:
124
c0: Original character
125
c1: Replacement character
126
127
Returns:
128
float: Cost of substitution (typically 0.0-1.0 range)
129
"""
130
131
class CharacterInsDelInterface:
132
def deletion_cost(self, c: str) -> float:
133
"""
134
Return the cost of deleting character c.
135
136
Args:
137
c: Character to delete
138
139
Returns:
140
float: Cost of deletion
141
"""
142
143
def insertion_cost(self, c: str) -> float:
144
"""
145
Return the cost of inserting character c.
146
147
Args:
148
c: Character to insert
149
150
Returns:
151
float: Cost of insertion
152
"""
153
```
154
155
**Usage Example:**
156
157
```python
158
from similarity.weighted_levenshtein import WeightedLevenshtein, CharacterSubstitutionInterface
159
160
class OCRSubstitution(CharacterSubstitutionInterface):
161
def cost(self, c0, c1):
162
# Lower cost for visually similar characters
163
if (c0, c1) in [('o', '0'), ('l', '1'), ('S', '5')]:
164
return 0.3
165
return 1.0
166
167
weighted_lev = WeightedLevenshtein(OCRSubstitution())
168
distance = weighted_lev.distance('S0me1ext', 'Some1ext') # Lower cost due to custom weights
169
```
170
171
### Optimal String Alignment
172
173
A variant of Damerau-Levenshtein that computes edit distance with the restriction that no substring can be edited more than once. Supports insertions, deletions, substitutions, and adjacent character transpositions.
174
175
```python { .api }
176
class OptimalStringAlignment(StringDistance):
177
def distance(self, s0: str, s1: str) -> float:
178
"""
179
Calculate Optimal String Alignment distance.
180
181
Args:
182
s0: First string
183
s1: Second string
184
185
Returns:
186
float: OSA distance (minimum 0, no maximum limit)
187
188
Raises:
189
TypeError: If either string is None
190
"""
191
```
192
193
**Usage Example:**
194
195
```python
196
from similarity.optimal_string_alignment import OptimalStringAlignment
197
198
osa = OptimalStringAlignment()
199
distance = osa.distance('CA', 'ABC') # Returns: 3.0
200
distance = osa.distance('hello', 'ehllo') # Returns: 1.0 (transposition)
201
```