# Phonetic and Record Linkage Algorithms

Algorithms designed for fuzzy matching, typo correction, and record linkage. They work best on short strings such as person names and are tuned to detect common typing errors, including transposed characters.

## Capabilities

### Jaro-Winkler Similarity

A string similarity metric developed for record linkage and duplicate detection, particularly effective for short strings such as person names. It starts from the Jaro similarity and adds a bonus for strings that share a common prefix, which makes it well suited to detecting typos in names and identifiers.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7):
        """
        Initialize Jaro-Winkler with a similarity threshold.

        Args:
            threshold: Jaro score above which the prefix bonus is applied (default: 0.7)
        """

    def get_threshold(self) -> float:
        """
        Get the current threshold value.

        Returns:
            float: Threshold value for prefix bonus application
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Similarity score in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler distance (1 - similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Distance score in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def matches(s0: str, s1: str) -> list:
        """
        Calculate detailed match statistics for the Jaro-Winkler algorithm.

        Args:
            s0: First string
            s1: Second string

        Returns:
            list: [matches, transpositions, prefix_length, max_length]
        """
```

**Usage Examples:**

```python
from similarity.jarowinkler import JaroWinkler

# Basic usage with default threshold
jw = JaroWinkler()
similarity = jw.similarity('Martha', 'Marhta')  # Returns: ~0.961 (high due to common prefix)
similarity = jw.similarity('Dixon', 'Dicksonx')  # Returns: ~0.813 (Jaro ~0.767 plus prefix bonus)
distance = jw.distance('Martha', 'Marhta')  # Returns: ~0.039

# Custom threshold: the prefix bonus is applied only when the Jaro score exceeds 0.8
jw_custom = JaroWinkler(threshold=0.8)
# The Jaro score for this pair (~0.917) is still above 0.8, so the bonus applies
similarity = jw_custom.similarity('John', 'Jon')

# Name matching use case
names = ['Smith', 'Smyth', 'Schmidt']
query = 'Smythe'
for name in names:
    score = jw.similarity(query, name)
    print(f"{query} vs {name}: {score:.3f}")
```
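
To make the threshold and the prefix bonus concrete, here is a minimal pure-Python sketch of the textbook Jaro-Winkler formula. It is an illustration under stated assumptions, not the library's implementation: the function names `jaro` and `jaro_winkler`, the scaling factor of 0.1, and the 4-character prefix cap follow the usual definition and may differ from what `JaroWinkler` does internally.

```python
def jaro(s0: str, s1: str) -> float:
    """Plain Jaro similarity in [0.0, 1.0] (illustrative helper)."""
    if s0 == s1:
        return 1.0
    if not s0 or not s1:
        return 0.0
    # Two equal characters count as a match only if they are close enough.
    window = max(max(len(s0), len(s1)) // 2 - 1, 0)
    used0 = [False] * len(s0)
    used1 = [False] * len(s1)
    matches = 0
    for i, ch in enumerate(s0):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len(s1))
        for j in range(lo, hi):
            if not used1[j] and s1[j] == ch:
                used0[i] = used1[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    m0 = [c for c, u in zip(s0, used0) if u]
    m1 = [c for c, u in zip(s1, used1) if u]
    transpositions = sum(a != b for a, b in zip(m0, m1)) // 2
    return (matches / len(s0) + matches / len(s1)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s0: str, s1: str, threshold: float = 0.7,
                 scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score plus a prefix bonus, applied only above `threshold`."""
    j = jaro(s0, s1)
    if j <= threshold:
        return j  # below the threshold: no prefix bonus
    prefix = 0
    for a, b in zip(s0, s1):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scaling * (1.0 - j)


print(round(jaro_winkler("Martha", "Marhta"), 3))                  # 0.961
print(round(jaro_winkler("Dixon", "Dicksonx"), 3))                 # 0.813
print(round(jaro_winkler("Dixon", "Dicksonx", threshold=0.8), 3))  # 0.767
```

In this sketch, 'Dixon' vs 'Dicksonx' has a Jaro score of about 0.767. With the default threshold of 0.7 the two-character prefix adds a bonus, while a threshold of 0.8 leaves the plain Jaro score unchanged.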

### Damerau-Levenshtein Distance

The full Damerau-Levenshtein distance with unrestricted transpositions: the minimum number of insertions, deletions, substitutions, and transpositions of adjacent characters needed to turn one string into the other. Unlike restricted variants, it allows a substring to be edited more than once. The result is a true metric, and because a swapped pair of characters counts as a single edit, it is effective for detecting common typing errors.

```python { .api }
class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Damerau-Levenshtein distance with unrestricted transpositions.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Edit distance including transpositions (0 for identical strings;
                not normalized, at most the length of the longer string)

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.damerau import Damerau

damerau = Damerau()

# Basic transposition
distance = damerau.distance('ABCDEF', 'ABDCEF')  # Returns: 1.0 (single transposition)

# Multiple operations
distance = damerau.distance('ABCDEF', 'BACDFE')  # Returns: 2.0 (two transpositions)

# Common typing errors
distance = damerau.distance('teh', 'the')  # Returns: 1.0 (transposition)
distance = damerau.distance('recieve', 'receive')  # Returns: 1.0 (transposition)

# Comparison across different kinds of edits
test_cases = [
    ('ABCDEF', 'ABDCEF'),   # Single transposition
    ('ABCDEF', 'ABCDE'),    # Single deletion
    ('ABCDEF', 'ABCGDEF'),  # Single insertion
    ('ABCDEF', 'POIU'),     # Completely different
]

for s1, s2 in test_cases:
    dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}': {dist}")
```
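
For readers who want to see how unrestricted transpositions are counted, the following is a minimal pure-Python sketch of the standard dynamic-programming formulation. The function name `damerau_levenshtein` is hypothetical and the code is an illustration, not a claim about how `Damerau.distance` is implemented. It returns an `int` for simplicity.

```python
def damerau_levenshtein(s0: str, s1: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    len0, len1 = len(s0), len(s1)
    inf = len0 + len1  # larger than any real distance; used as a sentinel
    # The table has one extra row and column holding the sentinel.
    h = [[0] * (len1 + 2) for _ in range(len0 + 2)]
    h[0][0] = inf
    for i in range(len0 + 1):
        h[i + 1][0] = inf
        h[i + 1][1] = i
    for j in range(len1 + 1):
        h[0][j + 1] = inf
        h[1][j + 1] = j

    last_row = {}  # 1-based row where each character of s0 was last seen
    for i in range(1, len0 + 1):
        last_col = 0  # 1-based column of the last match in the current row
        for j in range(1, len1 + 1):
            i1 = last_row.get(s1[j - 1], 0)
            j1 = last_col
            cost = 1
            if s0[i - 1] == s1[j - 1]:
                cost = 0
                last_col = j
            h[i + 1][j + 1] = min(
                h[i][j] + cost,   # substitution (or match)
                h[i + 1][j] + 1,  # insertion
                h[i][j + 1] + 1,  # deletion
                # transposition, plus edits made between the swapped characters
                h[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[s0[i - 1]] = i
    return h[len0 + 1][len1 + 1]


print(damerau_levenshtein("ABCDEF", "ABDCEF"))  # 1
print(damerau_levenshtein("teh", "the"))        # 1
print(damerau_levenshtein("CA", "ABC"))         # 2
```

The last case shows what "unrestricted" means: 'CA' becomes 'AC' by one transposition, then 'ABC' by inserting 'B' between the swapped characters, for a total of 2.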

### Algorithm Comparison

The two algorithms target different use cases in record linkage and fuzzy matching:

**Jaro-Winkler** is ideal for:
- Short strings (names, identifiers)
- Applications where prefix similarity is important
- Record linkage and duplicate detection
- When you need normalized similarity scores

**Damerau-Levenshtein** is ideal for:
- Detecting typing errors and transpositions
- When you need a true metric distance
- Applications requiring the triangle inequality
- Longer strings where transposition errors are common
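
The "true metric" points above matter because a restricted transposition variant (commonly called optimal string alignment, where no substring is edited more than once) can break the triangle inequality. The sketch below uses a hypothetical helper, `osa_distance`, written only to demonstrate this. It is not part of the API documented here.

```python
def osa_distance(s0: str, s1: str) -> int:
    """Restricted transposition distance (illustrative sketch)."""
    rows, cols = len(s0) + 1, len(s1) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s0[i - 1] == s1[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Only directly adjacent swaps count, with no edits in between.
            if (i > 1 and j > 1
                    and s0[i - 1] == s1[j - 2]
                    and s0[i - 2] == s1[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


direct = osa_distance("CA", "ABC")
detour = osa_distance("CA", "AC") + osa_distance("AC", "ABC")
print(direct, detour)  # 3 2
```

Going from 'CA' to 'ABC' directly costs 3 under this restricted variant, while the detour through 'AC' costs 1 + 1 = 2, which violates the triangle inequality. The unrestricted Damerau-Levenshtein distance assigns the direct path a cost of 2, so the inequality holds.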

**Comparative Example:**

```python
from similarity.jarowinkler import JaroWinkler
from similarity.damerau import Damerau

jw = JaroWinkler()
damerau = Damerau()

test_pairs = [
    ('Martha', 'Marhta'),  # Name with transposition
    ('Smith', 'Schmidt'),  # Name variant needing several edits
    ('hello', 'ehllo'),    # Simple transposition
]

for s1, s2 in test_pairs:
    jw_sim = jw.similarity(s1, s2)
    dam_dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Jaro-Winkler similarity: {jw_sim:.3f}")
    print(f"  Damerau distance: {dam_dist}")
```