
# Phonetic and Record Linkage Algorithms

This document covers algorithms designed for fuzzy matching, typo correction, and record linkage. They work well on short strings such as person names and are tuned to detect common typing errors and character transpositions.

## Capabilities

### Jaro-Winkler Similarity

A string similarity metric developed for record linkage and duplicate detection, particularly effective for short strings such as person names. The algorithm gives higher similarity scores to strings that match from the beginning, making it well-suited for detecting typos in names and identifiers.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7):
        """
        Initialize Jaro-Winkler with similarity threshold.

        Args:
            threshold: Threshold above which prefix bonus is applied (default: 0.7)
        """

    def get_threshold(self) -> float:
        """
        Get the current threshold value.

        Returns:
            float: Threshold value for prefix bonus application
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Similarity score in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler distance (1 - similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Distance score in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def matches(s0: str, s1: str) -> list:
        """
        Calculate detailed match statistics for Jaro-Winkler algorithm.

        Args:
            s0: First string
            s1: Second string

        Returns:
            list: [matches, transpositions, prefix_length, max_length]
        """
```

**Usage Examples:**

```python
from similarity.jarowinkler import JaroWinkler

# Basic usage with default threshold
jw = JaroWinkler()
similarity = jw.similarity('Martha', 'Marhta')  # Returns: ~0.961 (high due to common prefix)
similarity = jw.similarity('Dixon', 'Dicksonx')  # Returns: ~0.813 (base Jaro score ~0.767 plus prefix bonus)
distance = jw.distance('Martha', 'Marhta')  # Returns: ~0.039

# Custom threshold
jw_custom = JaroWinkler(threshold=0.8)
# The prefix bonus is applied only when the base Jaro score exceeds 0.8
similarity = jw_custom.similarity('John', 'Jon')

# Name matching use case
names = ['Smith', 'Smyth', 'Schmidt']
query = 'Smythe'
for name in names:
    score = jw.similarity(query, name)
    print(f"{query} vs {name}: {score:.3f}")
```
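
To make the prefix bonus concrete, here is a minimal pure-Python sketch of the textbook Jaro-Winkler formula. It is an illustration under stated assumptions, not the library's implementation: the function name `jaro_winkler_sketch`, the `scaling` parameter (0.1), and the 4-character prefix cap are assumptions taken from the commonly cited formulation, and the library's own code may differ in details.

```python
def jaro_winkler_sketch(s0: str, s1: str, threshold: float = 0.7,
                        scaling: float = 0.1) -> float:
    """Textbook Jaro-Winkler similarity (illustrative, not the library's code)."""
    if s0 == s1:
        return 1.0
    if not s0 or not s1:
        return 0.0

    # Two characters match only if they are equal and close enough in position.
    window = max(max(len(s0), len(s1)) // 2 - 1, 0)
    used = [False] * len(s1)
    matched0 = []  # matched characters of s0, in s0 order
    for i, ch in enumerate(s0):
        lo, hi = max(0, i - window), min(len(s1), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s1[j] == ch:
                used[j] = True
                matched0.append(ch)
                break

    m = len(matched0)
    if m == 0:
        return 0.0

    # Transpositions: half the matched characters that appear in a different order.
    matched1 = [s1[j] for j in range(len(s1)) if used[j]]
    t = sum(a != b for a, b in zip(matched0, matched1)) / 2
    jaro = (m / len(s0) + m / len(s1) + (m - t) / m) / 3

    # The prefix bonus is applied only above the threshold.
    if jaro <= threshold:
        return jaro

    # Common prefix length, capped at 4 (assumption: Winkler's usual cap).
    prefix = 0
    for a, b in zip(s0[:4], s1[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


print(f"{jaro_winkler_sketch('Martha', 'Marhta'):.3f}")
```

For 'Martha' and 'Marhta' the sketch finds 6 matches, 1 transposition, and a shared prefix of 3, which gives a base Jaro score of about 0.944 and a final score of about 0.961.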

### Damerau-Levenshtein Distance

The full Damerau-Levenshtein distance with unrestricted transpositions: unlike the restricted (optimal string alignment) variant, it allows a substring to be edited any number of times. It counts insertions, deletions, substitutions, and transpositions of adjacent characters, which makes it effective for detecting common typing errors, and it is a true metric distance.

```python { .api }
class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Damerau-Levenshtein distance with unrestricted transpositions.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Edit distance including transpositions (minimum 0; not
                normalized, at most the length of the longer string)

        Raises:
            TypeError: If either string is None
        """
```

**Usage Examples:**

```python
from similarity.damerau import Damerau

damerau = Damerau()

# Basic transposition
distance = damerau.distance('ABCDEF', 'ABDCEF')  # Returns: 1.0 (single transposition)

# Multiple operations
distance = damerau.distance('ABCDEF', 'BACDFE')  # Returns: 2.0

# Common typing errors
distance = damerau.distance('teh', 'the')  # Returns: 1.0 (transposition)
distance = damerau.distance('recieve', 'receive')  # Returns: 1.0 (transposition)

# Comparing several string pairs
test_cases = [
    ('ABCDEF', 'ABDCEF'),   # Single transposition
    ('ABCDEF', 'ABCDE'),    # Single deletion
    ('ABCDEF', 'ABCGDEF'),  # Single insertion
    ('ABCDEF', 'POIU'),     # Completely different
]

for s1, s2 in test_cases:
    dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}': {dist}")
```
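
The following is a minimal pure-Python sketch of the unrestricted Damerau-Levenshtein distance, following the widely published dynamic-programming formulation. The function name `damerau_levenshtein_sketch` is hypothetical and the code is illustrative rather than the library's implementation; note that it returns an `int`, whereas the API above is documented as returning `float`.

```python
def damerau_levenshtein_sketch(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    max_dist = len(a) + len(b)
    last_row = {}  # last row (1-based) in which each character of `a` was seen

    # The table is shifted by one so that row 0 and column 0 hold a large
    # sentinel value; row 1 and column 1 hold the distances to the empty string.
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j

    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            m = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                           # substitution or match
                d[i + 1][j] + 1,                          # insertion
                d[i][j + 1] + 1,                          # deletion
                d[k][m] + (i - k - 1) + 1 + (j - m - 1),  # transposition
            )
        last_row[a[i - 1]] = i

    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein_sketch('teh', 'the'))  # 1
print(damerau_levenshtein_sketch('CA', 'ABC'))   # 2
```

The pair 'CA' and 'ABC' shows what "unrestricted" means: the result is 2 (transpose to 'AC', then insert 'B' between the swapped characters), an edit sequence that the restricted variant does not permit.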

### Algorithm Comparison

The two algorithms target different use cases in record linkage and fuzzy matching:

**Jaro-Winkler** is ideal for:
- Short strings (names, identifiers)
- Applications where prefix similarity is important
- Record linkage and duplicate detection
- When you need normalized similarity scores

**Damerau-Levenshtein** is ideal for:
- Detecting typing errors and transpositions
- When you need a true metric distance
- Applications requiring the triangle inequality
- Longer strings where transposition errors are common

**Comparative Example:**

```python
from similarity.jarowinkler import JaroWinkler
from similarity.damerau import Damerau

jw = JaroWinkler()
damerau = Damerau()

test_pairs = [
    ('Martha', 'Marhta'),  # Name with transposition
    ('Smith', 'Schmidt'),  # Name variant (several edits)
    ('hello', 'ehllo'),    # Simple transposition
]

for s1, s2 in test_pairs:
    jw_sim = jw.similarity(s1, s2)
    dam_dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Jaro-Winkler similarity: {jw_sim:.3f}")
    print(f"  Damerau distance: {dam_dist}")
```
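
The comparison prints a similarity in [0.0, 1.0] next to a raw edit count, so the two numbers are not directly comparable. One common way to put an edit distance on the same scale is to divide it by the length of the longer string. The sketch below assumes that convention; `normalized_similarity` is a hypothetical helper, not part of the library, and it takes the already computed distance (for example the result of `damerau.distance(s0, s1)`) as its first argument.

```python
def normalized_similarity(distance: float, s0: str, s1: str) -> float:
    """Map an edit distance to a similarity in [0.0, 1.0] (hypothetical helper)."""
    longest = max(len(s0), len(s1))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - distance / longest


# Distances of 1 correspond to the single transpositions in the pairs above.
print(round(normalized_similarity(1, 'Martha', 'Marhta'), 3))  # 0.833
print(round(normalized_similarity(1, 'hello', 'ehllo'), 3))    # 0.8
```

This works for Damerau-Levenshtein because the distance never exceeds the length of the longer string, so the result stays within [0.0, 1.0].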