Tessl Tile for pypi/strsim@0.0.0

# Phonetic and Record Linkage Algorithms

Algorithms designed for fuzzy matching, typo correction, and record linkage. They work well on short strings such as person names and are tuned to catch common typing errors and character transpositions.

## Capabilities

### Jaro-Winkler Similarity

A string similarity metric developed for record linkage and duplicate detection, particularly effective for short strings such as person names. Strings that share a common prefix receive a higher score, which makes the metric well suited to detecting typos in names and identifiers.

```python { .api }
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7):
        """
        Initialize Jaro-Winkler with similarity threshold.

        Args:
            threshold: Jaro score above which the prefix bonus is applied (default: 0.7)
        """

    def get_threshold(self) -> float:
        """
        Get the current threshold value.

        Returns:
            float: Threshold value for prefix bonus application
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Similarity score in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler distance (1 - similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Distance score in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def matches(s0: str, s1: str) -> list:
        """
        Calculate detailed match statistics for Jaro-Winkler algorithm.

        Args:
            s0: First string
            s1: Second string

        Returns:
            list: [matches, transpositions, prefix_length, max_length]
        """
```
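
To make the role of `threshold` and the prefix bonus concrete, here is a minimal pure-Python sketch of the textbook Jaro-Winkler calculation. It is an illustration, not the library's code: the function names `jaro` and `jaro_winkler`, the scaling factor of 0.1, and the 4-character prefix cap are the conventional Winkler choices and are assumptions here.

```python
def jaro(s0: str, s1: str) -> float:
    """Plain Jaro similarity in [0.0, 1.0] (illustrative sketch)."""
    if s0 == s1:
        return 1.0
    if not s0 or not s1:
        return 0.0
    # Two characters match only if they are equal and at most `window` apart.
    window = max(max(len(s0), len(s1)) // 2 - 1, 0)
    used = [False] * len(s1)
    matched0 = []
    for i, ch in enumerate(s0):
        lo = max(0, i - window)
        hi = min(len(s1), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s1[j] == ch:
                used[j] = True
                matched0.append(ch)
                break
    m = len(matched0)
    if m == 0:
        return 0.0
    matched1 = [s1[j] for j in range(len(s1)) if used[j]]
    # Matched characters in a different order; each swapped pair is counted once.
    transpositions = sum(a != b for a, b in zip(matched0, matched1)) // 2
    return (m / len(s0) + m / len(s1) + (m - transpositions) / m) / 3


def jaro_winkler(s0: str, s1: str, threshold: float = 0.7,
                 scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score plus a prefix bonus, applied only above `threshold`."""
    j = jaro(s0, s1)
    if j <= threshold:
        return j  # below the threshold the score is left as plain Jaro
    prefix = 0
    for a, b in zip(s0, s1):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scaling * (1.0 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))                  # 0.961
print(round(jaro_winkler("Dixon", "Dicksonx"), 3))                 # 0.813
print(round(jaro_winkler("Dixon", "Dicksonx", threshold=0.8), 3))  # 0.767
```

The last two calls show the effect of the threshold: the plain Jaro score for that pair is about 0.767, so the prefix bonus is added with the default of 0.7 but skipped when the threshold is raised to 0.8.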

**Usage Examples:**

```python
from similarity.jarowinkler import JaroWinkler

# Basic usage with default threshold
jw = JaroWinkler()
similarity = jw.similarity('Martha', 'Marhta')    # Returns: ~0.961 (high due to common prefix)
similarity = jw.similarity('Dixon', 'Dicksonx')   # Returns: ~0.813
distance = jw.distance('Martha', 'Marhta')        # Returns: ~0.039

# Custom threshold
jw_custom = JaroWinkler(threshold=0.8)
similarity = jw_custom.similarity('John', 'Jon')  # Prefix bonus applies only if the Jaro score exceeds 0.8

# Name matching use case
names = ['Smith', 'Smyth', 'Schmidt']
query = 'Smythe'
for name in names:
    score = jw.similarity(query, name)
    print(f"{query} vs {name}: {score:.3f}")
```

### Damerau-Levenshtein Distance

The full (unrestricted) Damerau-Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions, plus transpositions of two adjacent characters, needed to turn one string into the other. Unlike the optimal string alignment variant, a substring may be edited more than once. The distance satisfies the triangle inequality, so it is a true metric, and it is effective for detecting common typing errors.

```python { .api }
class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Damerau-Levenshtein distance with unrestricted transpositions.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Edit distance including transpositions (minimum 0; not normalized,
                at most the length of the longer string)

        Raises:
            TypeError: If either string is None
        """
```
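
The "unrestricted" behaviour described above can be sketched with the classic dynamic-programming formulation that remembers, for each character, the last row in which it was seen. This is a self-contained illustration under that assumption, not the library's implementation; the function name `damerau_levenshtein` is hypothetical, and the sketch returns an `int` rather than the `float` in the signature above.

```python
def damerau_levenshtein(s0: str, s1: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    len0, len1 = len(s0), len(s1)
    inf = len0 + len1  # larger than any achievable distance
    # d carries one extra sentinel row and column (index 0) filled with `inf`.
    d = [[inf] * (len1 + 2) for _ in range(len0 + 2)]
    for i in range(len0 + 1):
        d[i + 1][1] = i  # delete the first i characters of s0
    for j in range(len1 + 1):
        d[1][j + 1] = j  # insert the first j characters of s1
    last_row = {}  # character -> last 1-based position in s0 where it occurred
    for i in range(1, len0 + 1):
        last_col = 0  # last 1-based position in s1 that matched s0[i - 1]
        for j in range(1, len1 + 1):
            i1 = last_row.get(s1[j - 1], 0)
            j1 = last_col
            cost = 1
            if s0[i - 1] == s1[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,      # substitution (or match)
                d[i + 1][j] + 1,     # insertion
                d[i][j + 1] + 1,     # deletion
                # transposition, with any edits made between the swapped pair
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[s0[i - 1]] = i
    return d[len0 + 1][len1 + 1]


print(damerau_levenshtein("ABCDEF", "ABDCEF"))  # 1
print(damerau_levenshtein("ABCDEF", "BACDFE"))  # 2
print(damerau_levenshtein("CA", "ABC"))         # 2
```

The pair `'CA'` and `'ABC'` is the usual example that separates the two variants: the unrestricted distance is 2 (transpose to `'AC'`, then insert `'B'` between the swapped characters), whereas optimal string alignment needs 3 because it may not edit the transposed substring again.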

**Usage Examples:**

```python
from similarity.damerau import Damerau

damerau = Damerau()

# Basic transposition
distance = damerau.distance('ABCDEF', 'ABDCEF')  # Returns: 1.0 (single transposition)

# Multiple operations
distance = damerau.distance('ABCDEF', 'BACDFE')  # Returns: 2.0

# Common typing errors
distance = damerau.distance('teh', 'the')        # Returns: 1.0 (transposition)
distance = damerau.distance('recieve', 'receive') # Returns: 1.0 (transposition)

# Compare several string pairs
test_cases = [
    ('ABCDEF', 'ABDCEF'),  # Single transposition
    ('ABCDEF', 'ABCDE'),   # Single deletion
    ('ABCDEF', 'ABCGDEF'), # Single insertion
    ('ABCDEF', 'POIU')     # Completely different
]

for s1, s2 in test_cases:
    dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}': {dist}")
```

### Algorithm Comparison

The two algorithms target different use cases in record linkage and fuzzy matching:

**Jaro-Winkler** is ideal for:
- Short strings (names, identifiers)
- Applications where prefix similarity is important
- Record linkage and duplicate detection
- Cases that need normalized similarity scores

**Damerau-Levenshtein** is ideal for:
- Detecting typing errors and transpositions
- Cases that need a true metric distance
- Applications requiring the triangle inequality
- Longer strings where transposition errors are common

**Comparative Example:**

```python
from similarity.jarowinkler import JaroWinkler
from similarity.damerau import Damerau

jw = JaroWinkler()
damerau = Damerau()

test_pairs = [
    ('Martha', 'Marhta'),    # Name with transposition
    ('Smith', 'Schmidt'),    # Name spelling variant (several edits)
    ('hello', 'ehllo'),      # Simple transposition
]

for s1, s2 in test_pairs:
    jw_sim = jw.similarity(s1, s2)
    dam_dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Jaro-Winkler similarity: {jw_sim:.3f}")
    print(f"  Damerau distance: {dam_dist}")
```
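
The comparison prints a similarity in [0.0, 1.0] next to a raw edit count, which makes the two numbers hard to read side by side. One possible way to put an edit distance on the same scale is to divide by the length of the longer string, which bounds any Levenshtein-family distance. The helper name `normalized_similarity` is hypothetical and not part of the documented API.

```python
def normalized_similarity(distance: float, s0: str, s1: str) -> float:
    """Rescale an edit distance to a similarity in [0.0, 1.0].

    Assumes `distance` never exceeds the length of the longer string,
    which holds for Levenshtein-family distances such as Damerau-Levenshtein.
    """
    longest = max(len(s0), len(s1))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - distance / longest


# A single transposition in a six-character name:
print(round(normalized_similarity(1, "Martha", "Marhta"), 3))  # 0.833
```

A rescaled value like this is convenient for ranking candidates, but it need not satisfy the triangle inequality, so keep the raw distance when a true metric is required.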
