A library implementing different string similarity and distance measures
These algorithms are designed for fuzzy matching, typo correction, and record linkage. They work best on short strings such as person names and are tuned to detect common typing errors and character transpositions.
A string similarity metric developed for record linkage and duplicate detection, particularly effective for short strings such as person names. The algorithm gives higher similarity scores to strings that match from the beginning, making it well-suited for detecting typos in names and identifiers.
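To make the prefix bonus concrete, here is a minimal pure-Python sketch of the textbook Jaro-Winkler formula. It is not the library's code: the function name `jaro_winkler`, the scaling factor of 0.1, and the prefix cap of 4 characters are assumptions based on the conventional Winkler definition, and the library's own implementation may differ in those details.

```python
def jaro_winkler(s0: str, s1: str, threshold: float = 0.7, scaling: float = 0.1) -> float:
    """Textbook Jaro-Winkler similarity (illustrative sketch, not the library's code)."""
    if s0 == s1:
        return 1.0
    len0, len1 = len(s0), len(s1)
    if len0 == 0 or len1 == 0:
        return 0.0

    # Two characters match only if they are equal and at most `window` positions apart.
    window = max(max(len0, len1) // 2 - 1, 0)
    matched0 = [False] * len0
    matched1 = [False] * len1
    matches = 0
    for i, ch in enumerate(s0):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len1)
        for j in range(lo, hi):
            if not matched1[j] and s1[j] == ch:
                matched0[i] = matched1[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order; each swap is seen twice.
    seq0 = [c for c, m in zip(s0, matched0) if m]
    seq1 = [c for c, m in zip(s1, matched1) if m]
    transpositions = sum(a != b for a, b in zip(seq0, seq1)) // 2

    jaro = (matches / len0 + matches / len1 + (matches - transpositions) / matches) / 3
    if jaro <= threshold:
        return jaro  # below the threshold, no prefix bonus is applied

    # Length of the common prefix, capped at 4 characters (assumed convention).
    prefix = 0
    for a, b in zip(s0, s1):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


print(round(jaro_winkler('Martha', 'Marhta'), 3))                 # with prefix bonus
print(round(jaro_winkler('Martha', 'Marhta', threshold=1.0), 3))  # plain Jaro, bonus disabled
```

Setting `threshold=1.0` in this sketch disables the bonus, so comparing the two printed values shows how much the shared prefix "Mar" contributes.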
class JaroWinkler(NormalizedStringSimilarity, NormalizedStringDistance):
    def __init__(self, threshold: float = 0.7):
        """
        Initialize Jaro-Winkler with similarity threshold.

        Args:
            threshold: Threshold above which prefix bonus is applied (default: 0.7)
        """

    def get_threshold(self) -> float:
        """
        Get the current threshold value.

        Returns:
            float: Threshold value for prefix bonus application
        """

    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler similarity between two strings.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Similarity score in range [0.0, 1.0] where 1.0 = identical

        Raises:
            TypeError: If either string is None
        """

    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaro-Winkler distance (1 - similarity).

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Distance score in range [0.0, 1.0] where 0.0 = identical

        Raises:
            TypeError: If either string is None
        """

    @staticmethod
    def matches(s0: str, s1: str) -> list:
        """
        Calculate detailed match statistics for Jaro-Winkler algorithm.

        Args:
            s0: First string
            s1: Second string

        Returns:
            list: [matches, transpositions, prefix_length, max_length]
        """

Usage Examples:
from similarity.jarowinkler import JaroWinkler

# Basic usage with default threshold
jw = JaroWinkler()
similarity = jw.similarity('Martha', 'Marhta')  # Returns: ~0.961 (high due to common prefix)
similarity = jw.similarity('Dixon', 'Dicksonx')  # Returns: ~0.813
distance = jw.distance('Martha', 'Marhta')  # Returns: ~0.039

# Custom threshold
jw_custom = JaroWinkler(threshold=0.8)
similarity = jw_custom.similarity('John', 'Jon')  # Prefix bonus is applied only above the 0.8 threshold

# Name matching use case
names = ['Smith', 'Smyth', 'Schmidt']
query = 'Smythe'
for name in names:
    score = jw.similarity(query, name)
    print(f"{query} vs {name}: {score:.3f}")

The full Damerau-Levenshtein distance with unrestricted transpositions, which allows any number of edit operations on a substring. This distance is a metric and supports insertions, deletions, substitutions, and transpositions of adjacent characters, making it effective for detecting common typing errors.
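The following is a minimal pure-Python sketch of the unrestricted Damerau-Levenshtein recurrence, included to show how the transposition case enters the dynamic program. It is an illustration of the standard textbook algorithm, not the library's implementation; the function name `damerau_levenshtein` is hypothetical, and it returns an `int` where the library's `distance` is documented as returning a `float`.

```python
def damerau_levenshtein(s0: str, s1: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    if s0 == s1:
        return 0
    len0, len1 = len(s0), len(s1)
    inf = len0 + len1  # larger than any real distance

    # d is shifted by one row and one column; row 0 and column 0 hold `inf`
    # so that a transposition with no earlier match is never chosen.
    d = [[0] * (len1 + 2) for _ in range(len0 + 2)]
    d[0][0] = inf
    for i in range(len0 + 1):
        d[i + 1][0] = inf
        d[i + 1][1] = i
    for j in range(len1 + 1):
        d[0][j + 1] = inf
        d[1][j + 1] = j

    last_row = {}  # character -> last 1-based position in s0 where it was seen
    for i in range(1, len0 + 1):
        last_match_col = 0  # last 1-based column in this row where s0[i-1] matched
        for j in range(1, len1 + 1):
            i1 = last_row.get(s1[j - 1], 0)
            j1 = last_match_col
            cost = 1
            if s0[i - 1] == s1[j - 1]:
                cost = 0
                last_match_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,    # substitution (or match)
                d[i + 1][j] + 1,   # insertion
                d[i][j + 1] + 1,   # deletion
                # transposition, plus edits to the characters between the swapped pair
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[s0[i - 1]] = i
    return d[len0 + 1][len1 + 1]


for a, b in [('ABCDEF', 'ABDCEF'), ('teh', 'the'), ('CA', 'ABC')]:
    print(f"'{a}' vs '{b}': {damerau_levenshtein(a, b)}")
```

The pair `('CA', 'ABC')` is the usual case that separates the unrestricted variant from the restricted (optimal string alignment) one: here the swapped characters may still be edited afterwards, so one transposition plus one insertion suffices.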
class Damerau(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Damerau-Levenshtein distance with unrestricted transpositions.

        Args:
            s0: First string
            s1: Second string

        Returns:
            float: Edit distance including transpositions (minimum 0, no maximum limit)

        Raises:
            TypeError: If either string is None
        """

Usage Examples:
from similarity.damerau import Damerau

damerau = Damerau()

# Basic transposition
distance = damerau.distance('ABCDEF', 'ABDCEF')  # Returns: 1.0 (single transposition)

# Multiple operations
distance = damerau.distance('ABCDEF', 'BACDFE')  # Returns: 2.0

# Common typing errors
distance = damerau.distance('teh', 'the')  # Returns: 1.0 (transposition)
distance = damerau.distance('recieve', 'receive')  # Returns: 1.0 (transposition)

# Comparing several string pairs
test_cases = [
    ('ABCDEF', 'ABDCEF'),   # Single transposition
    ('ABCDEF', 'ABCDE'),    # Single deletion
    ('ABCDEF', 'ABCGDEF'),  # Single insertion
    ('ABCDEF', 'POIU'),     # Completely different
]
for s1, s2 in test_cases:
    dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}': {dist}")

The two algorithms are designed for different use cases in record linkage and fuzzy matching:
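Jaro-Winkler is ideal for: short strings such as person names and identifiers, where agreement at the start of the string should count for more.
Damerau-Levenshtein is ideal for: counting the edits (insertions, deletions, substitutions, and adjacent transpositions) that separate two strings, as when detecting common typing errors.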
Comparative Example:

from similarity.jarowinkler import JaroWinkler
from similarity.damerau import Damerau

jw = JaroWinkler()
damerau = Damerau()

test_pairs = [
    ('Martha', 'Marhta'),  # Name with transposition
    ('Smith', 'Schmidt'),  # Name variant (insertions and a substitution)
    ('hello', 'ehllo'),    # Simple transposition
]
for s1, s2 in test_pairs:
    jw_sim = jw.similarity(s1, s2)
    dam_dist = damerau.distance(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Jaro-Winkler similarity: {jw_sim:.3f}")
    print(f"  Damerau distance: {dam_dist}")

Install with Tessl CLI
npx tessl i tessl/pypi-strsim