A library implementing different string similarity and distance measures
Classic edit distance algorithms that measure the minimum number of character operations (insertions, deletions, substitutions, and sometimes transpositions) needed to transform one string into another. These algorithms are fundamental to many text processing and comparison tasks.
The classic Levenshtein distance algorithm computes the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. This is a metric distance that satisfies the triangle inequality.
class Levenshtein(MetricStringDistance):
def distance(self, s0: str, s1: str) -> float:
"""
Calculate Levenshtein distance between two strings.
Args:
s0: First string
s1: Second string
Returns:
float: Number of edit operations (minimum 0, no maximum limit)
Raises:
TypeError: If either string is None
"""Usage Example:
from similarity.levenshtein import Levenshtein
lev = Levenshtein()
distance = lev.distance('kitten', 'sitting') # Returns: 3.0
distance = lev.distance('hello', 'hello') # Returns: 0.0A normalized version of Levenshtein distance that returns values in the range [0.0, 1.0] by dividing the edit distance by the length of the longest string. Also provides similarity as 1 - normalized distance.
class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
def __init__(self): ...
def distance(self, s0: str, s1: str) -> float:
"""
Calculate normalized Levenshtein distance (0.0-1.0).
Args:
s0: First string
s1: Second string
Returns:
float: Normalized distance where 0.0 = identical, 1.0 = completely different
Raises:
TypeError: If either string is None
"""
def similarity(self, s0: str, s1: str) -> float:
"""
Calculate normalized Levenshtein similarity (0.0-1.0).
Args:
s0: First string
s1: Second string
Returns:
float: Normalized similarity where 1.0 = identical, 0.0 = completely different
Raises:
TypeError: If either string is None
"""Usage Example:
from similarity.normalized_levenshtein import NormalizedLevenshtein
norm_lev = NormalizedLevenshtein()
distance = norm_lev.distance('kitten', 'sitting') # Returns: ~0.428
similarity = norm_lev.similarity('kitten', 'sitting') # Returns: ~0.571An implementation of Levenshtein distance that allows custom weights for different character operations. Commonly used for OCR applications where certain character substitutions are more likely, or keyboard typing auto-correction where adjacent keys have lower substitution costs.
class WeightedLevenshtein(StringDistance):
def __init__(self, character_substitution: CharacterSubstitutionInterface,
character_ins_del: CharacterInsDelInterface = None):
"""
Initialize with custom cost interfaces.
Args:
character_substitution: Interface defining substitution costs
character_ins_del: Optional interface for insertion/deletion costs (default: 1.0)
"""
def distance(self, s0: str, s1: str) -> float:
"""
Calculate weighted Levenshtein distance with custom operation costs.
Args:
s0: First string
s1: Second string
Returns:
float: Weighted edit distance
Raises:
TypeError: If either string is None
"""
class CharacterSubstitutionInterface:
def cost(self, c0: str, c1: str) -> float:
"""
Return the cost of substituting character c0 with c1.
Args:
c0: Original character
c1: Replacement character
Returns:
float: Cost of substitution (typically 0.0-1.0 range)
"""
class CharacterInsDelInterface:
def deletion_cost(self, c: str) -> float:
"""
Return the cost of deleting character c.
Args:
c: Character to delete
Returns:
float: Cost of deletion
"""
def insertion_cost(self, c: str) -> float:
"""
Return the cost of inserting character c.
Args:
c: Character to insert
Returns:
float: Cost of insertion
"""Usage Example:
from similarity.weighted_levenshtein import WeightedLevenshtein, CharacterSubstitutionInterface
class OCRSubstitution(CharacterSubstitutionInterface):
def cost(self, c0, c1):
# Lower cost for visually similar characters
if (c0, c1) in [('o', '0'), ('l', '1'), ('S', '5')]:
return 0.3
return 1.0
weighted_lev = WeightedLevenshtein(OCRSubstitution())
distance = weighted_lev.distance('S0me1ext', 'Some1ext') # Lower cost due to custom weightsA variant of Damerau-Levenshtein that computes edit distance with the restriction that no substring can be edited more than once. Supports insertions, deletions, substitutions, and adjacent character transpositions.
class OptimalStringAlignment(StringDistance):
def distance(self, s0: str, s1: str) -> float:
"""
Calculate Optimal String Alignment distance.
Args:
s0: First string
s1: Second string
Returns:
float: OSA distance (minimum 0, no maximum limit)
Raises:
TypeError: If either string is None
"""Usage Example:
from similarity.optimal_string_alignment import OptimalStringAlignment
osa = OptimalStringAlignment()
distance = osa.distance('CA', 'ABC') # Returns: 3.0
distance = osa.distance('hello', 'ehllo') # Returns: 1.0 (transposition)Install with Tessl CLI
npx tessl i tessl/pypi-strsim