A library implementing different string similarity and distance measures
Algorithms based on longest common subsequences and sequence alignment techniques. These algorithms are commonly used in diff utilities, version control systems, bioinformatics applications, and other domains where sequence comparison is important.
Computes the distance based on the longest common subsequence between two strings. Unlike substrings, subsequences don't need to be contiguous, making this algorithm useful for comparing strings with insertions and rearrangements.
class LongestCommonSubsequence(StringDistance):
def distance(self, s0: str, s1: str) -> float:
"""
Calculate LCS distance: len(s0) + len(s1) - 2 * LCS_length.
Args:
s0: First string
s1: Second string
Returns:
float: LCS distance (minimum 0, maximum len(s0) + len(s1))
Raises:
TypeError: If either string is None
"""
@staticmethod
def length(s0: str, s1: str) -> float:
"""
Calculate the length of the longest common subsequence.
Args:
s0: First string
s1: Second string
Returns:
float: Length of the longest common subsequence
Raises:
TypeError: If either string is None
"""Usage Examples:
from similarity.longest_common_subsequence import LongestCommonSubsequence
lcs = LongestCommonSubsequence()
# Basic LCS distance
distance = lcs.distance('AGCAT', 'GAC') # Returns: 4.0
distance = lcs.distance('AGCAT', 'AGCT') # Returns: 1.0
# LCS length calculation
length = lcs.length('AGCAT', 'GAC') # Returns: 2.0 (common: 'AC')
length = lcs.length('ABCDEF', 'ACBDEF') # Returns: 5.0 (common: 'ACDEF')
# Practical example: comparing file versions
version1 = "def hello_world():\n print('Hello')\n return"
version2 = "def hello_world():\n print('Hello, World!')\n return"
lcs_len = lcs.length(version1, version2)
lcs_dist = lcs.distance(version1, version2)
print(f"Common characters: {lcs_len}")
print(f"LCS distance: {lcs_dist}")A normalized version of LCS distance that returns values in the range [0.0, 1.0] and satisfies the triangle inequality, making it a proper metric distance. Based on the paper "An LCS-based string metric" by Daniel Bakkelund.
class MetricLCS(MetricStringDistance, NormalizedStringDistance):
def __init__(self):
"""Initialize MetricLCS with internal LCS instance."""
def distance(self, s0: str, s1: str) -> float:
"""
Calculate normalized metric LCS distance: 1 - |LCS(s0,s1)| / max(|s0|, |s1|).
Args:
s0: First string
s1: Second string
Returns:
float: Normalized distance in range [0.0, 1.0] where 0.0 = identical
Raises:
TypeError: If either string is None
"""Usage Examples:
from similarity.metric_lcs import MetricLCS
metric_lcs = MetricLCS()
# Basic normalized distance
distance = metric_lcs.distance('ABCDEFG', 'ABCDEFHJKL') # Returns: 0.4
distance = metric_lcs.distance('ABDEF', 'ABDIF') # Returns: ~0.2
# Practical comparison with different string lengths
strings = ['programming', 'programing', 'programs', 'grogram']
base = 'program'
for s in strings:
dist = metric_lcs.distance(base, s)
print(f"'{base}' vs '{s}': {dist:.3f}")
# Triangle inequality verification (metric property)
s1, s2, s3 = 'abc', 'def', 'ghi'
d12 = metric_lcs.distance(s1, s2)
d23 = metric_lcs.distance(s2, s3)
d13 = metric_lcs.distance(s1, s3)
print(f"Triangle inequality: {d13} <= {d12} + {d23} = {d12 + d23}")
assert d13 <= d12 + d23 # Should always be True for metric distancesLCS Distance is ideal for:
Metric LCS is ideal for:
Comparative Example:
from similarity.longest_common_subsequence import LongestCommonSubsequence
from similarity.metric_lcs import MetricLCS
lcs = LongestCommonSubsequence()
metric_lcs = MetricLCS()
test_pairs = [
('ABCDEF', 'ACEF'), # Deletions
('ABC', 'XAYBZC'), # Insertions
('hello', 'world'), # Different strings
('programming', 'program'), # Substring relationship
]
for s1, s2 in test_pairs:
lcs_dist = lcs.distance(s1, s2)
lcs_len = lcs.length(s1, s2)
metric_dist = metric_lcs.distance(s1, s2)
print(f"'{s1}' vs '{s2}':")
print(f" LCS length: {lcs_len}")
print(f" LCS distance: {lcs_dist}")
print(f" Metric LCS distance: {metric_dist:.3f}")
print()Both algorithms use dynamic programming with O(m*n) time complexity and space requirements, where m and n are the string lengths. The LCS algorithms are equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitutions), or when the substitution cost is double the insertion/deletion cost.
Install with Tessl CLI
npx tessl i tessl/pypi-strsim