tessl/pypi-strsim

A library implementing different string similarity and distance measures

Overview

Eval results

Files

Sequence-Based Algorithms

Name: tessl/pypi-strsim
Author: tessl

Algorithms based on longest common subsequences and sequence alignment techniques. These algorithms are commonly used in diff utilities, version control systems, bioinformatics applications, and other domains where sequence comparison is important.

Capabilities

Longest Common Subsequence (LCS)

Computes the distance based on the longest common subsequence between two strings. Unlike substrings, subsequences don't need to be contiguous, making this algorithm useful for comparing strings with insertions and rearrangements.

class LongestCommonSubsequence(StringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate LCS distance: len(s0) + len(s1) - 2 * LCS_length.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: LCS distance (minimum 0, maximum len(s0) + len(s1))
            
        Raises:
            TypeError: If either string is None
        """
    
    @staticmethod
    def length(s0: str, s1: str) -> float:
        """
        Calculate the length of the longest common subsequence.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Length of the longest common subsequence
            
        Raises:
            TypeError: If either string is None
        """

Usage Examples:

from similarity.longest_common_subsequence import LongestCommonSubsequence

lcs = LongestCommonSubsequence()

# Basic LCS distance
distance = lcs.distance('AGCAT', 'GAC')     # Returns: 4.0
distance = lcs.distance('AGCAT', 'AGCT')    # Returns: 1.0

# LCS length calculation
length = lcs.length('AGCAT', 'GAC')         # Returns: 2.0 (common: 'AC')
length = lcs.length('ABCDEF', 'ACBDEF')     # Returns: 5.0 (common: 'ACDEF')

# Practical example: comparing file versions
version1 = "def hello_world():\n    print('Hello')\n    return"
version2 = "def hello_world():\n    print('Hello, World!')\n    return"
lcs_len = lcs.length(version1, version2)
lcs_dist = lcs.distance(version1, version2)
print(f"Common characters: {lcs_len}")
print(f"LCS distance: {lcs_dist}")

Metric LCS

A normalized version of LCS distance that returns values in the range [0.0, 1.0] and satisfies the triangle inequality, making it a proper metric distance. Based on the paper "An LCS-based string metric" by Daniel Bakkelund.

class MetricLCS(MetricStringDistance, NormalizedStringDistance):
    def __init__(self):
        """Initialize MetricLCS with internal LCS instance."""
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized metric LCS distance: 1 - |LCS(s0,s1)| / max(|s0|, |s1|).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Normalized distance in range [0.0, 1.0] where 0.0 = identical
            
        Raises:
            TypeError: If either string is None
        """

Usage Examples:

from similarity.metric_lcs import MetricLCS

metric_lcs = MetricLCS()

# Basic normalized distance
distance = metric_lcs.distance('ABCDEFG', 'ABCDEFHJKL')  # Returns: 0.4
distance = metric_lcs.distance('ABDEF', 'ABDIF')         # Returns: ~0.2

# Practical comparison with different string lengths
strings = ['programming', 'programing', 'programs', 'grogram']
base = 'program'

for s in strings:
    dist = metric_lcs.distance(base, s)
    print(f"'{base}' vs '{s}': {dist:.3f}")

# Triangle inequality verification (metric property)
s1, s2, s3 = 'abc', 'def', 'ghi'
d12 = metric_lcs.distance(s1, s2)
d23 = metric_lcs.distance(s2, s3)
d13 = metric_lcs.distance(s1, s3)
print(f"Triangle inequality: {d13} <= {d12} + {d23} = {d12 + d23}")
assert d13 <= d12 + d23  # Should always be True for metric distances

Algorithm Comparison

LCS Distance is ideal for:

Diff utilities and version control
Comparing sequences where order matters but gaps are allowed
Applications needing raw distance values
When you don't need normalized results

Metric LCS is ideal for:

Applications requiring normalized distances (0.0-1.0)
When you need triangle inequality guarantees
Clustering or nearest-neighbor algorithms
Comparing strings of very different lengths

Comparative Example:

from similarity.longest_common_subsequence import LongestCommonSubsequence
from similarity.metric_lcs import MetricLCS

lcs = LongestCommonSubsequence()
metric_lcs = MetricLCS()

test_pairs = [
    ('ABCDEF', 'ACEF'),      # Deletions
    ('ABC', 'XAYBZC'),       # Insertions
    ('hello', 'world'),      # Different strings
    ('programming', 'program'), # Substring relationship
]

for s1, s2 in test_pairs:
    lcs_dist = lcs.distance(s1, s2)
    lcs_len = lcs.length(s1, s2)
    metric_dist = metric_lcs.distance(s1, s2)
    
    print(f"'{s1}' vs '{s2}':")
    print(f"  LCS length: {lcs_len}")
    print(f"  LCS distance: {lcs_dist}")
    print(f"  Metric LCS distance: {metric_dist:.3f}")
    print()

Implementation Notes

Both algorithms use dynamic programming with O(m*n) time complexity and space requirements, where m and n are the string lengths. The LCS algorithms are equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitutions), or when the substitution cost is double the insertion/deletion cost.

Install with Tessl CLI