tessl/pypi-strsim

A library implementing different string similarity and distance measures

Overview

Eval results

Files

Edit Distance Algorithms

Name: tessl/pypi-strsim
Author: tessl

Classic edit distance algorithms that measure the minimum number of character operations (insertions, deletions, substitutions, and sometimes transpositions) needed to transform one string into another. These algorithms are fundamental to many text processing and comparison tasks.

Capabilities

Levenshtein Distance

The classic Levenshtein distance algorithm computes the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. This is a metric distance that satisfies the triangle inequality.

class Levenshtein(MetricStringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Levenshtein distance between two strings.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Number of edit operations (minimum 0, no maximum limit)
            
        Raises:
            TypeError: If either string is None
        """

Usage Example:

from similarity.levenshtein import Levenshtein

lev = Levenshtein()
distance = lev.distance('kitten', 'sitting')  # Returns: 3.0
distance = lev.distance('hello', 'hello')     # Returns: 0.0

Normalized Levenshtein

A normalized version of Levenshtein distance that returns values in the range [0.0, 1.0] by dividing the edit distance by the length of the longest string. Also provides similarity as 1 - normalized distance.

class NormalizedLevenshtein(NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self): ...
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized Levenshtein distance (0.0-1.0).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Normalized distance where 0.0 = identical, 1.0 = completely different
            
        Raises:
            TypeError: If either string is None
        """
    
    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate normalized Levenshtein similarity (0.0-1.0).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Normalized similarity where 1.0 = identical, 0.0 = completely different
            
        Raises:
            TypeError: If either string is None
        """

Usage Example:

from similarity.normalized_levenshtein import NormalizedLevenshtein

norm_lev = NormalizedLevenshtein()
distance = norm_lev.distance('kitten', 'sitting')    # Returns: ~0.428
similarity = norm_lev.similarity('kitten', 'sitting') # Returns: ~0.571

Weighted Levenshtein

An implementation of Levenshtein distance that allows custom weights for different character operations. Commonly used for OCR applications where certain character substitutions are more likely, or keyboard typing auto-correction where adjacent keys have lower substitution costs.

class WeightedLevenshtein(StringDistance):
    def __init__(self, character_substitution: CharacterSubstitutionInterface, 
                 character_ins_del: CharacterInsDelInterface = None):
        """
        Initialize with custom cost interfaces.
        
        Args:
            character_substitution: Interface defining substitution costs
            character_ins_del: Optional interface for insertion/deletion costs (default: 1.0)
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate weighted Levenshtein distance with custom operation costs.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Weighted edit distance
            
        Raises:
            TypeError: If either string is None
        """

class CharacterSubstitutionInterface:
    def cost(self, c0: str, c1: str) -> float:
        """
        Return the cost of substituting character c0 with c1.
        
        Args:
            c0: Original character
            c1: Replacement character
            
        Returns:
            float: Cost of substitution (typically 0.0-1.0 range)
        """

class CharacterInsDelInterface:
    def deletion_cost(self, c: str) -> float:
        """
        Return the cost of deleting character c.
        
        Args:
            c: Character to delete
            
        Returns:
            float: Cost of deletion
        """
    
    def insertion_cost(self, c: str) -> float:
        """
        Return the cost of inserting character c.
        
        Args:
            c: Character to insert
            
        Returns:
            float: Cost of insertion
        """

Usage Example:

from similarity.weighted_levenshtein import WeightedLevenshtein, CharacterSubstitutionInterface

class OCRSubstitution(CharacterSubstitutionInterface):
    def cost(self, c0, c1):
        # Lower cost for visually similar characters
        if (c0, c1) in [('o', '0'), ('l', '1'), ('S', '5')]:
            return 0.3
        return 1.0

weighted_lev = WeightedLevenshtein(OCRSubstitution())
distance = weighted_lev.distance('S0me1ext', 'Some1ext')  # Lower cost due to custom weights

Optimal String Alignment

A variant of Damerau-Levenshtein that computes edit distance with the restriction that no substring can be edited more than once. Supports insertions, deletions, substitutions, and adjacent character transpositions.

class OptimalStringAlignment(StringDistance):
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Optimal String Alignment distance.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: OSA distance (minimum 0, no maximum limit)
            
        Raises:
            TypeError: If either string is None
        """

Usage Example:

from similarity.optimal_string_alignment import OptimalStringAlignment

osa = OptimalStringAlignment()
distance = osa.distance('CA', 'ABC')  # Returns: 3.0
distance = osa.distance('hello', 'ehllo')  # Returns: 1.0 (transposition)

Install with Tessl CLI