tessl/pypi-strsim

A library implementing different string similarity and distance measures

N-Gram and Shingle-Based Algorithms

Algorithms that convert strings into sets or profiles of n-character sequences (shingles or n-grams) and compute similarity based on these representations. These algorithms support both direct string comparison and pre-computed profile optimization for large datasets.

Capabilities

Q-Gram Distance

Q-gram distance as defined by Ukkonen, computed as the L1 norm of the difference between character n-gram profiles. This algorithm provides a lower bound on Levenshtein distance but can be computed in O(m + n) time compared to Levenshtein's O(m*n).

class QGram(ShingleBased, StringDistance):
    def __init__(self, k: int = 3):
        """
        Initialize Q-Gram with shingle size.
        
        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Q-gram distance between two strings.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Q-gram distance (sum of absolute differences in n-gram counts)
            
        Raises:
            TypeError: If either string is None
        """
    
    @staticmethod
    def distance_profile(profile0: dict, profile1: dict) -> float:
        """
        Calculate Q-gram distance between pre-computed profiles.
        
        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string
            
        Returns:
            float: Q-gram distance between profiles
        """

Usage Examples:

from similarity.qgram import QGram

# Basic Q-gram distance with different k values
qgram2 = QGram(2)  # Bigrams
qgram3 = QGram(3)  # Trigrams

distance = qgram2.distance('ABCD', 'ABCE')  # Returns: 2 (bigrams CD and CE differ)
distance = qgram3.distance('hello', 'hallo')  # Returns: 4 (trigrams hel/ell vs hal/all differ)

# Profile-based computation for large datasets
qgram = QGram(2)
strings = ['hello', 'hallo', 'hullo', 'hillo']

# Pre-compute profiles once
profiles = [qgram.get_profile(s) for s in strings]

# Compare all pairs efficiently
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        dist = QGram.distance_profile(profiles[i], profiles[j])
        print(f"'{strings[i]}' vs '{strings[j]}': {dist}")
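
For intuition, the profile-and-L1-norm idea can be sketched from scratch. This is a simplified illustration of the definition, not the library's implementation (the actual `get_profile` may differ in details such as whitespace handling):

```python
from collections import Counter

def qgram_profile(s, k=3):
    # Count every overlapping k-character substring (shingle).
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def qgram_distance(s0, s1, k=3):
    # Q-gram distance is the L1 norm of the profile difference:
    # the sum of absolute differences in n-gram counts.
    p0, p1 = qgram_profile(s0, k), qgram_profile(s1, k)
    return sum(abs(p0[g] - p1[g]) for g in p0.keys() | p1.keys())

print(qgram_distance('ABCD', 'ABCE', k=2))  # 2: only bigrams CD and CE differ
```

Building each profile is a single linear pass over its string, which is where the O(m + n) bound comes from.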

Cosine Similarity

Cosine similarity treats n-gram profiles as vectors and computes the cosine of the angle between them. This normalized measure is effective for comparing documents and text with different lengths.

class Cosine(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Cosine similarity with shingle size.
        
        Args:
            k: Size of character n-grams (shingles) to use
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate cosine distance (1 - cosine similarity).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Cosine distance in range [0.0, 1.0] where 0.0 = identical
            
        Raises:
            TypeError: If either string is None
        """
    
    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate cosine similarity between two strings.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Cosine similarity in range [0.0, 1.0] where 1.0 = identical
            
        Raises:
            TypeError: If either string is None
        """
    
    def similarity_profiles(self, profile0: dict, profile1: dict) -> float:
        """
        Calculate cosine similarity between pre-computed profiles.
        
        Args:
            profile0: N-gram profile dictionary for first string
            profile1: N-gram profile dictionary for second string
            
        Returns:
            float: Cosine similarity between profiles
        """

Usage Examples:

from similarity.cosine import Cosine

# Document similarity with different k values
cosine3 = Cosine(3)
cosine4 = Cosine(4)

# Text comparison
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown fox jumped over a lazy dog"

similarity = cosine3.similarity(text1, text2)
distance = cosine3.distance(text1, text2)
print(f"Cosine similarity: {similarity:.3f}")
print(f"Cosine distance: {distance:.3f}")

# Profile-based computation
cosine = Cosine(2)
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence systems",
    "natural language processing"
]

# Pre-compute profiles
profiles = [cosine.get_profile(doc) for doc in documents]

# Find most similar documents
max_sim = 0
best_pair = None
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = cosine.similarity_profiles(profiles[i], profiles[j])
        if sim > max_sim:
            max_sim = sim
            best_pair = (i, j)

if best_pair:
    i, j = best_pair
    print(f"Most similar: '{documents[i]}' and '{documents[j]}' ({max_sim:.3f})")
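
Under the hood, the profiles are treated as sparse count vectors. A minimal from-scratch sketch of the cosine computation (illustrative only, not the library code):

```python
import math
from collections import Counter

def profile(s, k=3):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_similarity(s0, s1, k=3):
    p0, p1 = profile(s0, k), profile(s1, k)
    # Dot product over shared n-grams, normalized by the vector magnitudes.
    dot = sum(p0[g] * p1[g] for g in p0.keys() & p1.keys())
    norm0 = math.sqrt(sum(c * c for c in p0.values()))
    norm1 = math.sqrt(sum(c * c for c in p1.values()))
    return dot / (norm0 * norm1) if norm0 and norm1 else 0.0
```

Because each vector is normalized by its own magnitude, strings of very different lengths can still score highly when their n-gram distributions are similar.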

Jaccard Index

Jaccard index treats n-gram profiles as sets (ignoring counts) and computes the ratio of intersection to union: |A ∩ B| / |A ∪ B|. Because occurrence counts are discarded, repeated substrings do not inflate the score, which makes this metric effective for comparing texts with different vocabulary sizes.

class Jaccard(ShingleBased, MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int):
        """
        Initialize Jaccard index with shingle size.
        
        Args:
            k: Size of character n-grams (shingles) to use
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard distance (1 - Jaccard similarity).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Jaccard distance in range [0.0, 1.0] where 0.0 = identical
            
        Raises:
            TypeError: If either string is None
        """
    
    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Jaccard similarity between two strings.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Jaccard similarity in range [0.0, 1.0] where 1.0 = identical
            
        Raises:
            TypeError: If either string is None
        """

Usage Examples:

from similarity.jaccard import Jaccard

# Set-based similarity
jaccard2 = Jaccard(2)
jaccard3 = Jaccard(3)

# Compare strings as character sets
similarity = jaccard2.similarity('hello', 'hallo')   # Set overlap
distance = jaccard2.distance('hello', 'hallo')

# Compare different length strings
similarity = jaccard3.similarity('programming', 'program')
print(f"'programming' vs 'program': {similarity:.3f}")

# Document deduplication example
documents = [
    "artificial intelligence machine learning",
    "machine learning artificial intelligence", 
    "deep learning neural networks",
    "neural networks for deep learning"
]

jaccard = Jaccard(3)
threshold = 0.5

print("Similar document pairs (Jaccard > 0.5):")
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        sim = jaccard.similarity(documents[i], documents[j])
        if sim > threshold:
            print(f"  '{documents[i]}' vs '{documents[j]}': {sim:.3f}")
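
Since Jaccard operates on sets of distinct shingles, the definition can be sketched in a few lines (a simplified illustration, not the library implementation):

```python
def shingle_set(s, k=3):
    # Distinct k-grams only; occurrence counts are discarded.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard_similarity(s0, s1, k=3):
    a, b = shingle_set(s0, k), shingle_set(s1, k)
    if not a and not b:
        return 1.0  # convention: two empty profiles count as identical
    return len(a & b) / len(a | b)

print(jaccard_similarity('hello', 'hallo', k=2))  # 2 shared of 6 distinct bigrams
```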

Sorensen-Dice Coefficient

Similar to Jaccard but computed as 2 * |intersection| / (|set1| + |set2|). Counting the overlap twice gives higher weight to matches, so Sorensen-Dice always scores at least as high as Jaccard on the same pair.

class SorensenDice(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
    def __init__(self, k: int = 3):
        """
        Initialize Sorensen-Dice with shingle size.
        
        Args:
            k: Size of character n-grams (shingles) to use (default: 3)
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice distance (1 - Sorensen-Dice similarity).
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Sorensen-Dice distance in range [0.0, 1.0] where 0.0 = identical
            
        Raises:
            TypeError: If either string is None
        """
    
    def similarity(self, s0: str, s1: str) -> float:
        """
        Calculate Sorensen-Dice similarity between two strings.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Sorensen-Dice similarity in range [0.0, 1.0] where 1.0 = identical
            
        Raises:
            TypeError: If either string is None
        """

Usage Examples:

from similarity.sorensen_dice import SorensenDice

# Basic Sorensen-Dice similarity
dice = SorensenDice(2)
similarity = dice.similarity('hello', 'hallo')
distance = dice.distance('hello', 'hallo')

# Compare with Jaccard
from similarity.jaccard import Jaccard
jaccard = Jaccard(2)

test_pairs = [
    ('programming', 'program'),
    ('hello', 'world'),
    ('similar', 'similarity')
]

for s1, s2 in test_pairs:
    dice_sim = dice.similarity(s1, s2)
    jaccard_sim = jaccard.similarity(s1, s2)
    print(f"'{s1}' vs '{s2}':")
    print(f"  Dice: {dice_sim:.3f}, Jaccard: {jaccard_sim:.3f}")
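
The relationship to Jaccard is direct: Dice weights the overlap twice in the numerator. A minimal sketch of the definition (illustrative, not the library code):

```python
def shingle_set(s, k=3):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def dice_similarity(s0, s1, k=3):
    a, b = shingle_set(s0, k), shingle_set(s1, k)
    if not a and not b:
        return 1.0  # convention: two empty profiles count as identical
    # 2 * |intersection| / (|set0| + |set1|): the overlap is counted twice.
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity('hello', 'hallo', k=2))  # 2 * 2 / (4 + 4) = 0.5
```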

N-Gram Distance (Kondrak)

Normalized n-gram distance as defined by Kondrak (2005), which affixes special characters to the beginning of each string so that initial characters carry more weight. This algorithm is designed specifically for string similarity rather than set operations.

class NGram(NormalizedStringDistance):
    def __init__(self, n: int = 2):
        """
        Initialize N-Gram distance with gram size.
        
        Args:
            n: Size of character n-grams to use (default: 2)
        """
    
    def distance(self, s0: str, s1: str) -> float:
        """
        Calculate normalized N-gram distance with special character affixing.
        
        Args:
            s0: First string
            s1: Second string
            
        Returns:
            float: Normalized distance in range [0.0, 1.0] where 0.0 = identical
            
        Raises:
            TypeError: If either string is None
        """

Usage Examples:

from similarity.ngram import NGram

# Different n-gram sizes
twogram = NGram(2)
fourgram = NGram(4)

# String comparison with weighted prefixes
distance = twogram.distance('ABCD', 'ABTUIO')
print(f"2-gram distance: {distance:.3f}")

# Longer strings with different n-gram sizes
s1 = 'Adobe CreativeSuite 5 Master Collection from cheap 4zp'
s2 = 'Adobe CreativeSuite 5 Master Collection from cheap d1x'

distance2 = twogram.distance(s1, s2)
distance4 = fourgram.distance(s1, s2)
print(f"Similar strings - 2-gram: {distance2:.3f}, 4-gram: {distance4:.3f}")

ShingleBased Base Class

All shingle-based algorithms inherit from this base class which provides profile computation:

class ShingleBased:
    def __init__(self, k: int = 3):
        """
        Initialize with shingle size.
        
        Args:
            k: Size of character n-grams (shingles) to use
        """
    
    def get_k(self) -> int:
        """
        Get the current shingle size.
        
        Returns:
            int: Current shingle size
        """
    
    def get_profile(self, string: str) -> dict:
        """
        Convert string to n-gram profile (dictionary of n-gram counts).
        
        Args:
            string: Input string to profile
            
        Returns:
            dict: Dictionary mapping n-grams to their occurrence counts
        """

Profile Usage Example:

from similarity.cosine import Cosine

cosine = Cosine(3)

# Get n-gram profiles
text = "hello world"
profile = cosine.get_profile(text)
print(f"3-gram profile for '{text}': {profile}")

# Profiles can be reused for multiple comparisons
profile1 = cosine.get_profile("machine learning")
profile2 = cosine.get_profile("deep learning")
profile3 = cosine.get_profile("learning algorithms")

similarity_12 = cosine.similarity_profiles(profile1, profile2)
similarity_13 = cosine.similarity_profiles(profile1, profile3)
print(f"ML vs DL: {similarity_12:.3f}")
print(f"ML vs Algorithms: {similarity_13:.3f}")
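
Conceptually, a profile is just a sliding-window count of k-character substrings. A minimal sketch of the idea (the library's `get_profile` may differ in details such as whitespace handling):

```python
from collections import Counter

def make_profile(string, k=3):
    # Slide a window of width k across the string and count each shingle.
    return Counter(string[i:i + k] for i in range(len(string) - k + 1))

print(make_profile('banana', 2))  # counts: an:2, na:2, ba:1
```

A string of length n yields n - k + 1 shingles, so profiling is linear in the input length; the profile can then be reused across arbitrarily many pairwise comparisons.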

Install with Tessl CLI

npx tessl i tessl/pypi-strsim
