CtrlK
BlogDocsLog inGet started
Tessl Logo

tessl/pypi-set-similarity-search

A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.

Pending

Quality

Pending

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Overview
Eval results
Files

all-pairs-search.mddocs/

All-Pairs Similarity Search

Efficient implementation of the All-Pair-Binary algorithm with position filter enhancement for finding all pairs of sets with similarity above a specified threshold. This approach can significantly outperform brute-force methods and even MinHash LSH algorithms on certain datasets.

Capabilities

All-Pairs Function

Finds all pairs of sets with similarity greater than a threshold using optimized algorithms.

def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
    """
    Find all pairs of sets with similarity greater than a threshold.
    This is an implementation of the All-Pair-Binary algorithm in the paper
    "Scaling Up All Pairs Similarity Search" by Bayardo et al., with
    position filter enhancement.

    Parameters:
    - sets (list): a list of sets, each entry is an iterable representing a
        set. Note, it is the caller's responsibility to ensure the elements
        in each set are unique, and duplicate elements will cause incorrect
        output.
    - similarity_func_name (str): the name of the similarity function used;
        this function currently supports "jaccard", "cosine", and "containment_min".
    - similarity_threshold (float): the threshold used, must be a float
        between 0 and 1.0.

    Returns:
    Iterator[tuple]: an iterator of tuples (x, y, similarity)
        where x and y are the indices of sets in the input list sets.
        
    Raises:
    ValueError: If sets is not a non-empty list, similarity function is not 
        supported, threshold is out of range, or similarity function is not
        symmetric.
    """

Usage Examples

Basic All-Pairs Search

from SetSimilaritySearch import all_pairs

# Example dataset
sets = [
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
    ["apple", "cherry", "elderberry"],
    ["date", "elderberry", "fig"]
]

# Find all pairs with Jaccard similarity >= 0.4
pairs = list(all_pairs(
    sets, 
    similarity_func_name="jaccard", 
    similarity_threshold=0.4
))

print("Similar pairs found:")
for x, y, sim in pairs:
    print(f"Set {x} and Set {y}: {sim:.3f}")
    print(f"  Set {x}: {sets[x]}")
    print(f"  Set {y}: {sets[y]}")

Using Different Similarity Functions

from SetSimilaritySearch import all_pairs

# Numeric sets for demonstration
numeric_sets = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [1, 3, 5, 7, 9],
    [2, 4, 6, 8, 10]
]

# Compare Jaccard vs Cosine similarity
print("Jaccard similarity (threshold=0.2):")
for x, y, sim in all_pairs(numeric_sets, "jaccard", 0.2):
    print(f"Sets {x},{y}: {sim:.3f}")

print("\nCosine similarity (threshold=0.2):")
for x, y, sim in all_pairs(numeric_sets, "cosine", 0.2):
    print(f"Sets {x},{y}: {sim:.3f}")

Working with Large Datasets

import random
from SetSimilaritySearch import all_pairs

# Generate larger dataset for performance demonstration
num_sets = 1000
universe_size = 10000
set_size = 100

# Create random sets
large_sets = []
for i in range(num_sets):
    # Generate random set
    elements = random.sample(range(universe_size), set_size)
    large_sets.append(elements)

# Find similar pairs efficiently
similar_pairs = list(all_pairs(
    large_sets,
    similarity_func_name="jaccard",
    similarity_threshold=0.7
))

print(f"Found {len(similar_pairs)} similar pairs out of {num_sets} sets")

Supported Similarity Functions

The all_pairs function supports only symmetric similarity functions:

Jaccard Similarity

  • Formula: |A ∩ B| / |A ∪ B|
  • Range: [0, 1]
  • Use case: General set similarity, good for sparse sets

Cosine Similarity

  • Formula: |A ∩ B| / √(|A| × |B|)
  • Range: [0, 1]
  • Use case: When set sizes vary significantly

Containment Min Similarity

  • Formula: |A ∩ B| / max(|A|, |B|)
  • Range: [0, 1]
  • Use case: Symmetric containment measure, good for sets of varying sizes

Algorithm Details

The implementation uses the All-Pair-Binary algorithm with optimizations:

  1. Frequency Ordering: Transforms tokens to integers based on global frequency to speed up operations
  2. Prefix Filtering: Uses mathematical bounds to avoid checking impossible pairs
  3. Position Filtering: Further optimization using positional information in ordered sets
  4. Early Termination: Stops computation when similarity bounds cannot be met

Performance Characteristics

  • Time Complexity: Significantly better than O(n²) brute force for most real datasets
  • Space Complexity: O(vocabulary_size + output_size)
  • Best Performance: On datasets with skewed token distributions and moderate similarity thresholds
  • Memory Usage: Builds inverted index structure during processing

Error Conditions

The function raises ValueError in these cases:

# Empty or invalid input
all_pairs([], "jaccard", 0.5)  # ValueError: sets must be non-empty list

# Unsupported similarity function
all_pairs(sets, "invalid", 0.5)  # ValueError: similarity function not supported

# Invalid threshold
all_pairs(sets, "jaccard", -0.1)  # ValueError: threshold must be in [0, 1]
all_pairs(sets, "jaccard", 1.5)   # ValueError: threshold must be in [0, 1]

# Asymmetric similarity function
all_pairs(sets, "containment", 0.5)  # ValueError: must be symmetric function

Install with Tessl CLI

npx tessl i tessl/pypi-set-similarity-search

docs

all-pairs-search.md

command-line.md

index.md

query-search.md

tile.json