or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

docs

all-pairs-search.mdcommand-line.mdindex.mdquery-search.md
tile.json

tessl/pypi-set-similarity-search

A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/setsimilaritysearch@1.0.x

To install, run

npx @tessl/cli install tessl/pypi-set-similarity-search@1.0.0

index.mddocs/

SetSimilaritySearch

A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.

Package Information

  • Package Name: SetSimilaritySearch
  • Language: Python
  • Installation:
    • Pip: pip install SetSimilaritySearch
    • Conda: conda install -c conda-forge setsimilaritysearch

Core Imports

from SetSimilaritySearch import all_pairs, SearchIndex

Basic Usage

from SetSimilaritySearch import all_pairs, SearchIndex

# Example sets for similarity search
sets = [
    [1, 2, 3, 4],
    [2, 3, 4, 5], 
    [1, 3, 5, 7],
    [4, 5, 6, 7]
]

# Find all pairs with Jaccard similarity >= 0.3
print("All pairs with similarity >= 0.3:")
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):
    print(f"Sets {x} and {y}: similarity = {sim:.3f}")

# Build search index for query-based search
index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)

# Query for sets similar to a target set
query_set = [2, 3, 4]
results = index.query(query_set)
print(f"\nSets similar to {query_set}:")
for idx, sim in results:
    print(f"Set {idx}: similarity = {sim:.3f}")

Architecture

SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:

  • All-Pairs Search: Efficient algorithm for finding all similar pairs in a collection
  • Query-based Search: Index structure for fast similarity queries against large collections
  • Similarity Functions: Support for Jaccard, Cosine, and Containment similarity measures
  • Position Filters: Mathematical optimization techniques that significantly improve performance by pruning impossible candidates
  • Frequency Ordering: Token transformation based on global frequency to speed up prefix filtering and similarity computation
  • Performance: Often faster than approximate methods like MinHash LSH while providing exact results

Capabilities

All-Pairs Similarity Search

Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.

def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
    """
    Find all pairs of sets with similarity greater than a threshold.
    
    Parameters:
    - sets: list of iterables representing sets
    - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")
    - similarity_threshold: float, threshold between 0 and 1.0
    
    Returns:
    Iterator of tuples (x, y, similarity) where x, y are set indices
    
    Raises:
    ValueError: For invalid inputs or unsupported similarity functions
    """

All-Pairs Search

Query-based Similarity Search

Provides an indexed data structure for efficient similarity queries against large collections of sets.

class SearchIndex:
    def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
        """
        Build search index for set similarity queries.
        
        Parameters:
        - sets: list of iterables representing sets
        - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")
        - similarity_threshold: float, threshold between 0 and 1.0
        
        Raises:
        ValueError: For invalid inputs or unsupported similarity functions
        """
    
    def query(self, s):
        """
        Query for sets similar to the input set.
        
        Parameters:
        - s: iterable representing the query set
        
        Returns:
        list of tuples (index, similarity)
        """

Query-based Search

Command Line Tool

Provides a command-line interface for batch processing of set similarity operations directly from files.

# Command line script: all_pairs.py
# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD

# Parameters:
# --input-sets: One file for self all-pairs, two files for cross-collection pairs
# --output-pairs: Output CSV file with results  
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Threshold value between 0 and 1
# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)
# --sample-k: Number of sampled sets from second file for queries (default: all)

# Input format: Each line contains "SetID Token" pairs
# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)

Command Line Usage

Supported Similarity Functions

  • jaccard: Jaccard similarity coefficient (symmetric) - |A ∩ B| / |A ∪ B|
  • cosine: Cosine similarity (symmetric) - |A ∩ B| / √(|A| × |B|)
  • containment: Containment similarity (asymmetric, SearchIndex only) - |A ∩ B| / |A|
  • containment_min: Minimum containment similarity (symmetric) - |A ∩ B| / max(|A|, |B|)

Dependencies

  • numpy: Required for efficient array operations and similarity computations

Error Handling

The library raises ValueError exceptions for:

  • Empty or invalid input sets
  • Unsupported similarity function names
  • Similarity thresholds outside the range [0, 1]
  • Asymmetric similarity functions used with all_pairs (which requires symmetric functions)