CtrlK
BlogDocsLog inGet started
Tessl Logo

tessl/pypi-set-similarity-search

A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.

Pending

Quality

Pending

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Overview
Eval results
Files

index.mddocs/

SetSimilaritySearch

A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.

Package Information

  • Package Name: SetSimilaritySearch
  • Language: Python
  • Installation:
    • Pip: pip install SetSimilaritySearch
    • Conda: conda install -c conda-forge setsimilaritysearch

Core Imports

from SetSimilaritySearch import all_pairs, SearchIndex

Basic Usage

from SetSimilaritySearch import all_pairs, SearchIndex

# Example sets for similarity search
sets = [
    [1, 2, 3, 4],
    [2, 3, 4, 5], 
    [1, 3, 5, 7],
    [4, 5, 6, 7]
]

# Find all pairs with Jaccard similarity >= 0.3
print("All pairs with similarity >= 0.3:")
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):
    print(f"Sets {x} and {y}: similarity = {sim:.3f}")

# Build search index for query-based search
index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)

# Query for sets similar to a target set
query_set = [2, 3, 4]
results = index.query(query_set)
print(f"\nSets similar to {query_set}:")
for idx, sim in results:
    print(f"Set {idx}: similarity = {sim:.3f}")

Architecture

SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:

  • All-Pairs Search: Efficient algorithm for finding all similar pairs in a collection
  • Query-based Search: Index structure for fast similarity queries against large collections
  • Similarity Functions: Support for Jaccard, Cosine, and Containment similarity measures
  • Position Filters: Mathematical optimization techniques that significantly improve performance by pruning impossible candidates
  • Frequency Ordering: Token transformation based on global frequency to speed up prefix filtering and similarity computation
  • Performance: Often faster than approximate methods like MinHash LSH while providing exact results

Capabilities

All-Pairs Similarity Search

Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.

def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
    """
    Find all pairs of sets with similarity greater than a threshold.
    
    Parameters:
    - sets: list of iterables representing sets
    - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")
    - similarity_threshold: float, threshold between 0 and 1.0
    
    Returns:
    Iterator of tuples (x, y, similarity) where x, y are set indices
    
    Raises:
    ValueError: For invalid inputs or unsupported similarity functions
    """

All-Pairs Search

Query-based Similarity Search

Provides an indexed data structure for efficient similarity queries against large collections of sets.

class SearchIndex:
    def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
        """
        Build search index for set similarity queries.
        
        Parameters:
        - sets: list of iterables representing sets
        - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")
        - similarity_threshold: float, threshold between 0 and 1.0
        
        Raises:
        ValueError: For invalid inputs or unsupported similarity functions
        """
    
    def query(self, s):
        """
        Query for sets similar to the input set.
        
        Parameters:
        - s: iterable representing the query set
        
        Returns:
        list of tuples (index, similarity)
        """

Query-based Search

Command Line Tool

Provides a command-line interface for batch processing of set similarity operations directly from files.

# Command line script: all_pairs.py
# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD

# Parameters:
# --input-sets: One file for self all-pairs, two files for cross-collection pairs
# --output-pairs: Output CSV file with results  
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Threshold value between 0 and 1
# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)
# --sample-k: Number of sampled sets from second file for queries (default: all)

# Input format: Each line contains "SetID Token" pairs
# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)

Command Line Usage

Supported Similarity Functions

  • jaccard: Jaccard similarity coefficient (symmetric) - |A ∩ B| / |A ∪ B|
  • cosine: Cosine similarity (symmetric) - |A ∩ B| / √(|A| × |B|)
  • containment: Containment similarity (asymmetric, SearchIndex only) - |A ∩ B| / |A|
  • containment_min: Minimum containment similarity (symmetric) - |A ∩ B| / max(|A|, |B|)

Dependencies

  • numpy: Required for efficient array operations and similarity computations

Error Handling

The library raises ValueError exceptions for:

  • Empty or invalid input sets
  • Unsupported similarity function names
  • Similarity thresholds outside the range [0, 1]
  • Asymmetric similarity functions used with all_pairs (which requires symmetric functions)

Install with Tessl CLI

npx tessl i tessl/pypi-set-similarity-search

docs

all-pairs-search.md

command-line.md

index.md

query-search.md

tile.json