A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.
npx @tessl/cli install tessl/pypi-set-similarity-search@1.0.0A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.
pip install SetSimilaritySearchconda install -c conda-forge setsimilaritysearchfrom SetSimilaritySearch import all_pairs, SearchIndexfrom SetSimilaritySearch import all_pairs, SearchIndex
# Example sets for similarity search
sets = [
[1, 2, 3, 4],
[2, 3, 4, 5],
[1, 3, 5, 7],
[4, 5, 6, 7]
]
# Find all pairs with Jaccard similarity >= 0.3
print("All pairs with similarity >= 0.3:")
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):
print(f"Sets {x} and {y}: similarity = {sim:.3f}")
# Build search index for query-based search
index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)
# Query for sets similar to a target set
query_set = [2, 3, 4]
results = index.query(query_set)
print(f"\nSets similar to {query_set}:")
for idx, sim in results:
print(f"Set {idx}: similarity = {sim:.3f}")SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:
Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
"""
Find all pairs of sets with similarity greater than a threshold.
Parameters:
- sets: list of iterables representing sets
- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")
- similarity_threshold: float, threshold between 0 and 1.0
Returns:
Iterator of tuples (x, y, similarity) where x, y are set indices
Raises:
ValueError: For invalid inputs or unsupported similarity functions
"""Provides an indexed data structure for efficient similarity queries against large collections of sets.
class SearchIndex:
def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
"""
Build search index for set similarity queries.
Parameters:
- sets: list of iterables representing sets
- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")
- similarity_threshold: float, threshold between 0 and 1.0
Raises:
ValueError: For invalid inputs or unsupported similarity functions
"""
def query(self, s):
"""
Query for sets similar to the input set.
Parameters:
- s: iterable representing the query set
Returns:
list of tuples (index, similarity)
"""Provides a command-line interface for batch processing of set similarity operations directly from files.
# Command line script: all_pairs.py
# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD
# Parameters:
# --input-sets: One file for self all-pairs, two files for cross-collection pairs
# --output-pairs: Output CSV file with results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Threshold value between 0 and 1
# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)
# --sample-k: Number of sampled sets from second file for queries (default: all)
# Input format: Each line contains "SetID Token" pairs
# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)The library raises ValueError exceptions for: