A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.
—
Quality
Pending
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Efficient implementation of the All-Pair-Binary algorithm with position filter enhancement for finding all pairs of sets with similarity above a specified threshold. This approach can significantly outperform brute-force methods and even MinHash LSH algorithms on certain datasets.
Finds all pairs of sets with similarity greater than a threshold using optimized algorithms.
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
"""
Find all pairs of sets with similarity greater than a threshold.
This is an implementation of the All-Pair-Binary algorithm in the paper
"Scaling Up All Pairs Similarity Search" by Bayardo et al., with
position filter enhancement.
Parameters:
- sets (list): a list of sets, each entry is an iterable representing a
set. Note, it is the caller's responsibility to ensure the elements
in each set are unique, and duplicate elements will cause incorrect
output.
- similarity_func_name (str): the name of the similarity function used;
this function currently supports "jaccard", "cosine", and "containment_min".
- similarity_threshold (float): the threshold used, must be a float
between 0 and 1.0.
Returns:
Iterator[tuple]: an iterator of tuples (x, y, similarity)
where x and y are the indices of sets in the input list sets.
Raises:
ValueError: If sets is not a non-empty list, similarity function is not
supported, threshold is out of range, or similarity function is not
symmetric.
"""from SetSimilaritySearch import all_pairs
# Example dataset
sets = [
["apple", "banana", "cherry"],
["banana", "cherry", "date"],
["apple", "cherry", "elderberry"],
["date", "elderberry", "fig"]
]
# Find all pairs with Jaccard similarity >= 0.4
pairs = list(all_pairs(
sets,
similarity_func_name="jaccard",
similarity_threshold=0.4
))
print("Similar pairs found:")
for x, y, sim in pairs:
print(f"Set {x} and Set {y}: {sim:.3f}")
print(f" Set {x}: {sets[x]}")
print(f" Set {y}: {sets[y]}")from SetSimilaritySearch import all_pairs
# Numeric sets for demonstration
numeric_sets = [
[1, 2, 3, 4, 5],
[3, 4, 5, 6, 7],
[1, 3, 5, 7, 9],
[2, 4, 6, 8, 10]
]
# Compare Jaccard vs Cosine similarity
print("Jaccard similarity (threshold=0.2):")
for x, y, sim in all_pairs(numeric_sets, "jaccard", 0.2):
print(f"Sets {x},{y}: {sim:.3f}")
print("\nCosine similarity (threshold=0.2):")
for x, y, sim in all_pairs(numeric_sets, "cosine", 0.2):
print(f"Sets {x},{y}: {sim:.3f}")import random
from SetSimilaritySearch import all_pairs
# Generate larger dataset for performance demonstration
num_sets = 1000
universe_size = 10000
set_size = 100
# Create random sets
large_sets = []
for i in range(num_sets):
# Generate random set
elements = random.sample(range(universe_size), set_size)
large_sets.append(elements)
# Find similar pairs efficiently
similar_pairs = list(all_pairs(
large_sets,
similarity_func_name="jaccard",
similarity_threshold=0.7
))
print(f"Found {len(similar_pairs)} similar pairs out of {num_sets} sets")The all_pairs function supports only symmetric similarity functions:
The implementation uses the All-Pair-Binary algorithm with optimizations:
The function raises ValueError in these cases:
# Empty or invalid input
all_pairs([], "jaccard", 0.5) # ValueError: sets must be non-empty list
# Unsupported similarity function
all_pairs(sets, "invalid", 0.5) # ValueError: similarity function not supported
# Invalid threshold
all_pairs(sets, "jaccard", -0.1) # ValueError: threshold must be in [0, 1]
all_pairs(sets, "jaccard", 1.5) # ValueError: threshold must be in [0, 1]
# Asymmetric similarity function
all_pairs(sets, "containment", 0.5) # ValueError: must be symmetric functionInstall with Tessl CLI
npx tessl i tessl/pypi-set-similarity-search