0
# SetSimilaritySearch
1
2
A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.
3
4
## Package Information
5
6
- **Package Name**: SetSimilaritySearch
7
- **Language**: Python
8
- **Installation**:
9
- Pip: `pip install SetSimilaritySearch`
10
- Conda: `conda install -c conda-forge setsimilaritysearch`
11
12
## Core Imports
13
14
```python
15
from SetSimilaritySearch import all_pairs, SearchIndex
16
```
17
18
## Basic Usage
19
20
```python
21
from SetSimilaritySearch import all_pairs, SearchIndex
22
23
# Example sets for similarity search
24
sets = [
25
[1, 2, 3, 4],
26
[2, 3, 4, 5],
27
[1, 3, 5, 7],
28
[4, 5, 6, 7]
29
]
30
31
# Find all pairs with Jaccard similarity >= 0.3
32
print("All pairs with similarity >= 0.3:")
33
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):
34
print(f"Sets {x} and {y}: similarity = {sim:.3f}")
35
36
# Build search index for query-based search
37
index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)
38
39
# Query for sets similar to a target set
40
query_set = [2, 3, 4]
41
results = index.query(query_set)
42
print(f"\nSets similar to {query_set}:")
43
for idx, sim in results:
44
print(f"Set {idx}: similarity = {sim:.3f}")
45
```
46
47
## Architecture
48
49
SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:
50
51
- **All-Pairs Search**: Efficient algorithm for finding all similar pairs in a collection
52
- **Query-based Search**: Index structure for fast similarity queries against large collections
53
- **Similarity Functions**: Support for Jaccard, Cosine, and Containment similarity measures
54
- **Position Filters**: Mathematical optimization techniques that significantly improve performance by pruning impossible candidates
55
- **Frequency Ordering**: Token transformation based on global frequency to speed up prefix filtering and similarity computation
56
- **Performance**: Often faster than approximate methods like MinHash LSH while providing exact results
57
58
## Capabilities
59
60
### All-Pairs Similarity Search
61
62
Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.
63
64
```python { .api }
65
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
66
"""
67
Find all pairs of sets with similarity greater than a threshold.
68
69
Parameters:
70
- sets: list of iterables representing sets
71
- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")
72
- similarity_threshold: float, threshold between 0 and 1.0
73
74
Returns:
75
Iterator of tuples (x, y, similarity) where x, y are set indices
76
77
Raises:
78
ValueError: For invalid inputs or unsupported similarity functions
79
"""
80
```
81
82
[All-Pairs Search](./all-pairs-search.md)
83
84
### Query-based Similarity Search
85
86
Provides an indexed data structure for efficient similarity queries against large collections of sets.
87
88
```python { .api }
89
class SearchIndex:
90
def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
91
"""
92
Build search index for set similarity queries.
93
94
Parameters:
95
- sets: list of iterables representing sets
96
- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")
97
- similarity_threshold: float, threshold between 0 and 1.0
98
99
Raises:
100
ValueError: For invalid inputs or unsupported similarity functions
101
"""
102
103
def query(self, s):
104
"""
105
Query for sets similar to the input set.
106
107
Parameters:
108
- s: iterable representing the query set
109
110
Returns:
111
list of tuples (index, similarity)
112
"""
113
```
114
115
[Query-based Search](./query-search.md)
116
117
### Command Line Tool
118
119
Provides a command-line interface for batch processing of set similarity operations directly from files.
120
121
```python { .api }
122
# Command line script: all_pairs.py
123
# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD
124
125
# Parameters:
126
# --input-sets: One file for self all-pairs, two files for cross-collection pairs
127
# --output-pairs: Output CSV file with results
128
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
129
# --similarity-threshold: Threshold value between 0 and 1
130
# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)
131
# --sample-k: Number of sampled sets from second file for queries (default: all)
132
133
# Input format: Each line contains "SetID Token" pairs
134
# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)
135
```
136
137
[Command Line Usage](./command-line.md)
138
139
## Supported Similarity Functions
140
141
- **jaccard**: Jaccard similarity coefficient (symmetric) - |A ∩ B| / |A ∪ B|
142
- **cosine**: Cosine similarity (symmetric) - |A ∩ B| / √(|A| × |B|)
143
- **containment**: Containment similarity (asymmetric, SearchIndex only) - |A ∩ B| / |A|
144
- **containment_min**: Minimum containment similarity (symmetric) - |A ∩ B| / max(|A|, |B|)
145
146
## Dependencies
147
148
- **numpy**: Required for efficient array operations and similarity computations
149
150
## Error Handling
151
152
The library raises `ValueError` exceptions for:
153
- Empty or invalid input sets
154
- Unsupported similarity function names
155
- Similarity thresholds outside the range [0, 1]
156
- Asymmetric similarity functions used with all_pairs (which requires symmetric functions)