0
# All-Pairs Similarity Search
1
2
Efficient implementation of the All-Pair-Binary algorithm with position filter enhancement for finding all pairs of sets with similarity above a specified threshold. This approach can significantly outperform brute-force methods and even MinHash LSH algorithms on certain datasets.
3
4
## Capabilities
5
6
### All-Pairs Function
7
8
Finds all pairs of sets with similarity greater than a threshold using optimized algorithms.
9
10
```python { .api }
11
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
12
"""
13
Find all pairs of sets with similarity greater than a threshold.
14
This is an implementation of the All-Pair-Binary algorithm in the paper
15
"Scaling Up All Pairs Similarity Search" by Bayardo et al., with
16
position filter enhancement.
17
18
Parameters:
19
- sets (list): a list of sets, each entry is an iterable representing a
20
set. Note, it is the caller's responsibility to ensure the elements
21
in each set are unique, and duplicate elements will cause incorrect
22
output.
23
- similarity_func_name (str): the name of the similarity function used;
24
this function currently supports "jaccard", "cosine", and "containment_min".
25
- similarity_threshold (float): the threshold used, must be a float
26
between 0 and 1.0.
27
28
Returns:
29
Iterator[tuple]: an iterator of tuples (x, y, similarity)
30
where x and y are the indices of sets in the input list sets.
31
32
Raises:
33
ValueError: If sets is not a non-empty list, similarity function is not
34
supported, threshold is out of range, or similarity function is not
35
symmetric.
36
"""
37
```
38
39
## Usage Examples
40
41
### Basic All-Pairs Search
42
43
```python
44
from SetSimilaritySearch import all_pairs
45
46
# Example dataset
47
sets = [
48
["apple", "banana", "cherry"],
49
["banana", "cherry", "date"],
50
["apple", "cherry", "elderberry"],
51
["date", "elderberry", "fig"]
52
]
53
54
# Find all pairs with Jaccard similarity >= 0.4
55
pairs = list(all_pairs(
56
sets,
57
similarity_func_name="jaccard",
58
similarity_threshold=0.4
59
))
60
61
print("Similar pairs found:")
62
for x, y, sim in pairs:
63
print(f"Set {x} and Set {y}: {sim:.3f}")
64
print(f" Set {x}: {sets[x]}")
65
print(f" Set {y}: {sets[y]}")
66
```
67
68
### Using Different Similarity Functions
69
70
```python
71
from SetSimilaritySearch import all_pairs
72
73
# Numeric sets for demonstration
74
numeric_sets = [
75
[1, 2, 3, 4, 5],
76
[3, 4, 5, 6, 7],
77
[1, 3, 5, 7, 9],
78
[2, 4, 6, 8, 10]
79
]
80
81
# Compare Jaccard vs Cosine similarity
82
print("Jaccard similarity (threshold=0.2):")
83
for x, y, sim in all_pairs(numeric_sets, "jaccard", 0.2):
84
print(f"Sets {x},{y}: {sim:.3f}")
85
86
print("\nCosine similarity (threshold=0.2):")
87
for x, y, sim in all_pairs(numeric_sets, "cosine", 0.2):
88
print(f"Sets {x},{y}: {sim:.3f}")
89
```
90
91
### Working with Large Datasets
92
93
```python
94
import random
95
from SetSimilaritySearch import all_pairs
96
97
# Generate larger dataset for performance demonstration
98
num_sets = 1000
99
universe_size = 10000
100
set_size = 100
101
102
# Create random sets
103
large_sets = []
104
for i in range(num_sets):
105
# Generate random set
106
elements = random.sample(range(universe_size), set_size)
107
large_sets.append(elements)
108
109
# Find similar pairs efficiently
110
similar_pairs = list(all_pairs(
111
large_sets,
112
similarity_func_name="jaccard",
113
similarity_threshold=0.7
114
))
115
116
print(f"Found {len(similar_pairs)} similar pairs out of {num_sets} sets")
117
```
118
119
## Supported Similarity Functions
120
121
The all_pairs function supports only **symmetric similarity functions**:
122
123
### Jaccard Similarity
124
- **Formula**: |A ∩ B| / |A ∪ B|
125
- **Range**: [0, 1]
126
- **Use case**: General set similarity, good for sparse sets
127
128
### Cosine Similarity
129
- **Formula**: |A ∩ B| / √(|A| × |B|)
130
- **Range**: [0, 1]
131
- **Use case**: When set sizes vary significantly
132
133
### Containment Min Similarity
134
- **Formula**: |A ∩ B| / max(|A|, |B|)
135
- **Range**: [0, 1]
136
- **Use case**: Symmetric containment measure, good for sets of varying sizes
137
138
## Algorithm Details
139
140
The implementation uses the All-Pair-Binary algorithm with optimizations:
141
142
1. **Frequency Ordering**: Transforms tokens to integers based on global frequency to speed up operations
143
2. **Prefix Filtering**: Uses mathematical bounds to avoid checking impossible pairs
144
3. **Position Filtering**: Further optimization using positional information in ordered sets
145
4. **Early Termination**: Stops computation when similarity bounds cannot be met
146
147
## Performance Characteristics
148
149
- **Time Complexity**: Significantly better than O(n²) brute force for most real datasets
150
- **Space Complexity**: O(vocabulary_size + output_size)
151
- **Best Performance**: On datasets with skewed token distributions and moderate similarity thresholds
152
- **Memory Usage**: Builds inverted index structure during processing
153
154
## Error Conditions
155
156
The function raises `ValueError` in these cases:
157
158
```python
159
# Empty or invalid input
160
all_pairs([], "jaccard", 0.5) # ValueError: sets must be non-empty list
161
162
# Unsupported similarity function
163
all_pairs(sets, "invalid", 0.5) # ValueError: similarity function not supported
164
165
# Invalid threshold
166
all_pairs(sets, "jaccard", -0.1) # ValueError: threshold must be in [0, 1]
167
all_pairs(sets, "jaccard", 1.5) # ValueError: threshold must be in [0, 1]
168
169
# Asymmetric similarity function
170
all_pairs(sets, "containment", 0.5) # ValueError: must be symmetric function
171
```