Tessl Tile for pypi/setsimilaritysearch@1.0.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

all-pairs-search.md command-line.md index.md query-search.md

all-pairs-search.mddocs/

0
# All-Pairs Similarity Search
1

2
Efficient implementation of the All-Pair-Binary algorithm with position filter enhancement for finding all pairs of sets with similarity above a specified threshold. This approach can significantly outperform brute-force methods and even MinHash LSH algorithms on certain datasets.
3

4
## Capabilities
5

6
### All-Pairs Function
7

8
Finds all pairs of sets with similarity greater than a threshold using optimized algorithms.
9

10
```python { .api }
11
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
12
    """
13
    Find all pairs of sets with similarity greater than a threshold.
14
    This is an implementation of the All-Pair-Binary algorithm in the paper
15
    "Scaling Up All Pairs Similarity Search" by Bayardo et al., with
16
    position filter enhancement.
17

18
    Parameters:
19
    - sets (list): a list of sets, each entry is an iterable representing a
20
        set. Note, it is the caller's responsibility to ensure the elements
21
        in each set are unique, and duplicate elements will cause incorrect
22
        output.
23
    - similarity_func_name (str): the name of the similarity function used;
24
        this function currently supports "jaccard", "cosine", and "containment_min".
25
    - similarity_threshold (float): the threshold used, must be a float
26
        between 0 and 1.0.
27

28
    Returns:
29
    Iterator[tuple]: an iterator of tuples (x, y, similarity)
30
        where x and y are the indices of sets in the input list sets.
31
        
32
    Raises:
33
    ValueError: If sets is not a non-empty list, similarity function is not 
34
        supported, threshold is out of range, or similarity function is not
35
        symmetric.
36
    """
37
```
38

39
## Usage Examples
40

41
### Basic All-Pairs Search
42

43
```python
44
from SetSimilaritySearch import all_pairs
45

46
# Example dataset
47
sets = [
48
    ["apple", "banana", "cherry"],
49
    ["banana", "cherry", "date"],
50
    ["apple", "cherry", "elderberry"],
51
    ["date", "elderberry", "fig"]
52
]
53

54
# Find all pairs with Jaccard similarity >= 0.4
55
pairs = list(all_pairs(
56
    sets, 
57
    similarity_func_name="jaccard", 
58
    similarity_threshold=0.4
59
))
60

61
print("Similar pairs found:")
62
for x, y, sim in pairs:
63
    print(f"Set {x} and Set {y}: {sim:.3f}")
64
    print(f"  Set {x}: {sets[x]}")
65
    print(f"  Set {y}: {sets[y]}")
66
```
67

68
### Using Different Similarity Functions
69

70
```python
71
from SetSimilaritySearch import all_pairs
72

73
# Numeric sets for demonstration
74
numeric_sets = [
75
    [1, 2, 3, 4, 5],
76
    [3, 4, 5, 6, 7],
77
    [1, 3, 5, 7, 9],
78
    [2, 4, 6, 8, 10]
79
]
80

81
# Compare Jaccard vs Cosine similarity
82
print("Jaccard similarity (threshold=0.2):")
83
for x, y, sim in all_pairs(numeric_sets, "jaccard", 0.2):
84
    print(f"Sets {x},{y}: {sim:.3f}")
85

86
print("\nCosine similarity (threshold=0.2):")
87
for x, y, sim in all_pairs(numeric_sets, "cosine", 0.2):
88
    print(f"Sets {x},{y}: {sim:.3f}")
89
```
90

91
### Working with Large Datasets
92

93
```python
94
import random
95
from SetSimilaritySearch import all_pairs
96

97
# Generate larger dataset for performance demonstration
98
num_sets = 1000
99
universe_size = 10000
100
set_size = 100
101

102
# Create random sets
103
large_sets = []
104
for i in range(num_sets):
105
    # Generate random set
106
    elements = random.sample(range(universe_size), set_size)
107
    large_sets.append(elements)
108

109
# Find similar pairs efficiently
110
similar_pairs = list(all_pairs(
111
    large_sets,
112
    similarity_func_name="jaccard",
113
    similarity_threshold=0.7
114
))
115

116
print(f"Found {len(similar_pairs)} similar pairs out of {num_sets} sets")
117
```
118

119
## Supported Similarity Functions
120

121
The all_pairs function supports only **symmetric similarity functions**:
122

123
### Jaccard Similarity
124
- **Formula**: |A ∩ B| / |A ∪ B|
125
- **Range**: [0, 1]
126
- **Use case**: General set similarity, good for sparse sets
127

128
### Cosine Similarity  
129
- **Formula**: |A ∩ B| / √(|A| × |B|)
130
- **Range**: [0, 1]
131
- **Use case**: When set sizes vary significantly
132

133
### Containment Min Similarity
134
- **Formula**: |A ∩ B| / max(|A|, |B|)
135
- **Range**: [0, 1]
136
- **Use case**: Symmetric containment measure, good for sets of varying sizes
137

138
## Algorithm Details
139

140
The implementation uses the All-Pair-Binary algorithm with optimizations:
141

142
1. **Frequency Ordering**: Transforms tokens to integers based on global frequency to speed up operations
143
2. **Prefix Filtering**: Uses mathematical bounds to avoid checking impossible pairs
144
3. **Position Filtering**: Further optimization using positional information in ordered sets
145
4. **Early Termination**: Stops computation when similarity bounds cannot be met
146

147
## Performance Characteristics
148

149
- **Time Complexity**: Significantly better than O(n²) brute force for most real datasets
150
- **Space Complexity**: O(vocabulary_size + output_size)
151
- **Best Performance**: On datasets with skewed token distributions and moderate similarity thresholds
152
- **Memory Usage**: Builds inverted index structure during processing
153

154
## Error Conditions
155

156
The function raises `ValueError` in these cases:
157

158
```python
159
# Empty or invalid input
160
all_pairs([], "jaccard", 0.5)  # ValueError: sets must be non-empty list
161

162
# Unsupported similarity function
163
all_pairs(sets, "invalid", 0.5)  # ValueError: similarity function not supported
164

165
# Invalid threshold
166
all_pairs(sets, "jaccard", -0.1)  # ValueError: threshold must be in [0, 1]
167
all_pairs(sets, "jaccard", 1.5)   # ValueError: threshold must be in [0, 1]
168

169
# Asymmetric similarity function
170
all_pairs(sets, "containment", 0.5)  # ValueError: must be symmetric function
171
```

Version

Tile

Files

all-pairs-search.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

all-pairs-search.mddocs/