or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

all-pairs-search.mdcommand-line.mdindex.mdquery-search.md

all-pairs-search.mddocs/

0

# All-Pairs Similarity Search

1

2

Efficient implementation of the All-Pair-Binary algorithm with position filter enhancement for finding all pairs of sets with similarity above a specified threshold. This approach can significantly outperform brute-force methods and even MinHash LSH algorithms on certain datasets.

3

4

## Capabilities

5

6

### All-Pairs Function

7

8

Finds all pairs of sets with similarity greater than a threshold using optimized algorithms.

9

10

```python { .api }

11

def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):

12

"""

13

Find all pairs of sets with similarity greater than a threshold.

14

This is an implementation of the All-Pair-Binary algorithm in the paper

15

"Scaling Up All Pairs Similarity Search" by Bayardo et al., with

16

position filter enhancement.

17

18

Parameters:

19

- sets (list): a list of sets, each entry is an iterable representing a

20

set. Note, it is the caller's responsibility to ensure the elements

21

in each set are unique, and duplicate elements will cause incorrect

22

output.

23

- similarity_func_name (str): the name of the similarity function used;

24

this function currently supports "jaccard", "cosine", and "containment_min".

25

- similarity_threshold (float): the threshold used, must be a float

26

between 0 and 1.0.

27

28

Returns:

29

Iterator[tuple]: an iterator of tuples (x, y, similarity)

30

where x and y are the indices of sets in the input list sets.

31

32

Raises:

33

ValueError: If sets is not a non-empty list, similarity function is not

34

supported, threshold is out of range, or similarity function is not

35

symmetric.

36

"""

37

```

38

39

## Usage Examples

40

41

### Basic All-Pairs Search

42

43

```python

44

from SetSimilaritySearch import all_pairs

45

46

# Example dataset

47

sets = [

48

["apple", "banana", "cherry"],

49

["banana", "cherry", "date"],

50

["apple", "cherry", "elderberry"],

51

["date", "elderberry", "fig"]

52

]

53

54

# Find all pairs with Jaccard similarity >= 0.4

55

pairs = list(all_pairs(

56

sets,

57

similarity_func_name="jaccard",

58

similarity_threshold=0.4

59

))

60

61

print("Similar pairs found:")

62

for x, y, sim in pairs:

63

print(f"Set {x} and Set {y}: {sim:.3f}")

64

print(f" Set {x}: {sets[x]}")

65

print(f" Set {y}: {sets[y]}")

66

```

67

68

### Using Different Similarity Functions

69

70

```python

71

from SetSimilaritySearch import all_pairs

72

73

# Numeric sets for demonstration

74

numeric_sets = [

75

[1, 2, 3, 4, 5],

76

[3, 4, 5, 6, 7],

77

[1, 3, 5, 7, 9],

78

[2, 4, 6, 8, 10]

79

]

80

81

# Compare Jaccard vs Cosine similarity

82

print("Jaccard similarity (threshold=0.2):")

83

for x, y, sim in all_pairs(numeric_sets, "jaccard", 0.2):

84

print(f"Sets {x},{y}: {sim:.3f}")

85

86

print("\nCosine similarity (threshold=0.2):")

87

for x, y, sim in all_pairs(numeric_sets, "cosine", 0.2):

88

print(f"Sets {x},{y}: {sim:.3f}")

89

```

90

91

### Working with Large Datasets

92

93

```python

94

import random

95

from SetSimilaritySearch import all_pairs

96

97

# Generate larger dataset for performance demonstration

98

num_sets = 1000

99

universe_size = 10000

100

set_size = 100

101

102

# Create random sets

103

large_sets = []

104

for i in range(num_sets):

105

# Generate random set

106

elements = random.sample(range(universe_size), set_size)

107

large_sets.append(elements)

108

109

# Find similar pairs efficiently

110

similar_pairs = list(all_pairs(

111

large_sets,

112

similarity_func_name="jaccard",

113

similarity_threshold=0.7

114

))

115

116

print(f"Found {len(similar_pairs)} similar pairs out of {num_sets} sets")

117

```

118

119

## Supported Similarity Functions

120

121

The all_pairs function supports only **symmetric similarity functions**:

122

123

### Jaccard Similarity

124

- **Formula**: |A ∩ B| / |A ∪ B|

125

- **Range**: [0, 1]

126

- **Use case**: General set similarity, good for sparse sets

127

128

### Cosine Similarity

129

- **Formula**: |A ∩ B| / √(|A| × |B|)

130

- **Range**: [0, 1]

131

- **Use case**: When set sizes vary significantly

132

133

### Containment Min Similarity

134

- **Formula**: |A ∩ B| / max(|A|, |B|)

135

- **Range**: [0, 1]

136

- **Use case**: Symmetric containment measure, good for sets of varying sizes

137

138

## Algorithm Details

139

140

The implementation uses the All-Pair-Binary algorithm with optimizations:

141

142

1. **Frequency Ordering**: Transforms tokens to integers based on global frequency to speed up operations

143

2. **Prefix Filtering**: Uses mathematical bounds to avoid checking impossible pairs

144

3. **Position Filtering**: Further optimization using positional information in ordered sets

145

4. **Early Termination**: Stops computation when similarity bounds cannot be met

146

147

## Performance Characteristics

148

149

- **Time Complexity**: Significantly better than O(n²) brute force for most real datasets

150

- **Space Complexity**: O(vocabulary_size + output_size)

151

- **Best Performance**: On datasets with skewed token distributions and moderate similarity thresholds

152

- **Memory Usage**: Builds inverted index structure during processing

153

154

## Error Conditions

155

156

The function raises `ValueError` in these cases:

157

158

```python

159

# Empty or invalid input

160

all_pairs([], "jaccard", 0.5) # ValueError: sets must be non-empty list

161

162

# Unsupported similarity function

163

all_pairs(sets, "invalid", 0.5) # ValueError: similarity function not supported

164

165

# Invalid threshold

166

all_pairs(sets, "jaccard", -0.1) # ValueError: threshold must be in [0, 1]

167

all_pairs(sets, "jaccard", 1.5) # ValueError: threshold must be in [0, 1]

168

169

# Asymmetric similarity function

170

all_pairs(sets, "containment", 0.5) # ValueError: must be symmetric function

171

```