or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/pypi-set-similarity-search

A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/setsimilaritysearch@1.0.x

To install, run

npx @tessl/cli install tessl/pypi-set-similarity-search@1.0.0

0

# SetSimilaritySearch

1

2

A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.

3

4

## Package Information

5

6

- **Package Name**: SetSimilaritySearch

7

- **Language**: Python

8

- **Installation**:

9

- Pip: `pip install SetSimilaritySearch`

10

- Conda: `conda install -c conda-forge setsimilaritysearch`

11

12

## Core Imports

13

14

```python

15

from SetSimilaritySearch import all_pairs, SearchIndex

16

```

17

18

## Basic Usage

19

20

```python

21

from SetSimilaritySearch import all_pairs, SearchIndex

22

23

# Example sets for similarity search

24

sets = [

25

[1, 2, 3, 4],

26

[2, 3, 4, 5],

27

[1, 3, 5, 7],

28

[4, 5, 6, 7]

29

]

30

31

# Find all pairs with Jaccard similarity >= 0.3

32

print("All pairs with similarity >= 0.3:")

33

for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):

34

print(f"Sets {x} and {y}: similarity = {sim:.3f}")

35

36

# Build search index for query-based search

37

index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)

38

39

# Query for sets similar to a target set

40

query_set = [2, 3, 4]

41

results = index.query(query_set)

42

print(f"\nSets similar to {query_set}:")

43

for idx, sim in results:

44

print(f"Set {idx}: similarity = {sim:.3f}")

45

```

46

47

## Architecture

48

49

SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:

50

51

- **All-Pairs Search**: Efficient algorithm for finding all similar pairs in a collection

52

- **Query-based Search**: Index structure for fast similarity queries against large collections

53

- **Similarity Functions**: Support for Jaccard, Cosine, and Containment similarity measures

54

- **Position Filters**: Mathematical optimization techniques that significantly improve performance by pruning impossible candidates

55

- **Frequency Ordering**: Token transformation based on global frequency to speed up prefix filtering and similarity computation

56

- **Performance**: Often faster than approximate methods like MinHash LSH while providing exact results

57

58

## Capabilities

59

60

### All-Pairs Similarity Search

61

62

Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.

63

64

```python { .api }

65

def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):

66

"""

67

Find all pairs of sets with similarity greater than a threshold.

68

69

Parameters:

70

- sets: list of iterables representing sets

71

- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")

72

- similarity_threshold: float, threshold between 0 and 1.0

73

74

Returns:

75

Iterator of tuples (x, y, similarity) where x, y are set indices

76

77

Raises:

78

ValueError: For invalid inputs or unsupported similarity functions

79

"""

80

```

81

82

[All-Pairs Search](./all-pairs-search.md)

83

84

### Query-based Similarity Search

85

86

Provides an indexed data structure for efficient similarity queries against large collections of sets.

87

88

```python { .api }

89

class SearchIndex:

90

def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):

91

"""

92

Build search index for set similarity queries.

93

94

Parameters:

95

- sets: list of iterables representing sets

96

- similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")

97

- similarity_threshold: float, threshold between 0 and 1.0

98

99

Raises:

100

ValueError: For invalid inputs or unsupported similarity functions

101

"""

102

103

def query(self, s):

104

"""

105

Query for sets similar to the input set.

106

107

Parameters:

108

- s: iterable representing the query set

109

110

Returns:

111

list of tuples (index, similarity)

112

"""

113

```

114

115

[Query-based Search](./query-search.md)

116

117

### Command Line Tool

118

119

Provides a command-line interface for batch processing of set similarity operations directly from files.

120

121

```python { .api }

122

# Command line script: all_pairs.py

123

# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD

124

125

# Parameters:

126

# --input-sets: One file for self all-pairs, two files for cross-collection pairs

127

# --output-pairs: Output CSV file with results

128

# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)

129

# --similarity-threshold: Threshold value between 0 and 1

130

# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)

131

# --sample-k: Number of sampled sets from second file for queries (default: all)

132

133

# Input format: Each line contains "SetID Token" pairs

134

# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)

135

```

136

137

[Command Line Usage](./command-line.md)

138

139

## Supported Similarity Functions

140

141

- **jaccard**: Jaccard similarity coefficient (symmetric) - |A ∩ B| / |A ∪ B|

142

- **cosine**: Cosine similarity (symmetric) - |A ∩ B| / √(|A| × |B|)

143

- **containment**: Containment similarity (asymmetric, SearchIndex only) - |A ∩ B| / |A|

144

- **containment_min**: Minimum containment similarity (symmetric) - |A ∩ B| / max(|A|, |B|)

145

146

## Dependencies

147

148

- **numpy**: Required for efficient array operations and similarity computations

149

150

## Error Handling

151

152

The library raises `ValueError` exceptions for:

153

- Empty or invalid input sets

154

- Unsupported similarity function names

155

- Similarity thresholds outside the range [0, 1]

156

- Asymmetric similarity functions used with all_pairs (which requires symmetric functions)