Tessl Tile for pypi/setsimilaritysearch@1.0.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

all-pairs-search.md command-line.md index.md query-search.md

index.mddocs/

0
# SetSimilaritySearch
1

2
A Python library providing efficient algorithms for set similarity search operations. It implements both All-Pairs and Query search patterns using Jaccard, Cosine, and Containment similarity functions with optimized position filter techniques that can significantly outperform brute-force approaches.
3

4
## Package Information
5

6
- **Package Name**: SetSimilaritySearch
7
- **Language**: Python
8
- **Installation**: 
9
  - Pip: `pip install SetSimilaritySearch`
10
  - Conda: `conda install -c conda-forge setsimilaritysearch`
11

12
## Core Imports
13

14
```python
15
from SetSimilaritySearch import all_pairs, SearchIndex
16
```
17

18
## Basic Usage
19

20
```python
21
from SetSimilaritySearch import all_pairs, SearchIndex
22

23
# Example sets for similarity search
24
sets = [
25
    [1, 2, 3, 4],
26
    [2, 3, 4, 5], 
27
    [1, 3, 5, 7],
28
    [4, 5, 6, 7]
29
]
30

31
# Find all pairs with Jaccard similarity >= 0.3
32
print("All pairs with similarity >= 0.3:")
33
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.3):
34
    print(f"Sets {x} and {y}: similarity = {sim:.3f}")
35

36
# Build search index for query-based search
37
index = SearchIndex(sets, similarity_func_name="jaccard", similarity_threshold=0.2)
38

39
# Query for sets similar to a target set
40
query_set = [2, 3, 4]
41
results = index.query(query_set)
42
print(f"\nSets similar to {query_set}:")
43
for idx, sim in results:
44
    print(f"Set {idx}: similarity = {sim:.3f}")
45
```
46

47
## Architecture
48

49
SetSimilaritySearch implements the All-Pair-Binary algorithm with position filter enhancements from the "Scaling Up All Pairs Similarity Search" paper. This approach often outperforms brute-force algorithms and can even beat MinHash LSH on certain datasets by leveraging skewness in empirical distributions:
50

51
- **All-Pairs Search**: Efficient algorithm for finding all similar pairs in a collection
52
- **Query-based Search**: Index structure for fast similarity queries against large collections  
53
- **Similarity Functions**: Support for Jaccard, Cosine, and Containment similarity measures
54
- **Position Filters**: Mathematical optimization techniques that significantly improve performance by pruning impossible candidates
55
- **Frequency Ordering**: Token transformation based on global frequency to speed up prefix filtering and similarity computation
56
- **Performance**: Often faster than approximate methods like MinHash LSH while providing exact results
57

58
## Capabilities
59

60
### All-Pairs Similarity Search
61

62
Finds all pairs of sets with similarity above a threshold using the optimized All-Pair-Binary algorithm with position filter enhancement.
63

64
```python { .api }
65
def all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.5):
66
    """
67
    Find all pairs of sets with similarity greater than a threshold.
68
    
69
    Parameters:
70
    - sets: list of iterables representing sets
71
    - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment_min")
72
    - similarity_threshold: float, threshold between 0 and 1.0
73
    
74
    Returns:
75
    Iterator of tuples (x, y, similarity) where x, y are set indices
76
    
77
    Raises:
78
    ValueError: For invalid inputs or unsupported similarity functions
79
    """
80
```
81

82
[All-Pairs Search](./all-pairs-search.md)
83

84
### Query-based Similarity Search
85

86
Provides an indexed data structure for efficient similarity queries against large collections of sets.
87

88
```python { .api }
89
class SearchIndex:
90
    def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
91
        """
92
        Build search index for set similarity queries.
93
        
94
        Parameters:
95
        - sets: list of iterables representing sets
96
        - similarity_func_name: str, similarity function ("jaccard", "cosine", "containment", "containment_min")
97
        - similarity_threshold: float, threshold between 0 and 1.0
98
        
99
        Raises:
100
        ValueError: For invalid inputs or unsupported similarity functions
101
        """
102
    
103
    def query(self, s):
104
        """
105
        Query for sets similar to the input set.
106
        
107
        Parameters:
108
        - s: iterable representing the query set
109
        
110
        Returns:
111
        list of tuples (index, similarity)
112
        """
113
```
114

115
[Query-based Search](./query-search.md)
116

117
### Command Line Tool
118

119
Provides a command-line interface for batch processing of set similarity operations directly from files.
120

121
```python { .api }
122
# Command line script: all_pairs.py
123
# Usage: all_pairs.py --input-sets FILE [FILE] --output-pairs OUTPUT --similarity-func FUNC --similarity-threshold THRESHOLD
124

125
# Parameters:
126
# --input-sets: One file for self all-pairs, two files for cross-collection pairs
127
# --output-pairs: Output CSV file with results  
128
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
129
# --similarity-threshold: Threshold value between 0 and 1
130
# --reversed-tuple: Whether input tuples are (Token SetID) instead of (SetID Token)
131
# --sample-k: Number of sampled sets from second file for queries (default: all)
132

133
# Input format: Each line contains "SetID Token" pairs
134
# Output format: CSV with columns (set_ID_x, set_ID_y, set_size_x, set_size_y, similarity)
135
```
136

137
[Command Line Usage](./command-line.md)
138

139
## Supported Similarity Functions
140

141
- **jaccard**: Jaccard similarity coefficient (symmetric) - |A ∩ B| / |A ∪ B|
142
- **cosine**: Cosine similarity (symmetric) - |A ∩ B| / √(|A| × |B|)
143
- **containment**: Containment similarity (asymmetric, SearchIndex only) - |A ∩ B| / |A|
144
- **containment_min**: Minimum containment similarity (symmetric) - |A ∩ B| / max(|A|, |B|)
145

146
## Dependencies  
147

148
- **numpy**: Required for efficient array operations and similarity computations
149

150
## Error Handling
151

152
The library raises `ValueError` exceptions for:
153
- Empty or invalid input sets
154
- Unsupported similarity function names
155
- Similarity thresholds outside the range [0, 1]
156
- Asymmetric similarity functions used with all_pairs (which requires symmetric functions)

Version

Tile

Files

index.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

index.mddocs/