Tessl Tile for pypi/setsimilaritysearch@1.0.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

all-pairs-search.md command-line.md index.md query-search.md

query-search.mddocs/

0
# Query-based Similarity Search
1

2
The SearchIndex class provides an efficient data structure for similarity search queries against large collections of sets. It combines prefix filter and position filter techniques to enable fast similarity queries without scanning the entire collection.
3

4
## Capabilities
5

6
### SearchIndex Class
7

8
A data structure that supports set similarity search queries with pre-built indexing for optimal query performance.
9

10
```python { .api }
11
class SearchIndex:
12
    def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
13
        """
14
        Build a search index for set similarity queries.
15
        The algorithm is a combination of the prefix filter and position filter
16
        techniques.
17

18
        Parameters:
19
        - sets (list): a list of sets, each entry is an iterable representing a
20
            set. Note, it is the caller's responsibility to ensure the elements
21
            in each set are unique, and duplicate elements will cause incorrect
22
            result.
23
        - similarity_func_name (str): the name of the similarity function used;
24
            this function currently supports "jaccard", "cosine", "containment",
25
            and "containment_min".
26
        - similarity_threshold (float): the threshold used, must be a float
27
            between 0 and 1.0.
28
            
29
        Raises:
30
        ValueError: If sets is not a non-empty list, similarity function is not
31
            supported, or threshold is out of range.
32
        """
33
    
34
    def query(self, s):
35
        """
36
        Query the search index for sets similar to the query set.
37

38
        Parameters:
39
        - s (Iterable): the query set with unique elements.
40

41
        Returns:
42
        list: a list of tuples (index, similarity) where the index
43
            is the index of the matching sets in the original list of sets.
44
        """
45
```
46

47
## Usage Examples
48

49
### Basic Query Search
50

51
```python
52
from SetSimilaritySearch import SearchIndex
53

54
# Build index on a collection of sets
55
document_sets = [
56
    ["python", "programming", "language"],
57
    ["java", "programming", "language", "object"],
58
    ["python", "data", "science", "analysis"],
59
    ["java", "enterprise", "development"],
60
    ["programming", "tutorial", "beginner"]
61
]
62

63
# Create search index with Jaccard similarity
64
index = SearchIndex(
65
    document_sets,
66
    similarity_func_name="jaccard",
67
    similarity_threshold=0.2
68
)
69

70
# Query for sets similar to a target
71
query_set = ["python", "programming"]
72
results = index.query(query_set)
73

74
print(f"Sets similar to {query_set}:")
75
for idx, similarity in results:
76
    print(f"  Set {idx}: {document_sets[idx]} (similarity: {similarity:.3f})")
77
```
78

79
### Using Different Similarity Functions
80

81
```python
82
from SetSimilaritySearch import SearchIndex
83

84
# Product feature sets
85
products = [
86
    ["wireless", "bluetooth", "headphones", "music"],
87
    ["bluetooth", "speaker", "portable", "music"],
88
    ["headphones", "noise", "canceling", "premium"],
89
    ["wireless", "earbuds", "compact", "sport"],
90
    ["speaker", "home", "smart", "voice"]
91
]
92

93
# Compare different similarity functions
94
query = ["wireless", "bluetooth", "music"]
95

96
# Jaccard similarity
97
jaccard_index = SearchIndex(products, "jaccard", 0.1)
98
jaccard_results = jaccard_index.query(query)
99
print("Jaccard results:")
100
for idx, sim in jaccard_results:
101
    print(f"  Product {idx}: {sim:.3f}")
102

103
# Cosine similarity
104
cosine_index = SearchIndex(products, "cosine", 0.1)
105
cosine_results = cosine_index.query(query)
106
print("\nCosine results:")
107
for idx, sim in cosine_results:
108
    print(f"  Product {idx}: {sim:.3f}")
109

110
# Containment similarity (query containment in indexed sets)
111
containment_index = SearchIndex(products, "containment", 0.3)
112
containment_results = containment_index.query(query)
113
print("\nContainment results:")
114
for idx, sim in containment_results:
115
    print(f"  Product {idx}: {sim:.3f}")
116
```
117

118
### High-Performance Querying
119

120
```python
121
from SetSimilaritySearch import SearchIndex
122
import time
123

124
# Large collection simulation
125
num_docs = 10000
126
vocabulary = [f"term_{i}" for i in range(1000)]
127

128
# Generate document sets
129
import random
130
documents = []
131
for i in range(num_docs):
132
    doc_size = random.randint(10, 50)
133
    doc_terms = random.sample(vocabulary, doc_size)
134
    documents.append(doc_terms)
135

136
# Build index (one-time cost)
137
print("Building search index...")
138
start_time = time.time()
139
index = SearchIndex(documents, "jaccard", 0.3)
140
build_time = time.time() - start_time
141
print(f"Index built in {build_time:.2f} seconds")
142

143
# Perform multiple queries (fast)
144
query_times = []
145
for _ in range(100):
146
    query_terms = random.sample(vocabulary, 15)
147
    
148
    start_time = time.time()
149
    results = index.query(query_terms)
150
    query_time = time.time() - start_time
151
    query_times.append(query_time)
152

153
avg_query_time = sum(query_times) / len(query_times)
154
print(f"Average query time: {avg_query_time*1000:.2f} ms")
155
print(f"Average results per query: {sum(len(index.query(random.sample(vocabulary, 15))) for _ in range(10))/10:.1f}")
156
```
157

158
### Working with Real-World Data
159

160
```python
161
from SetSimilaritySearch import SearchIndex
162

163
# Example: Finding similar user interests
164
user_interests = [
165
    ["photography", "travel", "nature", "hiking"],
166
    ["cooking", "food", "travel", "culture"],
167
    ["programming", "technology", "ai", "machine_learning"],
168
    ["photography", "art", "design", "creativity"],
169
    ["hiking", "outdoor", "fitness", "nature"],
170
    ["travel", "adventure", "photography", "culture"],
171
    ["technology", "programming", "startups", "innovation"]
172
]
173

174
# Build index for user matching
175
user_index = SearchIndex(
176
    user_interests,
177
    similarity_func_name="jaccard",
178
    similarity_threshold=0.25
179
)
180

181
# Find users with similar interests
182
target_user_interests = ["travel", "photography", "adventure"]
183
similar_users = user_index.query(target_user_interests)
184

185
print(f"Users with interests similar to {target_user_interests}:")
186
for user_idx, similarity in similar_users:
187
    print(f"  User {user_idx}: {user_interests[user_idx]} (similarity: {similarity:.3f})")
188
```
189

190
## Supported Similarity Functions
191

192
The SearchIndex supports both symmetric and asymmetric similarity functions:
193

194
### Symmetric Functions
195
- **jaccard**: |A ∩ B| / |A ∪ B|
196
- **cosine**: |A ∩ B| / √(|A| × |B|)
197
- **containment_min**: |A ∩ B| / max(|A|, |B|)
198

199
### Asymmetric Functions
200
- **containment**: |A ∩ B| / |A| (query containment in indexed sets)
201

202
## Index Structure and Performance
203

204
### Index Building Process
205
1. **Frequency Ordering**: Tokens are transformed to integers based on global frequency
206
2. **Prefix Computation**: Each set's prefix is calculated based on similarity threshold
207
3. **Inverted Index**: Maps tokens to (set_index, position) pairs for fast lookup
208
4. **Position Filters**: Precomputed filters for efficient candidate pruning
209

210
### Query Process
211
1. **Query Transformation**: Apply same frequency ordering to query set
212
2. **Candidate Generation**: Use prefix tokens to find potential matches
213
3. **Position Filtering**: Apply position filters to prune false candidates
214
4. **Similarity Verification**: Compute exact similarity for remaining candidates
215

216
### Performance Characteristics
217
- **Index Build Time**: O(total_tokens + vocabulary_size)
218
- **Query Time**: Sub-linear in collection size for most real datasets
219
- **Memory Usage**: O(vocabulary_size + total_prefixes)
220
- **Best Performance**: Sparse sets with skewed token distributions
221

222
## Error Conditions
223

224
The SearchIndex raises `ValueError` in these cases:
225

226
```python
227
# Empty or invalid input during construction
228
SearchIndex([], "jaccard", 0.5)  # ValueError: sets must be non-empty list
229

230
# Unsupported similarity function
231
SearchIndex(sets, "invalid", 0.5)  # ValueError: similarity function not supported
232

233
# Invalid threshold
234
SearchIndex(sets, "jaccard", -0.1)  # ValueError: threshold must be in [0, 1]
235
SearchIndex(sets, "jaccard", 1.5)   # ValueError: threshold must be in [0, 1]
236

237
# No errors on query - handles empty queries gracefully
238
index.query([])  # Returns empty list
239
index.query(["nonexistent_token"])  # Returns empty list
240
```
241

242
## Memory and Scalability Considerations
243

244
### Memory Usage
245
- Index size grows with vocabulary size and collection size
246
- Memory usage is approximately: `O(vocabulary_size + sum(prefix_sizes))`
247
- Typical memory usage: 10-100 MB for collections of 100K sets
248

249
### Scalability Guidelines
250
- **Small collections** (<1K sets): All similarity functions perform well
251
- **Medium collections** (1K-100K sets): Prefer Jaccard or Cosine for best performance
252
- **Large collections** (>100K sets): Consider vocabulary pruning for containment similarity
253
- **Very large collections**: May require distributed indexing approaches

Version

Tile

Files

query-search.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

query-search.mddocs/