0
# Query-based Similarity Search
1
2
The SearchIndex class provides an efficient data structure for similarity search queries against large collections of sets. It combines prefix filter and position filter techniques to enable fast similarity queries without scanning the entire collection.
3
4
## Capabilities
5
6
### SearchIndex Class
7
8
A data structure that supports set similarity search queries with pre-built indexing for optimal query performance.
9
10
```python { .api }
11
class SearchIndex:
12
def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):
13
"""
14
Build a search index for set similarity queries.
15
The algorithm is a combination of the prefix filter and position filter
16
techniques.
17
18
Parameters:
19
- sets (list): a list of sets, each entry is an iterable representing a
20
set. Note, it is the caller's responsibility to ensure the elements
21
in each set are unique, and duplicate elements will cause incorrect
22
result.
23
- similarity_func_name (str): the name of the similarity function used;
24
this function currently supports "jaccard", "cosine", "containment",
25
and "containment_min".
26
- similarity_threshold (float): the threshold used, must be a float
27
between 0 and 1.0.
28
29
Raises:
30
ValueError: If sets is not a non-empty list, similarity function is not
31
supported, or threshold is out of range.
32
"""
33
34
def query(self, s):
35
"""
36
Query the search index for sets similar to the query set.
37
38
Parameters:
39
- s (Iterable): the query set with unique elements.
40
41
Returns:
42
list: a list of tuples (index, similarity) where the index
43
is the index of the matching sets in the original list of sets.
44
"""
45
```
46
47
## Usage Examples
48
49
### Basic Query Search
50
51
```python
52
from SetSimilaritySearch import SearchIndex
53
54
# Build index on a collection of sets
55
document_sets = [
56
["python", "programming", "language"],
57
["java", "programming", "language", "object"],
58
["python", "data", "science", "analysis"],
59
["java", "enterprise", "development"],
60
["programming", "tutorial", "beginner"]
61
]
62
63
# Create search index with Jaccard similarity
64
index = SearchIndex(
65
document_sets,
66
similarity_func_name="jaccard",
67
similarity_threshold=0.2
68
)
69
70
# Query for sets similar to a target
71
query_set = ["python", "programming"]
72
results = index.query(query_set)
73
74
print(f"Sets similar to {query_set}:")
75
for idx, similarity in results:
76
print(f" Set {idx}: {document_sets[idx]} (similarity: {similarity:.3f})")
77
```
78
79
### Using Different Similarity Functions
80
81
```python
82
from SetSimilaritySearch import SearchIndex
83
84
# Product feature sets
85
products = [
86
["wireless", "bluetooth", "headphones", "music"],
87
["bluetooth", "speaker", "portable", "music"],
88
["headphones", "noise", "canceling", "premium"],
89
["wireless", "earbuds", "compact", "sport"],
90
["speaker", "home", "smart", "voice"]
91
]
92
93
# Compare different similarity functions
94
query = ["wireless", "bluetooth", "music"]
95
96
# Jaccard similarity
97
jaccard_index = SearchIndex(products, "jaccard", 0.1)
98
jaccard_results = jaccard_index.query(query)
99
print("Jaccard results:")
100
for idx, sim in jaccard_results:
101
print(f" Product {idx}: {sim:.3f}")
102
103
# Cosine similarity
104
cosine_index = SearchIndex(products, "cosine", 0.1)
105
cosine_results = cosine_index.query(query)
106
print("\nCosine results:")
107
for idx, sim in cosine_results:
108
print(f" Product {idx}: {sim:.3f}")
109
110
# Containment similarity (query containment in indexed sets)
111
containment_index = SearchIndex(products, "containment", 0.3)
112
containment_results = containment_index.query(query)
113
print("\nContainment results:")
114
for idx, sim in containment_results:
115
print(f" Product {idx}: {sim:.3f}")
116
```
117
118
### High-Performance Querying
119
120
```python
121
from SetSimilaritySearch import SearchIndex
122
import time
123
124
# Large collection simulation
125
num_docs = 10000
126
vocabulary = [f"term_{i}" for i in range(1000)]
127
128
# Generate document sets
129
import random
130
documents = []
131
for i in range(num_docs):
132
doc_size = random.randint(10, 50)
133
doc_terms = random.sample(vocabulary, doc_size)
134
documents.append(doc_terms)
135
136
# Build index (one-time cost)
137
print("Building search index...")
138
start_time = time.time()
139
index = SearchIndex(documents, "jaccard", 0.3)
140
build_time = time.time() - start_time
141
print(f"Index built in {build_time:.2f} seconds")
142
143
# Perform multiple queries (fast)
144
query_times = []
145
for _ in range(100):
146
query_terms = random.sample(vocabulary, 15)
147
148
start_time = time.time()
149
results = index.query(query_terms)
150
query_time = time.time() - start_time
151
query_times.append(query_time)
152
153
avg_query_time = sum(query_times) / len(query_times)
154
print(f"Average query time: {avg_query_time*1000:.2f} ms")
155
print(f"Average results per query: {sum(len(index.query(random.sample(vocabulary, 15))) for _ in range(10))/10:.1f}")
156
```
157
158
### Working with Real-World Data
159
160
```python
161
from SetSimilaritySearch import SearchIndex
162
163
# Example: Finding similar user interests
164
user_interests = [
165
["photography", "travel", "nature", "hiking"],
166
["cooking", "food", "travel", "culture"],
167
["programming", "technology", "ai", "machine_learning"],
168
["photography", "art", "design", "creativity"],
169
["hiking", "outdoor", "fitness", "nature"],
170
["travel", "adventure", "photography", "culture"],
171
["technology", "programming", "startups", "innovation"]
172
]
173
174
# Build index for user matching
175
user_index = SearchIndex(
176
user_interests,
177
similarity_func_name="jaccard",
178
similarity_threshold=0.25
179
)
180
181
# Find users with similar interests
182
target_user_interests = ["travel", "photography", "adventure"]
183
similar_users = user_index.query(target_user_interests)
184
185
print(f"Users with interests similar to {target_user_interests}:")
186
for user_idx, similarity in similar_users:
187
print(f" User {user_idx}: {user_interests[user_idx]} (similarity: {similarity:.3f})")
188
```
189
190
## Supported Similarity Functions
191
192
The SearchIndex supports both symmetric and asymmetric similarity functions:
193
194
### Symmetric Functions
195
- **jaccard**: |A ∩ B| / |A ∪ B|
196
- **cosine**: |A ∩ B| / √(|A| × |B|)
197
- **containment_min**: |A ∩ B| / max(|A|, |B|)
198
199
### Asymmetric Functions
200
- **containment**: |A ∩ B| / |A| (query containment in indexed sets)
201
202
## Index Structure and Performance
203
204
### Index Building Process
205
1. **Frequency Ordering**: Tokens are transformed to integers based on global frequency
206
2. **Prefix Computation**: Each set's prefix is calculated based on similarity threshold
207
3. **Inverted Index**: Maps tokens to (set_index, position) pairs for fast lookup
208
4. **Position Filters**: Precomputed filters for efficient candidate pruning
209
210
### Query Process
211
1. **Query Transformation**: Apply same frequency ordering to query set
212
2. **Candidate Generation**: Use prefix tokens to find potential matches
213
3. **Position Filtering**: Apply position filters to prune false candidates
214
4. **Similarity Verification**: Compute exact similarity for remaining candidates
215
216
### Performance Characteristics
217
- **Index Build Time**: O(total_tokens + vocabulary_size)
218
- **Query Time**: Sub-linear in collection size for most real datasets
219
- **Memory Usage**: O(vocabulary_size + total_prefixes)
220
- **Best Performance**: Sparse sets with skewed token distributions
221
222
## Error Conditions
223
224
The SearchIndex raises `ValueError` in these cases:
225
226
```python
227
# Empty or invalid input during construction
228
SearchIndex([], "jaccard", 0.5) # ValueError: sets must be non-empty list
229
230
# Unsupported similarity function
231
SearchIndex(sets, "invalid", 0.5) # ValueError: similarity function not supported
232
233
# Invalid threshold
234
SearchIndex(sets, "jaccard", -0.1) # ValueError: threshold must be in [0, 1]
235
SearchIndex(sets, "jaccard", 1.5) # ValueError: threshold must be in [0, 1]
236
237
# No errors on query - handles empty queries gracefully
238
index.query([]) # Returns empty list
239
index.query(["nonexistent_token"]) # Returns empty list
240
```
241
242
## Memory and Scalability Considerations
243
244
### Memory Usage
245
- Index size grows with vocabulary size and collection size
246
- Memory usage is approximately: `O(vocabulary_size + sum(prefix_sizes))`
247
- Typical memory usage: 10-100 MB for collections of 100K sets
248
249
### Scalability Guidelines
250
- **Small collections** (<1K sets): All similarity functions perform well
251
- **Medium collections** (1K-100K sets): Prefer Jaccard or Cosine for best performance
252
- **Large collections** (>100K sets): Consider vocabulary pruning for containment similarity
253
- **Very large collections**: May require distributed indexing approaches