or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

all-pairs-search.mdcommand-line.mdindex.mdquery-search.md

query-search.mddocs/

0

# Query-based Similarity Search

1

2

The SearchIndex class provides an efficient data structure for similarity search queries against large collections of sets. It combines prefix filter and position filter techniques to enable fast similarity queries without scanning the entire collection.

3

4

## Capabilities

5

6

### SearchIndex Class

7

8

A data structure that supports set similarity search queries with pre-built indexing for optimal query performance.

9

10

```python { .api }

11

class SearchIndex:

12

def __init__(self, sets, similarity_func_name="jaccard", similarity_threshold=0.5):

13

"""

14

Build a search index for set similarity queries.

15

The algorithm is a combination of the prefix filter and position filter

16

techniques.

17

18

Parameters:

19

- sets (list): a list of sets, each entry is an iterable representing a

20

set. Note, it is the caller's responsibility to ensure the elements

21

in each set are unique, and duplicate elements will cause incorrect

22

result.

23

- similarity_func_name (str): the name of the similarity function used;

24

this function currently supports "jaccard", "cosine", "containment",

25

and "containment_min".

26

- similarity_threshold (float): the threshold used, must be a float

27

between 0 and 1.0.

28

29

Raises:

30

ValueError: If sets is not a non-empty list, similarity function is not

31

supported, or threshold is out of range.

32

"""

33

34

def query(self, s):

35

"""

36

Query the search index for sets similar to the query set.

37

38

Parameters:

39

- s (Iterable): the query set with unique elements.

40

41

Returns:

42

list: a list of tuples (index, similarity) where the index

43

is the index of the matching sets in the original list of sets.

44

"""

45

```

46

47

## Usage Examples

48

49

### Basic Query Search

50

51

```python

52

from SetSimilaritySearch import SearchIndex

53

54

# Build index on a collection of sets

55

document_sets = [

56

["python", "programming", "language"],

57

["java", "programming", "language", "object"],

58

["python", "data", "science", "analysis"],

59

["java", "enterprise", "development"],

60

["programming", "tutorial", "beginner"]

61

]

62

63

# Create search index with Jaccard similarity

64

index = SearchIndex(

65

document_sets,

66

similarity_func_name="jaccard",

67

similarity_threshold=0.2

68

)

69

70

# Query for sets similar to a target

71

query_set = ["python", "programming"]

72

results = index.query(query_set)

73

74

print(f"Sets similar to {query_set}:")

75

for idx, similarity in results:

76

print(f" Set {idx}: {document_sets[idx]} (similarity: {similarity:.3f})")

77

```

78

79

### Using Different Similarity Functions

80

81

```python

82

from SetSimilaritySearch import SearchIndex

83

84

# Product feature sets

85

products = [

86

["wireless", "bluetooth", "headphones", "music"],

87

["bluetooth", "speaker", "portable", "music"],

88

["headphones", "noise", "canceling", "premium"],

89

["wireless", "earbuds", "compact", "sport"],

90

["speaker", "home", "smart", "voice"]

91

]

92

93

# Compare different similarity functions

94

query = ["wireless", "bluetooth", "music"]

95

96

# Jaccard similarity

97

jaccard_index = SearchIndex(products, "jaccard", 0.1)

98

jaccard_results = jaccard_index.query(query)

99

print("Jaccard results:")

100

for idx, sim in jaccard_results:

101

print(f" Product {idx}: {sim:.3f}")

102

103

# Cosine similarity

104

cosine_index = SearchIndex(products, "cosine", 0.1)

105

cosine_results = cosine_index.query(query)

106

print("\nCosine results:")

107

for idx, sim in cosine_results:

108

print(f" Product {idx}: {sim:.3f}")

109

110

# Containment similarity (query containment in indexed sets)

111

containment_index = SearchIndex(products, "containment", 0.3)

112

containment_results = containment_index.query(query)

113

print("\nContainment results:")

114

for idx, sim in containment_results:

115

print(f" Product {idx}: {sim:.3f}")

116

```

117

118

### High-Performance Querying

119

120

```python

121

from SetSimilaritySearch import SearchIndex

122

import time

123

124

# Large collection simulation

125

num_docs = 10000

126

vocabulary = [f"term_{i}" for i in range(1000)]

127

128

# Generate document sets

129

import random

130

documents = []

131

for i in range(num_docs):

132

doc_size = random.randint(10, 50)

133

doc_terms = random.sample(vocabulary, doc_size)

134

documents.append(doc_terms)

135

136

# Build index (one-time cost)

137

print("Building search index...")

138

start_time = time.time()

139

index = SearchIndex(documents, "jaccard", 0.3)

140

build_time = time.time() - start_time

141

print(f"Index built in {build_time:.2f} seconds")

142

143

# Perform multiple queries (fast)

144

query_times = []

145

for _ in range(100):

146

query_terms = random.sample(vocabulary, 15)

147

148

start_time = time.time()

149

results = index.query(query_terms)

150

query_time = time.time() - start_time

151

query_times.append(query_time)

152

153

avg_query_time = sum(query_times) / len(query_times)

154

print(f"Average query time: {avg_query_time*1000:.2f} ms")

155

print(f"Average results per query: {sum(len(index.query(random.sample(vocabulary, 15))) for _ in range(10))/10:.1f}")

156

```

157

158

### Working with Real-World Data

159

160

```python

161

from SetSimilaritySearch import SearchIndex

162

163

# Example: Finding similar user interests

164

user_interests = [

165

["photography", "travel", "nature", "hiking"],

166

["cooking", "food", "travel", "culture"],

167

["programming", "technology", "ai", "machine_learning"],

168

["photography", "art", "design", "creativity"],

169

["hiking", "outdoor", "fitness", "nature"],

170

["travel", "adventure", "photography", "culture"],

171

["technology", "programming", "startups", "innovation"]

172

]

173

174

# Build index for user matching

175

user_index = SearchIndex(

176

user_interests,

177

similarity_func_name="jaccard",

178

similarity_threshold=0.25

179

)

180

181

# Find users with similar interests

182

target_user_interests = ["travel", "photography", "adventure"]

183

similar_users = user_index.query(target_user_interests)

184

185

print(f"Users with interests similar to {target_user_interests}:")

186

for user_idx, similarity in similar_users:

187

print(f" User {user_idx}: {user_interests[user_idx]} (similarity: {similarity:.3f})")

188

```

189

190

## Supported Similarity Functions

191

192

The SearchIndex supports both symmetric and asymmetric similarity functions:

193

194

### Symmetric Functions

195

- **jaccard**: |A ∩ B| / |A ∪ B|

196

- **cosine**: |A ∩ B| / √(|A| × |B|)

197

- **containment_min**: |A ∩ B| / max(|A|, |B|)

198

199

### Asymmetric Functions

200

- **containment**: |A ∩ B| / |A| (query containment in indexed sets)

201

202

## Index Structure and Performance

203

204

### Index Building Process

205

1. **Frequency Ordering**: Tokens are transformed to integers based on global frequency

206

2. **Prefix Computation**: Each set's prefix is calculated based on similarity threshold

207

3. **Inverted Index**: Maps tokens to (set_index, position) pairs for fast lookup

208

4. **Position Filters**: Precomputed filters for efficient candidate pruning

209

210

### Query Process

211

1. **Query Transformation**: Apply same frequency ordering to query set

212

2. **Candidate Generation**: Use prefix tokens to find potential matches

213

3. **Position Filtering**: Apply position filters to prune false candidates

214

4. **Similarity Verification**: Compute exact similarity for remaining candidates

215

216

### Performance Characteristics

217

- **Index Build Time**: O(total_tokens + vocabulary_size)

218

- **Query Time**: Sub-linear in collection size for most real datasets

219

- **Memory Usage**: O(vocabulary_size + total_prefixes)

220

- **Best Performance**: Sparse sets with skewed token distributions

221

222

## Error Conditions

223

224

The SearchIndex raises `ValueError` in these cases:

225

226

```python

227

# Empty or invalid input during construction

228

SearchIndex([], "jaccard", 0.5) # ValueError: sets must be non-empty list

229

230

# Unsupported similarity function

231

SearchIndex(sets, "invalid", 0.5) # ValueError: similarity function not supported

232

233

# Invalid threshold

234

SearchIndex(sets, "jaccard", -0.1) # ValueError: threshold must be in [0, 1]

235

SearchIndex(sets, "jaccard", 1.5) # ValueError: threshold must be in [0, 1]

236

237

# No errors on query - handles empty queries gracefully

238

index.query([]) # Returns empty list

239

index.query(["nonexistent_token"]) # Returns empty list

240

```

241

242

## Memory and Scalability Considerations

243

244

### Memory Usage

245

- Index size grows with vocabulary size and collection size

246

- Memory usage is approximately: `O(vocabulary_size + sum(prefix_sizes))`

247

- Typical memory usage: 10-100 MB for collections of 100K sets

248

249

### Scalability Guidelines

250

- **Small collections** (<1K sets): All similarity functions perform well

251

- **Medium collections** (1K-100K sets): Prefer Jaccard or Cosine for best performance

252

- **Large collections** (>100K sets): Consider vocabulary pruning for containment similarity

253

- **Very large collections**: May require distributed indexing approaches