# Command Line Tool

The SetSimilaritySearch package includes a command-line script `all_pairs.py` for batch processing of set similarity operations. This tool is useful for processing large datasets stored in files without writing custom Python code.

## Capabilities

### All-Pairs Command Line Interface

Processes set similarity operations from the command line, with file input and output support.

```bash { .api }
all_pairs.py --input-sets FILE [FILE] \
    --output-pairs OUTPUT \
    --similarity-func FUNC \
    --similarity-threshold THRESHOLD \
    [--reversed-tuple BOOL] \
    [--sample-k INT]

# Parameters:
# --input-sets: Input file(s) with SetID-Token pairs
#   - One file: Computes all-pairs within the collection (self-join)
#   - Two files: Computes cross-collection pairs (join between collections)
# --output-pairs: Output CSV file path for results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Similarity threshold (float between 0 and 1)
# --reversed-tuple: Whether input format is "Token SetID" instead of "SetID Token" (default: false)
# --sample-k: Number of sets to sample from second file for queries (default: use all sets)
```
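
For reference, the CLI performs the same all-pairs computation that the package exposes through its Python API. The following is a minimal in-process sketch, assuming the package's `all_pairs` function; the sets and parameter values mirror the CLI options above:

```python
# Minimal sketch (assumes the SetSimilaritySearch Python API): run the same
# all-pairs computation the CLI performs, on a small in-memory collection.
from SetSimilaritySearch import all_pairs

# Sets are given as lists of tokens; list positions 0, 1, 2 stand in for SetIDs.
sets = [
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
    ["apple", "elderberry"],
]

# Equivalent to --similarity-func jaccard --similarity-threshold 0.2
pairs = all_pairs(sets, similarity_func_name="jaccard",
                  similarity_threshold=0.2)
for x, y, similarity in pairs:
    print(x, y, similarity)
```
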
## Usage Examples

### Self All-Pairs Processing

Find all similar pairs within a single collection:

```bash
# Input file format (example_sets.txt):
# Each line: SetID Token
# Lines starting with # are ignored
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

# Run all-pairs search
all_pairs.py --input-sets example_sets.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.2
```

Output CSV format:
```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
2,1,3,3,0.500
3,1,2,3,0.250
```
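
The reported values can be checked by hand with plain Python sets; this sanity check uses the three example sets from above:

```python
# Verify the example similarities with ordinary set arithmetic.
s1 = {"apple", "banana", "cherry"}
s2 = {"banana", "cherry", "date"}
s3 = {"apple", "elderberry"}

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(jaccard(s2, s1))  # 2 / 4 = 0.5  -> reported as 0.500
print(jaccard(s3, s1))  # 1 / 4 = 0.25 -> reported as 0.250
print(jaccard(s3, s2))  # 0 / 5 = 0.0  -> below the 0.2 threshold, not reported
```
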
### Cross-Collection Processing

Find similar pairs between two different collections:

```bash
# Collection 1 (documents.txt)
doc1 word1
doc1 word2
doc1 word3
doc2 word2
doc2 word4

# Collection 2 (queries.txt)
query1 word1
query1 word2
query2 word3
query2 word4

# Find cross-collection similarities
all_pairs.py --input-sets documents.txt queries.txt \
    --output-pairs cross_results.csv \
    --similarity-func cosine \
    --similarity-threshold 0.1
```
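
Cross-collection mode indexes the first collection and then queries it with each set from the second. Below is a rough in-process equivalent, assuming the package's `SearchIndex` class and its `query` method; identifiers and data are taken from the example above:

```python
# Sketch of cross-collection search (assumes the SetSimilaritySearch
# SearchIndex API): index the first collection, query with the second.
from SetSimilaritySearch import SearchIndex

documents = {
    "doc1": ["word1", "word2", "word3"],
    "doc2": ["word2", "word4"],
}
queries = {
    "query1": ["word1", "word2"],
    "query2": ["word3", "word4"],
}

doc_ids = list(documents)
index = SearchIndex([documents[i] for i in doc_ids],
                    similarity_func_name="cosine",
                    similarity_threshold=0.1)

for query_id, tokens in queries.items():
    # query() yields (position, similarity) pairs above the threshold
    for position, similarity in index.query(tokens):
        print(query_id, doc_ids[position], similarity)
```
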
### Large Dataset Processing with Sampling

Process large datasets efficiently by sampling queries:

```bash
# Process only 1000 sampled queries from the second collection
all_pairs.py --input-sets large_index.txt large_queries.txt \
    --output-pairs sampled_results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.5 \
    --sample-k 1000
```
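
The tool handles the sampling itself, so no extra preparation is required. Purely for illustration, the sketch below shows the general idea of sampling query sets before running them; the file name and column order are the hypothetical ones from the example:

```python
# Illustration only: load "SetID Token" lines and pick a random sample of
# up to 1000 set IDs to use as queries, mimicking the effect of --sample-k.
import random
from collections import defaultdict

sets = defaultdict(set)
with open("large_queries.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        set_id, token = line.split(None, 1)
        sets[set_id].add(token)

sampled_ids = random.sample(sorted(sets), k=min(1000, len(sets)))
```
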
### Reversed Tuple Format

Handle input files where tokens come before set IDs:

```bash
# Input format: Token SetID (instead of SetID Token)
apple doc1
banana doc1
cherry doc1
banana doc2

all_pairs.py --input-sets reversed_format.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.2 \
    --reversed-tuple true
```
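
If you would rather normalize the file once instead of passing `--reversed-tuple`, a small (hypothetical) conversion script can swap the columns:

```python
# Hypothetical helper: rewrite "Token SetID" lines as "SetID Token" lines.
with open("reversed_format.txt") as src, open("standard_format.txt", "w") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        token, set_id = line.split(None, 1)
        dst.write(f"{set_id} {token}\n")
```
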
## Input File Format

The input files must follow a specific format:

```
# Comments start with # and are ignored
# Each line contains exactly two space/tab-separated values
# Format: SetID Token (or Token SetID if --reversed-tuple is used)
# Every line must be unique

1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
```

Requirements:
- Each line represents one element belonging to one set
- SetID and Token are separated by whitespace (space or tab)
- All (SetID, Token) pairs must be unique
- Lines starting with `#` are treated as comments and ignored
- Empty lines are ignored
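
For reference, a file in this format is straightforward to generate from an in-memory mapping of sets; the sketch below uses the same made-up data as the earlier example:

```python
# Sketch: write a "SetID Token" input file from a dict of sets.
sets = {
    "1": {"apple", "banana", "cherry"},
    "2": {"banana", "cherry", "date"},
    "3": {"apple", "elderberry"},
}
with open("example_sets.txt", "w") as f:
    f.write("# Format: SetID Token\n")
    for set_id, tokens in sets.items():
        for token in sorted(tokens):
            f.write(f"{set_id} {token}\n")
```
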
## Output Format

The command-line tool outputs results in CSV format with the following columns:

| Column | Description |
|--------|-------------|
| `set_ID_x` | ID of the first set in the pair |
| `set_ID_y` | ID of the second set in the pair |
| `set_size_x` | Number of elements in the first set |
| `set_size_y` | Number of elements in the second set |
| `similarity` | Computed similarity value |

Example output:
```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.500
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.333
```
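
Because the output is plain CSV, it can be post-processed with standard tooling; for example, with Python's built-in `csv` module (the file name and cutoff below are illustrative):

```python
# Read the output CSV and keep only pairs above a stricter cutoff.
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if float(row["similarity"]) >= 0.4:
            print(row["set_ID_x"], row["set_ID_y"], row["similarity"])
```
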
## Performance Considerations

### Memory Usage
- The tool loads entire collections into memory
- For very large datasets, consider splitting into smaller batches
- Memory usage scales with vocabulary size and collection size

### Processing Time
- Self all-pairs: Processing time depends on collection size and similarity threshold
- Cross-collection: Indexing time plus query time for each set in the second collection
- Lower similarity thresholds can significantly increase processing time and output size

### Optimization Tips
- Use higher similarity thresholds to reduce computation time
- For cross-collection processing, put the larger collection as the first input file (it becomes the index)
- Use sampling (`--sample-k`) for initial exploration of large query collections
- Consider the trade-offs between different similarity functions based on your data characteristics
## Error Handling

The command-line tool will exit with an error message for:

- Invalid similarity function names
- Similarity thresholds outside the [0, 1] range
- Missing or unreadable input files
- Invalid input file formats
- Insufficient memory for large datasets

Common error scenarios:
- Duplicate (SetID, Token) pairs in input files
- Mixed tuple formats within the same file
- Insufficient disk space for output files