# Command Line Tool

The SetSimilaritySearch package includes a command-line script `all_pairs.py` for batch processing of set similarity operations. This tool is useful for processing large datasets stored in files without writing custom Python code.

## Capabilities

### All-Pairs Command Line Interface

Processes set similarity operations from the command line, with file input and output support.

```bash { .api }
all_pairs.py --input-sets FILE [FILE] \
    --output-pairs OUTPUT \
    --similarity-func FUNC \
    --similarity-threshold THRESHOLD \
    [--reversed-tuple BOOL] \
    [--sample-k INT]

# Parameters:
# --input-sets: Input file(s) with SetID-Token pairs
#   - One file: Computes all-pairs within the collection (self-join)
#   - Two files: Computes cross-collection pairs (join between collections)
# --output-pairs: Output CSV file path for results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Similarity threshold (float between 0 and 1)
# --reversed-tuple: Whether input format is "Token SetID" instead of "SetID Token" (default: false)
# --sample-k: Number of sets to sample from second file for queries (default: use all sets)
```
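
For reference, the CLI performs the same all-pairs computation that the package exposes through its Python API. The following is a minimal in-process sketch, assuming the package's `all_pairs` function; the sets and parameter values mirror the CLI options above:

```python
# Minimal sketch (assumes the SetSimilaritySearch Python API): run the same
# all-pairs computation the CLI performs, on a small in-memory collection.
from SetSimilaritySearch import all_pairs

# Sets are given as lists of tokens; list positions 0, 1, 2 stand in for SetIDs.
sets = [
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
    ["apple", "elderberry"],
]

# Equivalent to --similarity-func jaccard --similarity-threshold 0.2
pairs = all_pairs(sets, similarity_func_name="jaccard",
                  similarity_threshold=0.2)
for x, y, similarity in pairs:
    print(x, y, similarity)
```
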
## Usage Examples

### Self All-Pairs Processing

Find all similar pairs within a single collection:

```bash
# Input file format (example_sets.txt):
# Each line: SetID Token
# Lines starting with # are ignored
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

# Run all-pairs search
all_pairs.py --input-sets example_sets.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.2
```

Output CSV format:
```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
2,1,3,3,0.500
3,1,2,3,0.250
```
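
The reported values can be checked by hand with plain Python sets; this sanity check uses the three example sets from above:

```python
# Verify the example similarities with ordinary set arithmetic.
s1 = {"apple", "banana", "cherry"}
s2 = {"banana", "cherry", "date"}
s3 = {"apple", "elderberry"}

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(jaccard(s2, s1))  # 2 / 4 = 0.5  -> reported as 0.500
print(jaccard(s3, s1))  # 1 / 4 = 0.25 -> reported as 0.250
print(jaccard(s3, s2))  # 0 / 5 = 0.0  -> below the 0.2 threshold, not reported
```
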
### Cross-Collection Processing

Find similar pairs between two different collections:

```bash
# Collection 1 (documents.txt)
doc1 word1
doc1 word2
doc1 word3
doc2 word2
doc2 word4

# Collection 2 (queries.txt)
query1 word1
query1 word2
query2 word3
query2 word4

# Find cross-collection similarities
all_pairs.py --input-sets documents.txt queries.txt \
    --output-pairs cross_results.csv \
    --similarity-func cosine \
    --similarity-threshold 0.1
```
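
Cross-collection mode indexes the first collection and then queries it with each set from the second. Below is a rough in-process equivalent, assuming the package's `SearchIndex` class and its `query` method; identifiers and data are taken from the example above:

```python
# Sketch of cross-collection search (assumes the SetSimilaritySearch
# SearchIndex API): index the first collection, query with the second.
from SetSimilaritySearch import SearchIndex

documents = {
    "doc1": ["word1", "word2", "word3"],
    "doc2": ["word2", "word4"],
}
queries = {
    "query1": ["word1", "word2"],
    "query2": ["word3", "word4"],
}

doc_ids = list(documents)
index = SearchIndex([documents[i] for i in doc_ids],
                    similarity_func_name="cosine",
                    similarity_threshold=0.1)

for query_id, tokens in queries.items():
    # query() yields (position, similarity) pairs above the threshold
    for position, similarity in index.query(tokens):
        print(query_id, doc_ids[position], similarity)
```
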
### Large Dataset Processing with Sampling

Process large datasets efficiently by sampling queries:

```bash
# Process only 1000 sampled queries from the second collection
all_pairs.py --input-sets large_index.txt large_queries.txt \
    --output-pairs sampled_results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.5 \
    --sample-k 1000
```
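
The tool handles the sampling itself, so no extra preparation is required. Purely for illustration, the sketch below shows the general idea of sampling query sets before running them; the file name and column order are the hypothetical ones from the example:

```python
# Illustration only: load "SetID Token" lines and pick a random sample of
# up to 1000 set IDs to use as queries, mimicking the effect of --sample-k.
import random
from collections import defaultdict

sets = defaultdict(set)
with open("large_queries.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        set_id, token = line.split(None, 1)
        sets[set_id].add(token)

sampled_ids = random.sample(sorted(sets), k=min(1000, len(sets)))
```
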
### Reversed Tuple Format

Handle input files where tokens come before set IDs:

```bash
# Input format: Token SetID (instead of SetID Token)
apple doc1
banana doc1
cherry doc1
banana doc2

all_pairs.py --input-sets reversed_format.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.2 \
    --reversed-tuple true
```
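
If you would rather normalize the file once instead of passing `--reversed-tuple`, a small (hypothetical) conversion script can swap the columns:

```python
# Hypothetical helper: rewrite "Token SetID" lines as "SetID Token" lines.
with open("reversed_format.txt") as src, open("standard_format.txt", "w") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        token, set_id = line.split(None, 1)
        dst.write(f"{set_id} {token}\n")
```
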
## Input File Format

The input files must follow a specific format:

```
# Comments start with # and are ignored
# Each line contains exactly two space/tab-separated values
# Format: SetID Token (or Token SetID if --reversed-tuple is used)
# Every line must be unique

1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
```

Requirements:
- Each line represents one element belonging to one set
- SetID and Token are separated by whitespace (space or tab)
- All (SetID, Token) pairs must be unique
- Lines starting with `#` are treated as comments and ignored
- Empty lines are ignored
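
For reference, a file in this format is straightforward to generate from an in-memory mapping of sets; the sketch below uses the same made-up data as the earlier example:

```python
# Sketch: write a "SetID Token" input file from a dict of sets.
sets = {
    "1": {"apple", "banana", "cherry"},
    "2": {"banana", "cherry", "date"},
    "3": {"apple", "elderberry"},
}
with open("example_sets.txt", "w") as f:
    f.write("# Format: SetID Token\n")
    for set_id, tokens in sets.items():
        for token in sorted(tokens):
            f.write(f"{set_id} {token}\n")
```
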
## Output Format

The command-line tool outputs results in CSV format with the following columns:

| Column | Description |
|--------|-------------|
| `set_ID_x` | ID of the first set in the pair |
| `set_ID_y` | ID of the second set in the pair |
| `set_size_x` | Number of elements in the first set |
| `set_size_y` | Number of elements in the second set |
| `similarity` | Computed similarity value |

Example output:
```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.500
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.333
```
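
Because the output is plain CSV, it can be post-processed with standard tooling; for example, with Python's built-in `csv` module (the file name and cutoff below are illustrative):

```python
# Read the output CSV and keep only pairs above a stricter cutoff.
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if float(row["similarity"]) >= 0.4:
            print(row["set_ID_x"], row["set_ID_y"], row["similarity"])
```
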
## Performance Considerations

### Memory Usage
- The tool loads entire collections into memory
- For very large datasets, consider splitting into smaller batches
- Memory usage scales with vocabulary size and collection size

### Processing Time
- Self all-pairs: Processing time depends on collection size and similarity threshold
- Cross-collection: Indexing time plus query time for each set in the second collection
- Lower similarity thresholds can significantly increase processing time and output size

### Optimization Tips
- Use higher similarity thresholds to reduce computation time
- For cross-collection processing, put the larger collection as the first input file (it becomes the index)
- Use sampling (`--sample-k`) for initial exploration of large query collections
- Consider the trade-offs between different similarity functions based on your data characteristics
## Error Handling

The command-line tool will exit with an error message for:

- Invalid similarity function names
- Similarity thresholds outside the [0, 1] range
- Missing or unreadable input files
- Invalid input file formats
- Insufficient memory for large datasets

Common error scenarios:
- Duplicate (SetID, Token) pairs in input files
- Mixed tuple formats within the same file
- Insufficient disk space for output files