# Command Line Tool

The SetSimilaritySearch package includes a command-line script `all_pairs.py` for batch processing of set similarity operations. This tool is useful for processing large datasets stored in files without writing custom Python code.

## Capabilities

### All-Pairs Command Line Interface

Process set similarity operations from the command line, with file input and output support.

```bash { .api }
all_pairs.py --input-sets FILE [FILE] \
    --output-pairs OUTPUT \
    --similarity-func FUNC \
    --similarity-threshold THRESHOLD \
    [--reversed-tuple BOOL] \
    [--sample-k INT]

# Parameters:
# --input-sets: Input file(s) with SetID-Token pairs
#   - One file: Computes all-pairs within the collection (self-join)
#   - Two files: Computes cross-collection pairs (join between collections)
# --output-pairs: Output CSV file path for results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Similarity threshold (float between 0 and 1)
# --reversed-tuple: Whether the input format is "Token SetID" instead of "SetID Token" (default: false)
# --sample-k: Number of sets to sample from the second file for queries (default: use all sets)
```

## Usage Examples

### Self All-Pairs Processing

Find all similar pairs within a single collection:

```bash
# Input file format (example_sets.txt):
# Each line: SetID Token
# Lines starting with # are ignored
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

# Run all-pairs search
all_pairs.py --input-sets example_sets.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.3
```
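
The block above shows the input data and the command together for illustration. A runnable version of the same example, assuming `all_pairs.py` is available on your `PATH`, might look like this:

```bash
# Write the example input file, then run the self all-pairs search.
cat > example_sets.txt <<'EOF'
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
EOF

all_pairs.py --input-sets example_sets.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.3

# Inspect the results
cat results.csv
```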

Output CSV format:

```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
2,1,3,3,0.500
```

### Cross-Collection Processing

Find similar pairs between two different collections:

```bash
# Collection 1 (documents.txt)
doc1 word1
doc1 word2
doc1 word3
doc2 word2
doc2 word4

# Collection 2 (queries.txt)
query1 word1
query1 word2
query2 word3
query2 word4

# Find cross-collection similarities
all_pairs.py --input-sets documents.txt queries.txt \
    --output-pairs cross_results.csv \
    --similarity-func cosine \
    --similarity-threshold 0.1
```
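
If your raw data has one document per line (an ID followed by its tokens), you can expand it into SetID-Token pairs before running the tool. A minimal sketch, assuming whitespace-tokenized text in a hypothetical `raw_documents.txt`:

```bash
# Emit one "SetID Token" pair per token ($1 is the document ID),
# then deduplicate so every (SetID, Token) pair appears only once.
awk '{ for (i = 2; i <= NF; i++) print $1, $i }' raw_documents.txt | sort -u > documents.txt
```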

### Large Dataset Processing with Sampling

Process large datasets efficiently by sampling queries:

```bash
# Process only 1000 sampled queries from the second collection
all_pairs.py --input-sets large_index.txt large_queries.txt \
    --output-pairs sampled_results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.5 \
    --sample-k 1000
```
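
To pick a sensible `--sample-k`, it helps to know how many distinct sets the query file contains. A quick count, assuming the default `SetID Token` format:

```bash
# Count distinct SetIDs, skipping comment and blank lines.
awk '!/^#/ && NF { print $1 }' large_queries.txt | sort -u | wc -l
```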

### Reversed Tuple Format

Handle input files where tokens come before set IDs:

```bash
# Input format: Token SetID (instead of SetID Token)
apple doc1
banana doc1
cherry doc1
banana doc2

all_pairs.py --input-sets reversed_format.txt \
    --output-pairs results.csv \
    --similarity-func jaccard \
    --similarity-threshold 0.2 \
    --reversed-tuple true
```
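
If you would rather normalize the file than pass `--reversed-tuple`, swapping the two columns up front works just as well; a small sketch assuming whitespace-separated input and a hypothetical `normal_format.txt` output file:

```bash
# Swap "Token SetID" lines into "SetID Token" order, keeping comment lines as-is.
awk '/^#/ { print; next } NF == 2 { print $2, $1 }' reversed_format.txt > normal_format.txt
```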

## Input File Format

The input files must follow a specific format:

```
# Comments start with # and are ignored
# Each line contains exactly two space/tab-separated values
# Format: SetID Token (or Token SetID if --reversed-tuple is used)
# Every line must be unique

1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
```

Requirements:

- Each line represents one element belonging to one set
- SetID and Token are separated by whitespace (space or tab)
- All (SetID, Token) pairs must be unique
- Lines starting with `#` are treated as comments and ignored
- Empty lines are ignored
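
These requirements are easy to sanity-check with standard shell tools before running the script; a rough sketch, assuming the default `SetID Token` order:

```bash
# Report lines that do not have exactly two whitespace-separated fields.
awk '!/^#/ && NF && NF != 2 { print "bad line " NR ": " $0 }' example_sets.txt

# Report duplicate (SetID, Token) pairs.
grep -v '^#' example_sets.txt | awk 'NF' | sort | uniq -d
```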

## Output Format

The command-line tool outputs results in CSV format with the following columns:

| Column | Description |
|--------|-------------|
| `set_ID_x` | ID of the first set in the pair |
| `set_ID_y` | ID of the second set in the pair |
| `set_size_x` | Number of elements in the first set |
| `set_size_y` | Number of elements in the second set |
| `similarity` | Computed similarity value |

Example output:

```csv
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.500
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.333
```
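
Because the output is plain CSV, it is straightforward to post-process with standard tools. For example, to list the highest-similarity pairs first (skipping the header row):

```bash
# Sort result rows by the similarity column (field 5), highest first.
tail -n +2 results.csv | sort -t, -k5,5 -nr | head -n 10
```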

## Performance Considerations

### Memory Usage

- The tool loads entire collections into memory
- For very large datasets, consider splitting the work into smaller batches (see the sketch below)
- Memory usage scales with vocabulary size and collection size
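
The batching mentioned above works best when the query collection is split on SetID boundaries, so that no set is cut in half. A rough sketch, assuming the default `SetID Token` format and hypothetical `query_batch_*` file names:

```bash
# Split the query collection into batches of 1000 sets each, keeping all
# lines of a given SetID in the same batch file.
awk '!/^#/ && NF {
    if (!($1 in b)) { b[$1] = int(nsets / 1000); nsets++ }
    print > ("query_batch_" b[$1] ".txt")
}' large_queries.txt

# Run the tool once per batch against the same index collection.
for f in query_batch_*.txt; do
    all_pairs.py --input-sets large_index.txt "$f" \
        --output-pairs "results_${f%.txt}.csv" \
        --similarity-func jaccard \
        --similarity-threshold 0.5
done
```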

### Processing Time

- Self all-pairs: processing time depends on the collection size and the similarity threshold
- Cross-collection: indexing time for the first collection plus query time for each set in the second collection
- Lower similarity thresholds may significantly increase processing time and output size

### Optimization Tips

- Use higher similarity thresholds to reduce computation time (see the timing sketch below)
- For cross-collection processing, put the larger collection as the first input file (the index)
- Use sampling (`--sample-k`) for initial exploration of large query collections
- Consider the trade-offs between different similarity functions based on your data characteristics
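
To see how the threshold behaves on your own data (first tip above), a simple timing loop is usually enough; a sketch reusing the earlier `example_sets.txt`:

```bash
# Compare run time and number of result rows across thresholds.
for t in 0.3 0.5 0.7 0.9; do
    echo "threshold=$t"
    time all_pairs.py --input-sets example_sets.txt \
        --output-pairs "results_${t}.csv" \
        --similarity-func jaccard \
        --similarity-threshold "$t"
    wc -l < "results_${t}.csv"
done
```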

## Error Handling

The command-line tool exits with an error message for:

- Invalid similarity function names
- Similarity thresholds outside the [0, 1] range
- Missing or unreadable input files
- Invalid input file formats
- Insufficient memory for large datasets

Common error scenarios:

- Duplicate (SetID, Token) pairs in input files (see the deduplication sketch below)
- Mixed tuple formats within the same file
- Insufficient disk space for output files
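
Duplicate pairs, in particular, can be cleaned up before running the tool; a one-liner assuming whitespace-separated pairs and a hypothetical `deduplicated_sets.txt` output:

```bash
# Keep the first occurrence of each line; comment lines are always kept.
awk '/^#/ || !seen[$0]++' example_sets.txt > deduplicated_sets.txt
```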