
tessl/pypi-set-similarity-search

A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.


Command Line Tool

The SetSimilaritySearch package includes a command-line script all_pairs.py for batch processing of set similarity operations. This tool is useful for processing large datasets stored in files without writing custom Python code.

Capabilities

All-Pairs Command Line Interface

Run set similarity searches from the command line, reading sets from input files and writing the matching pairs to a CSV file.

all_pairs.py --input-sets FILE [FILE] \
             --output-pairs OUTPUT \
             --similarity-func FUNC \
             --similarity-threshold THRESHOLD \
             [--reversed-tuple BOOL] \
             [--sample-k INT]

# Parameters:
# --input-sets: Input file(s) with SetID-Token pairs
#   - One file: Computes all-pairs within the collection (self-join)
#   - Two files: Computes cross-collection pairs (join between collections)
# --output-pairs: Output CSV file path for results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Similarity threshold (float between 0 and 1)
# --reversed-tuple: Whether input format is "Token SetID" instead of "SetID Token" (default: false)
# --sample-k: Number of sets to sample from second file for queries (default: use all sets)
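For orientation, the four names accepted by --similarity-func correspond to standard set-overlap measures. The sketch below uses the usual textbook definitions; the Python function names are illustrative, and the choice to normalize containment by the first argument is an assumption, not something this page confirms about the tool.

```python
import math

# Illustrative definitions only; all functions assume non-empty sets.

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def cosine(a: set, b: set) -> float:
    """Intersection size divided by the geometric mean of the set sizes."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def containment(a: set, b: set) -> float:
    """Fraction of `a` that also appears in `b` (assumed normalization)."""
    return len(a & b) / len(a)

def containment_min(a: set, b: set) -> float:
    """Intersection size divided by the size of the larger set."""
    return len(a & b) / max(len(a), len(b))

s1 = {"apple", "banana", "cherry"}
s2 = {"banana", "cherry", "date"}
print(jaccard(s1, s2))  # 0.5 (2 shared tokens, 4 distinct tokens)
```

Under these definitions only containment is asymmetric: swapping the arguments can change its value, while the other three give the same result in either order.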

Usage Examples

Self All-Pairs Processing

Find all similar pairs within a single collection:

# Input file format (example_sets.txt):
# Each line: SetID Token
# Lines starting with # are ignored
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

# Run all-pairs search
all_pairs.py --input-sets example_sets.txt \
             --output-pairs results.csv \
             --similarity-func jaccard \
             --similarity-threshold 0.3

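Output CSV format (only pairs whose similarity reaches the threshold are written):

set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
2,1,3,3,0.500

Sets 3 and 1 share one token out of four distinct tokens, a Jaccard similarity of 0.25, so that pair falls below the 0.3 threshold and is omitted.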
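The threshold filter can be checked by hand for the three example sets. This is a brute-force sketch, not the tool's indexed algorithm; it assumes the usual Jaccard definition and that a pair is kept when its similarity is at least the threshold.

```python
from itertools import combinations

# The example collection from example_sets.txt, keyed by SetID.
sets = {
    1: {"apple", "banana", "cherry"},
    2: {"banana", "cherry", "date"},
    3: {"apple", "elderberry"},
}
threshold = 0.3

def jaccard(a: set, b: set) -> float:
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Score every unordered pair of sets, then keep those at or above the threshold.
scores = {
    (x, y): jaccard(sets[x], sets[y])
    for x, y in combinations(sorted(sets), 2)
}
passing = {pair: s for pair, s in scores.items() if s >= threshold}

for pair, s in sorted(scores.items()):
    print(pair, f"{s:.3f}", "kept" if s >= threshold else "dropped")
# (1, 2) 0.500 kept
# (1, 3) 0.250 dropped
# (2, 3) 0.000 dropped
```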

Cross-Collection Processing

Find similar pairs between two different collections:

# Collection 1 (documents.txt)
doc1 word1
doc1 word2
doc1 word3
doc2 word2
doc2 word4

# Collection 2 (queries.txt)  
query1 word1
query1 word2
query2 word3
query2 word4

# Find cross-collection similarities
all_pairs.py --input-sets documents.txt queries.txt \
             --output-pairs cross_results.csv \
             --similarity-func cosine \
             --similarity-threshold 0.1
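To see which pairs this cross-collection example should produce, the cosine similarities can be computed directly. The sketch below is a brute-force illustration with the usual set-cosine definition assumed; it does not reproduce the tool's index or its output column order.

```python
import math

documents = {
    "doc1": {"word1", "word2", "word3"},
    "doc2": {"word2", "word4"},
}
queries = {
    "query1": {"word1", "word2"},
    "query2": {"word3", "word4"},
}
threshold = 0.1

def cosine(a: set, b: set) -> float:
    return len(a & b) / math.sqrt(len(a) * len(b))

# Compare every query set against every document set.
cross = {
    (q, d): cosine(q_tokens, d_tokens)
    for q, q_tokens in queries.items()
    for d, d_tokens in documents.items()
}
cross_pairs = {pair: s for pair, s in cross.items() if s >= threshold}

for (q, d), s in sorted(cross_pairs.items()):
    print(q, d, f"{s:.3f}")
# query1 doc1 0.816
# query1 doc2 0.500
# query2 doc1 0.408
# query2 doc2 0.500
```

Every query shares at least one token with every document here, so all four pairs clear the 0.1 threshold.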

Large Dataset Processing with Sampling

Process large datasets efficiently by sampling queries:

# Process only 1000 sampled queries from the second collection
all_pairs.py --input-sets large_index.txt large_queries.txt \
             --output-pairs sampled_results.csv \
             --similarity-func jaccard \
             --similarity-threshold 0.5 \
             --sample-k 1000
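Conceptually, sampling picks a subset of the query sets before any searching happens. The helper below is a hypothetical sketch of that idea; it assumes uniform sampling without replacement and that a non-positive k means "use all sets", neither of which this page confirms for the tool itself.

```python
import random

def sample_queries(query_ids, k, seed=None):
    """Return k query IDs chosen uniformly without replacement.

    Falls back to all IDs when k is non-positive or not smaller than
    the number of available queries (assumed behaviour).
    """
    ids = list(query_ids)
    if k is None or k <= 0 or k >= len(ids):
        return ids
    # A dedicated Random instance keeps the sample reproducible for a given seed.
    return random.Random(seed).sample(ids, k)

picked = sample_queries(range(10_000), 1000, seed=0)
print(len(picked))  # 1000
```

Fixing the seed makes exploratory runs repeatable, which helps when comparing thresholds on the same sample.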

Reversed Tuple Format

Handle input files where tokens come before set IDs:

# Input format: Token SetID (instead of SetID Token)
apple doc1
banana doc1
cherry doc1
banana doc2

all_pairs.py --input-sets reversed_format.txt \
             --output-pairs results.csv \
             --similarity-func jaccard \
             --similarity-threshold 0.2 \
             --reversed-tuple true

Input File Format

The input files must follow a specific format:

# Comments start with # and are ignored
# Each line contains exactly two space/tab-separated values
# Format: SetID Token (or Token SetID if --reversed-tuple is used)
# Every line must be unique

1 apple
1 banana  
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

Requirements:

  • Each line represents one element belonging to one set
  • SetID and Token are separated by whitespace (space or tab)
  • All (SetID, Token) pairs must be unique
  • Lines starting with # are treated as comments and ignored
  • Empty lines are ignored
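The requirements above can be expressed as a small validator, which is handy for checking a file before a long run. This is a sketch of the stated rules, not the tool's own parser; `parse_sets` and its `reversed_tuple` parameter are hypothetical names.

```python
from collections import defaultdict

def parse_sets(lines, reversed_tuple=False):
    """Group tokens by SetID, enforcing the input format rules."""
    sets = defaultdict(set)
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        fields = line.split()  # splits on spaces or tabs
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields, got {len(fields)}")
        # Swap the fields when the file lists "Token SetID".
        set_id, token = fields[::-1] if reversed_tuple else fields
        if token in sets[set_id]:
            raise ValueError(f"line {lineno}: duplicate pair ({set_id}, {token})")
        sets[set_id].add(token)
    return dict(sets)

example = ["# comment", "", "1 apple", "1 banana", "2\tbanana"]
print(parse_sets(example))
```

To check a real file, pass an open file object, for example `parse_sets(open("example_sets.txt"))`.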

Output Format

The command-line tool outputs results in CSV format with the following columns:

| Column | Description |
|--------|-------------|
| set_ID_x | ID of the first set in the pair |
| set_ID_y | ID of the second set in the pair |
| set_size_x | Number of elements in the first set |
| set_size_y | Number of elements in the second set |
| similarity | Computed similarity value |

Example output (illustrative Jaccard values):

set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.400
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.250
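A results file with these columns can be loaded with the standard csv module. The sketch below assumes the header row shown above is present; `read_pairs` and its `min_similarity` parameter are hypothetical names used only for illustration.

```python
import csv
import io

def read_pairs(f, min_similarity=0.0):
    """Read result rows and return them sorted by descending similarity."""
    rows = []
    for row in csv.DictReader(f):
        sim = float(row["similarity"])
        if sim >= min_similarity:
            rows.append((
                row["set_ID_x"],
                row["set_ID_y"],
                int(row["set_size_x"]),
                int(row["set_size_y"]),
                sim,
            ))
    rows.sort(key=lambda r: r[4], reverse=True)
    return rows

# In-memory stand-in for a results file.
sample = io.StringIO(
    "set_ID_x,set_ID_y,set_size_x,set_size_y,similarity\n"
    "2,1,3,3,0.500\n"
)
pairs = read_pairs(sample)
print(pairs)  # [('2', '1', 3, 3, 0.5)]
```

Set IDs are kept as strings because the input format allows non-numeric IDs such as doc1.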

Performance Considerations

Memory Usage

  • The tool loads entire collections into memory
  • For very large datasets, consider splitting into smaller batches
  • Memory usage scales with vocabulary size and collection size

Processing Time

  • Self all-pairs: Processing time depends on collection size and similarity threshold
  • Cross-collection: Indexing time + query time for each set in second collection
  • Lower similarity thresholds may significantly increase processing time and output size

Optimization Tips

  • Use higher similarity thresholds to reduce computation time
  • For cross-collection processing, put the larger collection as the first input file (index)
  • Use sampling (--sample-k) for initial exploration of large query collections
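  • Choose the similarity function to match your data: jaccard and cosine are symmetric, while containment normalizes by the size of one set and so suits cases where one set is much smaller than the other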

Error Handling

The command-line tool will exit with error messages for:

  • Invalid similarity function names
  • Similarity thresholds outside [0, 1] range
  • Missing or unreadable input files
  • Invalid input file formats
  • Insufficient memory for large datasets

Common error scenarios:

  • Duplicate (SetID, Token) pairs in input files
  • Mixed tuple formats within the same file
  • Insufficient disk space for output files
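Several of these conditions can be caught before launching a long run. The function below is a hypothetical pre-flight check written for this page, not part of the tool; the accepted function names and the [0, 1] threshold range are taken from the parameter list above.

```python
import os

SIMILARITY_FUNCS = {"jaccard", "cosine", "containment", "containment_min"}

def check_arguments(input_paths, similarity_func, threshold):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not 1 <= len(input_paths) <= 2:
        problems.append("expected one or two input files")
    for path in input_paths:
        if not os.path.isfile(path) or not os.access(path, os.R_OK):
            problems.append(f"missing or unreadable input file: {path}")
    if similarity_func not in SIMILARITY_FUNCS:
        problems.append(f"unknown similarity function: {similarity_func}")
    if not 0.0 <= threshold <= 1.0:
        problems.append(f"threshold outside [0, 1]: {threshold}")
    return problems

for problem in check_arguments(["/no/such/file.txt"], "dice", 1.5):
    print(problem)
```

Collecting all problems instead of stopping at the first lets a single run report everything that needs fixing.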

Install with Tessl CLI

npx tessl i tessl/pypi-set-similarity-search
