A Python library providing efficient algorithms for set similarity search operations with Jaccard, Cosine, and Containment similarity functions.
The SetSimilaritySearch package includes a command-line script all_pairs.py for batch processing of set similarity operations. This tool is useful for processing large datasets stored in files without writing custom Python code.
Process set similarity operations from the command line, with file input and output support.
all_pairs.py --input-sets FILE [FILE] \
--output-pairs OUTPUT \
--similarity-func FUNC \
--similarity-threshold THRESHOLD \
[--reversed-tuple BOOL] \
[--sample-k INT]
# Parameters:
# --input-sets: Input file(s) with SetID-Token pairs
# - One file: Computes all-pairs within the collection (self-join)
# - Two files: Computes cross-collection pairs (join between collections)
# --output-pairs: Output CSV file path for results
# --similarity-func: Similarity function (jaccard, cosine, containment, containment_min)
# --similarity-threshold: Similarity threshold (float between 0 and 1)
# --reversed-tuple: Whether input format is "Token SetID" instead of "SetID Token" (default: false)
# --sample-k: Number of sets to sample from second file for queries (default: use all sets)

Find all similar pairs within a single collection:
# Input file format (example_sets.txt):
# Each line: SetID Token
# Lines starting with # are ignored
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
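
To see which pairs the search should report for this input, the Jaccard similarity of every pair can be computed by brute force. The following is a minimal sketch in plain Python, not the library's algorithm; `parse_sets` and `jaccard` are hypothetical helper names, and the parsing rules (two whitespace-separated fields, `#` comments skipped) are taken from the format comments above.

```python
from itertools import combinations

EXAMPLE = """\
# Each line: SetID Token
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry
"""


def parse_sets(text):
    """Group 'SetID Token' lines into a dict mapping set ID to a set of tokens."""
    sets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        set_id, token = line.split()
        sets.setdefault(set_id, set()).add(token)
    return sets


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


sets = parse_sets(EXAMPLE)

# Score every unordered pair of set IDs.
scores = {
    (x, y): jaccard(sets[x], sets[y])
    for x, y in combinations(sorted(sets), 2)
}
for pair, score in scores.items():
    print(pair, round(score, 3))

# Keep only the pairs that meet a threshold of 0.3.
matches = {pair: s for pair, s in scores.items() if s >= 0.3}
```

At a threshold of 0.3, only sets 1 and 2 qualify (similarity 0.5). Sets 1 and 3 share one of four distinct tokens (0.25) and fall below it.
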
# Run all-pairs search
all_pairs.py --input-sets example_sets.txt \
--output-pairs results.csv \
--similarity-func jaccard \
--similarity-threshold 0.3

Output CSV format:
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
2,1,3,3,0.500

Find similar pairs between two different collections:
# Collection 1 (documents.txt)
doc1 word1
doc1 word2
doc1 word3
doc2 word2
doc2 word4
# Collection 2 (queries.txt)
query1 word1
query1 word2
query2 word3
query2 word4
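
For sets, cosine similarity is commonly defined as |A ∩ B| / sqrt(|A| · |B|). Assuming that definition applies here, the scores for the two example collections can be worked out directly. The sketch below does so in plain Python; the dictionaries simply restate the example files, and the brute-force loop is for illustration only.

```python
from math import sqrt

documents = {
    "doc1": {"word1", "word2", "word3"},
    "doc2": {"word2", "word4"},
}
queries = {
    "query1": {"word1", "word2"},
    "query2": {"word3", "word4"},
}


def cosine(a, b):
    """Set cosine similarity: |A ∩ B| / sqrt(|A| * |B|)."""
    return len(a & b) / sqrt(len(a) * len(b))


threshold = 0.1

# Compare every query set against every document set.
results = [
    (q_id, d_id, round(cosine(q, d), 3))
    for q_id, q in queries.items()
    for d_id, d in documents.items()
    if cosine(q, d) >= threshold
]
for row in results:
    print(row)
```

All four query/document combinations share at least one token, so each of them clears the 0.1 threshold under this definition.
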
# Find cross-collection similarities
all_pairs.py --input-sets documents.txt queries.txt \
--output-pairs cross_results.csv \
--similarity-func cosine \
--similarity-threshold 0.1

Process large datasets efficiently by sampling queries:
# Process only 1000 sampled queries from the second collection
all_pairs.py --input-sets large_index.txt large_queries.txt \
--output-pairs sampled_results.csv \
--similarity-func jaccard \
--similarity-threshold 0.5 \
--sample-k 1000

Handle input files where tokens come before set IDs:
# Input format: Token SetID (instead of SetID Token)
apple doc1
banana doc1
cherry doc1
banana doc2
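
Reading this layout only requires swapping the two fields on each line. A minimal Python sketch follows; `parse_pairs` and its `reversed_tuple` argument are hypothetical names chosen to mirror the flag, not the script's internals.

```python
def parse_pairs(lines, reversed_tuple=False):
    """Group two-column lines into {set_id: set_of_tokens}.

    With reversed_tuple=True each line is read as 'Token SetID'
    instead of 'SetID Token'.
    """
    sets = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        first, second = line.split()
        set_id, token = (second, first) if reversed_tuple else (first, second)
        sets.setdefault(set_id, set()).add(token)
    return sets


lines = ["apple doc1", "banana doc1", "cherry doc1", "banana doc2"]
sets = parse_pairs(lines, reversed_tuple=True)
print(sets)
```
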
all_pairs.py --input-sets reversed_format.txt \
--output-pairs results.csv \
--similarity-func jaccard \
--similarity-threshold 0.2 \
--reversed-tuple true

The input files must follow a specific format:
# Comments start with # and are ignored
# Each line contains exactly two space/tab-separated values
# Format: SetID Token (or Token SetID if --reversed-tuple is used)
# Every line must be unique
1 apple
1 banana
1 cherry
2 banana
2 cherry
2 date
3 apple
3 elderberry

Requirements: lines starting with # are treated as comments and ignored.

The command-line tool outputs results in CSV format with the following columns:
| Column | Description |
|---|---|
| set_ID_x | ID of the first set in the pair |
| set_ID_y | ID of the second set in the pair |
| set_size_x | Number of elements in the first set |
| set_size_y | Number of elements in the second set |
| similarity | Computed similarity value |
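
Because the output is plain CSV with a header row, it can be loaded with Python's standard `csv` module. The sketch below reads sample text through `io.StringIO` so that it is self-contained; in practice the file passed to `--output-pairs` would be opened instead. The sample rows are illustrative, and converting the size and similarity columns to numbers is an assumption about how the values would be used.

```python
import csv
import io

SAMPLE = """\
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.500
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.333
"""

rows = []
for record in csv.DictReader(io.StringIO(SAMPLE)):
    rows.append({
        "set_ID_x": record["set_ID_x"],
        "set_ID_y": record["set_ID_y"],
        "set_size_x": int(record["set_size_x"]),
        "set_size_y": int(record["set_size_y"]),
        "similarity": float(record["similarity"]),
    })

# Show the most similar pairs first.
rows.sort(key=lambda r: r["similarity"], reverse=True)
for r in rows:
    print(r["set_ID_x"], r["set_ID_y"], r["similarity"])
```
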
Example output:
set_ID_x,set_ID_y,set_size_x,set_size_y,similarity
doc2,doc1,3,4,0.500
doc3,doc1,2,4,0.200
doc3,doc2,2,3,0.333

Use sampling (--sample-k) for initial exploration of large query collections.

The command-line tool will exit with error messages for:
Common error scenarios: