or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

bloom-filter.md count-min-sketch.md index.md serialization.md

tile.json

tessl/maven-org-apache-spark--spark-sketch_2-12

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:maven/org.apache.spark/spark-sketch_2.12@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-sketch_2-12@3.5.0

Spark Sketch

Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing. This library is part of Apache Spark but can be used independently for high-performance approximate computations.

Package Information

Package Name: spark-sketch_2.12
Package Type: Maven
Language: Java
Installation: org.apache.spark:spark-sketch_2.12:3.5.6

Core Imports

import org.apache.spark.util.sketch.BloomFilter;
import org.apache.spark.util.sketch.CountMinSketch;
import org.apache.spark.util.sketch.IncompatibleMergeException;

Basic Usage

import org.apache.spark.util.sketch.BloomFilter;
import org.apache.spark.util.sketch.CountMinSketch;

// Create and use a Bloom filter
BloomFilter bloomFilter = BloomFilter.create(1000, 0.03);
bloomFilter.put("example");
boolean mightContain = bloomFilter.mightContain("example"); // true
boolean definitelyNotContain = bloomFilter.mightContain("missing"); // false

// Create and use a Count-Min sketch
CountMinSketch sketch = CountMinSketch.create(0.01, 0.99, 42);
sketch.add("item1");
sketch.add("item2", 5);
long estimate = sketch.estimateCount("item1"); // returns 1
long estimateItem2 = sketch.estimateCount("item2"); // returns 5

Architecture

Spark Sketch implements two key probabilistic data structures:

Bloom Filters: Space-efficient probabilistic data structure for approximate set membership testing with configurable false positive rates
Count-Min Sketches: Probabilistic data structure for frequency estimation and cardinality counting with bounded error guarantees
Serialization Support: Both structures support binary serialization for distributed computing scenarios
Type Safety: Support for multiple data types (primitives, strings, byte arrays) with specialized methods for optimal performance
Merge Operations: Ability to combine multiple data structures for distributed aggregation

Capabilities

Bloom Filter Operations

Space-efficient approximate membership testing with configurable false positive probability. Ideal for duplicate detection and cache filtering in big data applications.

public static BloomFilter create(long expectedNumItems);
public static BloomFilter create(long expectedNumItems, double fpp);
public abstract boolean put(Object item);
public abstract boolean mightContain(Object item);
public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException;

Bloom Filters

Count-Min Sketch Operations

Probabilistic frequency estimation for streaming data with bounded error guarantees. Perfect for heavy hitters detection and approximate counting in large datasets.

public static CountMinSketch create(double eps, double confidence, int seed);
public static CountMinSketch create(int depth, int width, int seed);
public abstract void add(Object item);
public abstract void add(Object item, long count);
public abstract long estimateCount(Object item);
public abstract CountMinSketch mergeInPlace(CountMinSketch other) throws IncompatibleMergeException;

Count-Min Sketches

Serialization and I/O

Binary serialization support for distributed computing and persistent storage scenarios.

public abstract void writeTo(OutputStream out) throws IOException;
public static BloomFilter readFrom(InputStream in) throws IOException;
public static CountMinSketch readFrom(InputStream in) throws IOException;
public static CountMinSketch readFrom(byte[] bytes) throws IOException;

Serialization

Types

public class IncompatibleMergeException extends Exception {
    public IncompatibleMergeException(String message);
}

Supported Data Types

Both Bloom filters and Count-Min sketches support the following Java data types:

Byte, Short, Integer, Long
String
byte[] arrays
Generic Object with automatic type detection and conversion