Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing. This library is part of Apache Spark but can be used independently for high-performance approximate computations.

npx @tessl/cli install tessl/maven-org-apache-spark--spark-sketch_2-12@3.5.0
org.apache.spark:spark-sketch_2.12:3.5.6

import org.apache.spark.util.sketch.BloomFilter;
import org.apache.spark.util.sketch.CountMinSketch;
import org.apache.spark.util.sketch.IncompatibleMergeException;

// Create and use a Bloom filter
BloomFilter bloomFilter = BloomFilter.create(1000, 0.03);
bloomFilter.put("example");
boolean mightContain = bloomFilter.mightContain("example"); // true
boolean mightContainMissing = bloomFilter.mightContain("missing"); // very likely false; a false positive occurs with probability of about 0.03

// Create and use a Count-Min sketch
CountMinSketch sketch = CountMinSketch.create(0.01, 0.99, 42);
sketch.add("item1");
sketch.add("item2", 5);
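long estimate = sketch.estimateCount("item1"); // at least 1: estimates never undercount, but hash collisions can inflate them
long estimateItem2 = sketch.estimateCount("item2"); // at least 5

Spark Sketch implements two key probabilistic data structures: Bloom filters and Count-Min sketches.
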
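Bloom filters provide space-efficient approximate membership testing with a configurable false positive probability. A negative answer is always correct; a positive answer may be wrong at roughly the configured rate. They suit duplicate detection and cache filtering in big data applications.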
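To make the trade-off between `expectedNumItems`, `fpp`, and memory concrete, the sketch below applies the standard textbook Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The class `BloomSizing` and its method names are hypothetical helpers for illustration; whether the library rounds or sizes in exactly this way is an assumption, not something this page states.

```java
// Hypothetical helper, not part of spark-sketch: standard Bloom filter sizing math.
final class BloomSizing {
    private BloomSizing() {}

    /** Bits needed for n expected items at a target false positive probability p. */
    static long optimalNumOfBits(long expectedNumItems, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedNumItems * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions that minimizes the false positive rate for m bits and n items. */
    static int optimalNumOfHashFunctions(long expectedNumItems, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedNumItems * Math.log(2)));
    }

    /** Approximate false positive rate after n insertions: (1 - e^(-k n / m))^k. */
    static double expectedFpp(long numItems, long numBits, int numHashFunctions) {
        double fractionOfBitsSet = 1 - Math.exp(-(double) numHashFunctions * numItems / numBits);
        return Math.pow(fractionOfBitsSet, numHashFunctions);
    }

    public static void main(String[] args) {
        long bits = optimalNumOfBits(1000, 0.03);
        int hashes = optimalNumOfHashFunctions(1000, bits);
        System.out.printf("bits=%d hashes=%d approxFpp=%.4f%n",
                bits, hashes, expectedFpp(1000, bits, hashes));
    }
}
```

Under these formulas, halving the target false positive probability adds a fixed number of bits per item, which is why a Bloom filter stays far smaller than an exact set even at low error rates.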
public static BloomFilter create(long expectedNumItems);
public static BloomFilter create(long expectedNumItems, double fpp);
public abstract boolean put(Object item);
public abstract boolean mightContain(Object item);
public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException;

Count-Min sketches provide probabilistic frequency estimation for streaming data with bounded error guarantees. Estimates never fall below the true count but can exceed it. They suit heavy-hitter detection and approximate counting in large datasets.
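The one-sided error of a Count-Min sketch is easiest to see in a toy implementation. The sketch below is a minimal illustration only: `ToyCountMinSketch`, its hash mixing, and its use of `IllegalArgumentException` are assumptions made for this example and do not reflect the library's internals, which signal incompatible merges with `IncompatibleMergeException`.

```java
import java.util.Arrays;
import java.util.Random;

// Toy illustration, not the spark-sketch implementation.
final class ToyCountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;   // depth rows of width counters
    private final long[] rowSeeds;  // one hash seed per row

    ToyCountMinSketch(int depth, int width, int seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.rowSeeds = new long[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            rowSeeds[row] = random.nextLong();
        }
    }

    // Maps an item to one counter in the given row (simple illustrative mixing).
    private int bucket(Object item, int row) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L + rowSeeds[row];
        h ^= h >>> 32;
        return (int) Math.floorMod(h, (long) width);
    }

    void add(Object item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    // The minimum over all rows is the least-inflated counter for this item.
    long estimateCount(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    // Merging adds counters cell by cell; both sketches must share shape and seeds.
    void mergeInPlace(ToyCountMinSketch other) {
        if (other.depth != depth || other.width != width
                || !Arrays.equals(other.rowSeeds, rowSeeds)) {
            throw new IllegalArgumentException("Sketches have different shapes or seeds");
        }
        for (int row = 0; row < depth; row++) {
            for (int col = 0; col < width; col++) {
                table[row][col] += other.table[row][col];
            }
        }
    }
}
```

Every counter an item maps to is incremented on each `add`, so with non-negative counts the minimum across rows cannot fall below the item's true count; collisions with other items can only push it higher. The same reasoning explains why two sketches merge only when they share depth, width, and seed: otherwise their counters would not refer to the same buckets.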
public static CountMinSketch create(double eps, double confidence, int seed);
public static CountMinSketch create(int depth, int width, int seed);
public abstract void add(Object item);
public abstract void add(Object item, long count);
public abstract long estimateCount(Object item);
public abstract CountMinSketch mergeInPlace(CountMinSketch other) throws IncompatibleMergeException;

Both structures support binary serialization for distributed computing and persistent storage.
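The `writeTo(OutputStream)` / `readFrom(InputStream)` pairing follows a common Java pattern: an instance method writes a compact binary form and a static factory reads it back. The sketch below shows that pattern on a toy structure using only the JDK. `ToyBitTable`, its fields, and its version-prefixed byte layout are hypothetical and are not the library's wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Toy illustration of the writeTo/readFrom pattern; the layout here is made up.
final class ToyBitTable {
    private static final int FORMAT_VERSION = 1; // hypothetical version tag

    final int numHashFunctions;
    final long[] words;

    ToyBitTable(int numHashFunctions, long[] words) {
        this.numHashFunctions = numHashFunctions;
        this.words = words.clone();
    }

    // Writes version, hash count, word count, then the raw words.
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(FORMAT_VERSION);
        data.writeInt(numHashFunctions);
        data.writeInt(words.length);
        for (long word : words) {
            data.writeLong(word);
        }
        data.flush();
    }

    // Reads the fields back in the same order and rejects unknown versions.
    static ToyBitTable readFrom(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int version = data.readInt();
        if (version != FORMAT_VERSION) {
            throw new IOException("Unexpected format version: " + version);
        }
        int numHashFunctions = data.readInt();
        long[] words = new long[data.readInt()];
        for (int i = 0; i < words.length; i++) {
            words[i] = data.readLong();
        }
        return new ToyBitTable(numHashFunctions, words);
    }

    // Serializes to a byte array and back, as a worker-to-driver transfer might.
    static ToyBitTable roundTrip(ToyBitTable table) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            table.writeTo(buffer);
            return readFrom(new ByteArrayInputStream(buffer.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing a version tag first lets a reader fail fast on data it does not understand, which matters when serialized sketches outlive the process that produced them.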
public abstract void writeTo(OutputStream out) throws IOException;
public static BloomFilter readFrom(InputStream in) throws IOException;
public static CountMinSketch readFrom(InputStream in) throws IOException;
public static CountMinSketch readFrom(byte[] bytes) throws IOException;

public class IncompatibleMergeException extends Exception {
    public IncompatibleMergeException(String message);
}

Both Bloom filters and Count-Min sketches support the following Java data types:
- Byte, Short, Integer, Long
- String
- byte[] arrays
- Object with automatic type detection and conversion