Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing
npx @tessl/cli install tessl/maven-org-apache-spark--spark-sketch_2-12@3.5.00
# Spark Sketch
1
2
Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing. This library is part of Apache Spark but can be used independently for high-performance approximate computations.
3
4
## Package Information
5
6
- **Package Name**: spark-sketch_2.12
7
- **Package Type**: Maven
8
- **Language**: Java
9
- **Installation**: `org.apache.spark:spark-sketch_2.12:3.5.6`
10
11
## Core Imports
12
13
```java
14
import org.apache.spark.util.sketch.BloomFilter;
15
import org.apache.spark.util.sketch.CountMinSketch;
16
import org.apache.spark.util.sketch.IncompatibleMergeException;
17
```
18
19
## Basic Usage
20
21
```java
22
import org.apache.spark.util.sketch.BloomFilter;
23
import org.apache.spark.util.sketch.CountMinSketch;
24
25
// Create and use a Bloom filter
26
BloomFilter bloomFilter = BloomFilter.create(1000, 0.03);
27
bloomFilter.put("example");
28
boolean mightContain = bloomFilter.mightContain("example"); // true
29
boolean definitelyNotContain = bloomFilter.mightContain("missing"); // false
30
31
// Create and use a Count-Min sketch
32
CountMinSketch sketch = CountMinSketch.create(0.01, 0.99, 42);
33
sketch.add("item1");
34
sketch.add("item2", 5);
35
long estimate = sketch.estimateCount("item1"); // returns 1
36
long estimateItem2 = sketch.estimateCount("item2"); // returns 5
37
```
38
39
## Architecture
40
41
Spark Sketch implements two key probabilistic data structures:
42
43
- **Bloom Filters**: Space-efficient probabilistic data structure for approximate set membership testing with configurable false positive rates
44
- **Count-Min Sketches**: Probabilistic data structure for frequency estimation and cardinality counting with bounded error guarantees
45
- **Serialization Support**: Both structures support binary serialization for distributed computing scenarios
46
- **Type Safety**: Support for multiple data types (primitives, strings, byte arrays) with specialized methods for optimal performance
47
- **Merge Operations**: Ability to combine multiple data structures for distributed aggregation
48
49
## Capabilities
50
51
### Bloom Filter Operations
52
53
Space-efficient approximate membership testing with configurable false positive probability. Ideal for duplicate detection and cache filtering in big data applications.
54
55
```java { .api }
56
public static BloomFilter create(long expectedNumItems);
57
public static BloomFilter create(long expectedNumItems, double fpp);
58
public abstract boolean put(Object item);
59
public abstract boolean mightContain(Object item);
60
public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException;
61
```
62
63
[Bloom Filters](./bloom-filter.md)
64
65
### Count-Min Sketch Operations
66
67
Probabilistic frequency estimation for streaming data with bounded error guarantees. Perfect for heavy hitters detection and approximate counting in large datasets.
68
69
```java { .api }
70
public static CountMinSketch create(double eps, double confidence, int seed);
71
public static CountMinSketch create(int depth, int width, int seed);
72
public abstract void add(Object item);
73
public abstract void add(Object item, long count);
74
public abstract long estimateCount(Object item);
75
public abstract CountMinSketch mergeInPlace(CountMinSketch other) throws IncompatibleMergeException;
76
```
77
78
[Count-Min Sketches](./count-min-sketch.md)
79
80
### Serialization and I/O
81
82
Binary serialization support for distributed computing and persistent storage scenarios.
83
84
```java { .api }
85
public abstract void writeTo(OutputStream out) throws IOException;
86
public static BloomFilter readFrom(InputStream in) throws IOException;
87
public static CountMinSketch readFrom(InputStream in) throws IOException;
88
public static CountMinSketch readFrom(byte[] bytes) throws IOException;
89
```
90
91
[Serialization](./serialization.md)
92
93
## Types
94
95
```java { .api }
96
public class IncompatibleMergeException extends Exception {
97
public IncompatibleMergeException(String message);
98
}
99
```
100
101
## Supported Data Types
102
103
Both Bloom filters and Count-Min sketches support the following Java data types:
104
- `Byte`, `Short`, `Integer`, `Long`
105
- `String`
106
- `byte[]` arrays
107
- Generic `Object` with automatic type detection and conversion