or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/maven-org-apache-spark--spark-sketch_2-12

Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
mavenpkg:maven/org.apache.spark/spark-sketch_2.12@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-sketch_2-12@3.5.0

0

# Spark Sketch

1

2

Probabilistic data structures library providing space-efficient implementations of Bloom filters and Count-Min sketches for approximate membership testing and frequency estimation in large-scale data processing. This library is part of Apache Spark but can be used independently for high-performance approximate computations.

3

4

## Package Information

5

6

- **Package Name**: spark-sketch_2.12

7

- **Package Type**: Maven

8

- **Language**: Java

9

- **Installation**: `org.apache.spark:spark-sketch_2.12:3.5.6`

10

11

## Core Imports

12

13

```java

14

import org.apache.spark.util.sketch.BloomFilter;

15

import org.apache.spark.util.sketch.CountMinSketch;

16

import org.apache.spark.util.sketch.IncompatibleMergeException;

17

```

18

19

## Basic Usage

20

21

```java

22

import org.apache.spark.util.sketch.BloomFilter;

23

import org.apache.spark.util.sketch.CountMinSketch;

24

25

// Create and use a Bloom filter

26

BloomFilter bloomFilter = BloomFilter.create(1000, 0.03);

27

bloomFilter.put("example");

28

boolean mightContain = bloomFilter.mightContain("example"); // true

29

boolean definitelyNotContain = bloomFilter.mightContain("missing"); // false

30

31

// Create and use a Count-Min sketch

32

CountMinSketch sketch = CountMinSketch.create(0.01, 0.99, 42);

33

sketch.add("item1");

34

sketch.add("item2", 5);

35

long estimate = sketch.estimateCount("item1"); // returns 1

36

long estimateItem2 = sketch.estimateCount("item2"); // returns 5

37

```

38

39

## Architecture

40

41

Spark Sketch implements two key probabilistic data structures:

42

43

- **Bloom Filters**: Space-efficient probabilistic data structure for approximate set membership testing with configurable false positive rates

44

- **Count-Min Sketches**: Probabilistic data structure for frequency estimation and cardinality counting with bounded error guarantees

45

- **Serialization Support**: Both structures support binary serialization for distributed computing scenarios

46

- **Type Safety**: Support for multiple data types (primitives, strings, byte arrays) with specialized methods for optimal performance

47

- **Merge Operations**: Ability to combine multiple data structures for distributed aggregation

48

49

## Capabilities

50

51

### Bloom Filter Operations

52

53

Space-efficient approximate membership testing with configurable false positive probability. Ideal for duplicate detection and cache filtering in big data applications.

54

55

```java { .api }

56

public static BloomFilter create(long expectedNumItems);

57

public static BloomFilter create(long expectedNumItems, double fpp);

58

public abstract boolean put(Object item);

59

public abstract boolean mightContain(Object item);

60

public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException;

61

```

62

63

[Bloom Filters](./bloom-filter.md)

64

65

### Count-Min Sketch Operations

66

67

Probabilistic frequency estimation for streaming data with bounded error guarantees. Perfect for heavy hitters detection and approximate counting in large datasets.

68

69

```java { .api }

70

public static CountMinSketch create(double eps, double confidence, int seed);

71

public static CountMinSketch create(int depth, int width, int seed);

72

public abstract void add(Object item);

73

public abstract void add(Object item, long count);

74

public abstract long estimateCount(Object item);

75

public abstract CountMinSketch mergeInPlace(CountMinSketch other) throws IncompatibleMergeException;

76

```

77

78

[Count-Min Sketches](./count-min-sketch.md)

79

80

### Serialization and I/O

81

82

Binary serialization support for distributed computing and persistent storage scenarios.

83

84

```java { .api }

85

public abstract void writeTo(OutputStream out) throws IOException;

86

public static BloomFilter readFrom(InputStream in) throws IOException;

87

public static CountMinSketch readFrom(InputStream in) throws IOException;

88

public static CountMinSketch readFrom(byte[] bytes) throws IOException;

89

```

90

91

[Serialization](./serialization.md)

92

93

## Types

94

95

```java { .api }

96

public class IncompatibleMergeException extends Exception {

97

public IncompatibleMergeException(String message);

98

}

99

```

100

101

## Supported Data Types

102

103

Both Bloom filters and Count-Min sketches support the following Java data types:

104

- `Byte`, `Short`, `Integer`, `Long`

105

- `String`

106

- `byte[]` arrays

107

- Generic `Object` with automatic type detection and conversion