A library for summarizing data in streams for which it is infeasible to store all events
npx @tessl/cli install tessl/maven-com-clearspring-analytics--stream@2.9.00
# Stream-lib
1
2
Stream-lib is a Java library for summarizing data in streams where it is infeasible to store all events. It provides probabilistic data structures and algorithms for cardinality estimation (counting unique elements), frequency estimation (counting occurrences), quantile estimation (computing percentiles), set membership testing (Bloom filters), and top-K element tracking. A key feature is that cardinality estimators with compatible configurations can be safely merged, making it ideal for distributed computing scenarios.
3
4
## Package Information
5
6
- **Package Name**: stream
7
- **Package Type**: Maven
8
- **Group ID**: com.clearspring.analytics
9
- **Artifact ID**: stream
10
- **Language**: Java
11
- **Installation**:
12
```xml
13
<dependency>
14
<groupId>com.clearspring.analytics</groupId>
15
<artifactId>stream</artifactId>
16
<version>2.9.8</version>
17
</dependency>
18
```
19
20
## Core Imports
21
22
```java
23
import com.clearspring.analytics.stream.cardinality.*;
24
import com.clearspring.analytics.stream.frequency.*;
25
import com.clearspring.analytics.stream.quantile.*;
26
import com.clearspring.analytics.stream.membership.*;
27
import com.clearspring.analytics.stream.*;
28
```
29
30
## Basic Usage
31
32
```java
33
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
34
import com.clearspring.analytics.stream.frequency.CountMinSketch;
35
import com.clearspring.analytics.stream.StreamSummary;
36
37
// Cardinality estimation
38
HyperLogLog hll = new HyperLogLog(0.1); // 10% relative standard deviation
39
hll.offer("item1");
40
hll.offer("item2");
41
hll.offer("item1"); // duplicate
42
long uniqueCount = hll.cardinality(); // approximately 2
43
44
// Frequency estimation
45
CountMinSketch cms = new CountMinSketch(0.01, 0.99, 1); // 1% error, 99% confidence
46
cms.add("apple", 5);
47
cms.add("banana", 3);
48
long appleCount = cms.estimateCount("apple"); // approximately 5
49
50
// Top-K tracking
51
StreamSummary<String> topK = new StreamSummary<>(10); // track top 10
52
topK.offer("apple");
53
topK.offer("banana");
54
topK.offer("apple");
55
List<String> top3 = topK.peek(3); // ["apple", "banana"]
56
```
57
58
## Architecture
59
60
Stream-lib is organized into several key functional areas:
61
62
- **Hash Functions**: Fast, well-distributed hash functions (MurmurHash, Lookup3Hash) used by probabilistic data structures
63
- **Cardinality Estimation**: Algorithms for counting unique elements (HyperLogLog, LogLog, LinearCounting)
64
- **Frequency Estimation**: Data structures for tracking item frequencies (Count-Min Sketch, Conservative Add Sketch)
65
- **Set Membership**: Bloom filters for testing whether an element is in a set
66
- **Quantile Estimation**: Algorithms for computing percentiles and quantiles (QDigest, TDigest)
67
- **Stream Summarization**: Top-K tracking and stream summary algorithms
68
- **Utilities**: Supporting data structures and helper functions
69
70
## Capabilities
71
72
### Cardinality Estimation
73
74
Probabilistic algorithms for estimating the number of unique elements in a stream. All estimators can be merged if they have compatible configurations.
75
76
```java { .api }
77
public interface ICardinality {
78
boolean offer(Object o);
79
boolean offerHashed(long hashedLong);
80
boolean offerHashed(int hashedInt);
81
long cardinality();
82
int sizeof();
83
byte[] getBytes() throws IOException;
84
ICardinality merge(ICardinality... estimators) throws CardinalityMergeException;
85
}
86
```
87
88
[Cardinality Estimation](./cardinality.md)
89
90
### Frequency Estimation
91
92
Data structures for estimating the frequency of items in a stream with configurable error bounds and confidence levels.
93
94
```java { .api }
95
public interface IFrequency {
96
void add(long item, long count);
97
void add(String item, long count);
98
long estimateCount(long item);
99
long estimateCount(String item);
100
long size();
101
}
102
```
103
104
[Frequency Estimation](./frequency.md)
105
106
### Set Membership Testing
107
108
Bloom filters for probabilistic set membership testing with configurable false positive rates but no false negatives.
109
110
```java { .api }
111
public class BloomFilter extends Filter {
112
public BloomFilter(int numElements, double maxFalsePosProbability);
113
public boolean isPresent(String key);
114
public boolean isPresent(byte[] key);
115
public void add(String key);
116
public void add(byte[] key);
117
public void addAll(BloomFilter other);
118
}
119
```
120
121
[Membership Testing](./membership.md)
122
123
### Quantile Estimation
124
125
Algorithms for computing quantiles and percentiles in streaming data with memory-efficient approximation techniques.
126
127
```java { .api }
128
public interface IQuantileEstimator {
129
void offer(long value);
130
long getQuantile(double q);
131
}
132
```
133
134
[Quantile Estimation](./quantile.md)
135
136
### Stream Summarization and Top-K
137
138
Algorithms for tracking the most frequent items and maintaining stream summaries with error bounds.
139
140
```java { .api }
141
public interface ITopK<T> {
142
boolean offer(T element);
143
boolean offer(T element, int incrementCount);
144
List<T> peek(int k);
145
}
146
```
147
148
[Stream Summarization](./stream-summary.md)
149
150
### Hash Functions
151
152
Fast, well-distributed hash functions optimized for use in probabilistic data structures and general hash-based lookup.
153
154
```java { .api }
155
public class MurmurHash {
156
public static int hash(Object o);
157
public static int hash(byte[] data, int seed);
158
public static long hash64(Object o);
159
public static long hash64(byte[] data, int length, int seed);
160
}
161
```
162
163
[Hash Functions](./hash.md)
164
165
## Types
166
167
### Common Exceptions
168
169
```java { .api }
170
public abstract class CardinalityMergeException extends Exception {
171
public CardinalityMergeException(String message);
172
}
173
174
public abstract class FrequencyMergeException extends Exception {
175
}
176
```
177
178
### Utility Types
179
180
```java { .api }
181
public interface IBuilder<T> {
182
T build();
183
int sizeof();
184
}
185
186
public class Pair<A, B> {
187
public final A left;
188
public final B right;
189
190
public Pair(A left, B right);
191
}
192
193
public class Counter<T> implements Externalizable {
194
public T getItem();
195
public long getCount();
196
public long getError();
197
}
198
```