0
# Frequency Estimation
1
2
Data structures for estimating the frequency of items in a data stream with configurable error bounds and confidence levels. These algorithms provide memory-efficient approximations for tracking item counts when exact counting would require too much memory.
3
4
## Capabilities
5
6
### IFrequency Interface
7
8
Common interface implemented by all frequency estimation algorithms.
9
10
```java { .api }
11
/**
12
* Interface for frequency estimation algorithms
13
*/
14
public interface IFrequency {
15
/**
16
* Add count for a long item
17
* @param item the item to count
18
* @param count number to add to the item's count
19
*/
20
void add(long item, long count);
21
22
/**
23
* Add count for a string item
24
* @param item the item to count
25
* @param count number to add to the item's count
26
*/
27
void add(String item, long count);
28
29
/**
30
* Estimate count for a long item
31
* @param item the item to query
32
* @return estimated count (may overestimate, never underestimates)
33
*/
34
long estimateCount(long item);
35
36
/**
37
* Estimate count for a string item
38
* @param item the item to query
39
* @return estimated count (may overestimate, never underestimates)
40
*/
41
long estimateCount(String item);
42
43
/**
44
* Get total size/count processed
45
* @return total number of items processed
46
*/
47
long size();
48
}
49
```
50
51
### CountMinSketch
52
53
Count-Min Sketch data structure for frequency estimation with configurable error bounds and confidence levels.
54
55
```java { .api }
56
/**
57
* Count-Min Sketch frequency estimator
58
*/
59
public class CountMinSketch implements IFrequency, Serializable {
60
public static final long PRIME_MODULUS = (1L << 31) - 1;
61
62
/**
63
* Create Count-Min Sketch with specified dimensions
64
* @param depth number of hash functions (affects confidence)
65
* @param width width of each hash table (affects accuracy)
66
* @param seed random seed for hash functions
67
*/
68
public CountMinSketch(int depth, int width, int seed);
69
70
/**
71
* Create Count-Min Sketch with error and confidence parameters
72
* @param epsOfTotalCount relative error as fraction of total count
73
* @param confidence confidence level (e.g., 0.99 for 99%)
74
* @param seed random seed for hash functions
75
*/
76
public CountMinSketch(double epsOfTotalCount, double confidence, int seed);
77
78
/**
79
* Get relative error bound
80
* @return relative error as fraction
81
*/
82
public double getRelativeError();
83
84
/**
85
* Get confidence level
86
* @return confidence level
87
*/
88
public double getConfidence();
89
90
/**
91
* Merge multiple Count-Min Sketches
92
* @param estimators sketches to merge (must have same dimensions)
93
* @return merged sketch
94
* @throws CMSMergeException if sketches are incompatible
95
*/
96
public static CountMinSketch merge(CountMinSketch... estimators) throws CMSMergeException;
97
98
/**
99
* Serialize sketch to byte array
100
* @param sketch sketch to serialize
101
* @return serialized bytes
102
* @throws IOException if serialization fails
103
*/
104
public static byte[] serialize(CountMinSketch sketch) throws IOException;
105
106
/**
107
* Deserialize sketch from byte array
108
* @param data serialized sketch data
109
* @return deserialized sketch
110
* @throws IOException if deserialization fails
111
*/
112
public static CountMinSketch deserialize(byte[] data) throws IOException;
113
114
/**
115
* Exception for Count-Min Sketch merge errors
116
*/
117
public static class CMSMergeException extends FrequencyMergeException {
118
public CMSMergeException(String message);
119
}
120
}
121
```
122
123
**Usage Examples:**
124
125
```java
126
import com.clearspring.analytics.stream.frequency.CountMinSketch;
127
128
// Create with 1% error, 99% confidence
129
CountMinSketch cms = new CountMinSketch(0.01, 0.99, 1234);
130
131
// Add items
132
cms.add("apple", 5);
133
cms.add("banana", 3);
134
cms.add("cherry", 7);
135
cms.add("apple", 2); // apple now has count ~7
136
137
// Query frequencies
138
long appleCount = cms.estimateCount("apple"); // returns >= 7
139
long bananaCount = cms.estimateCount("banana"); // returns >= 3
140
long unknownCount = cms.estimateCount("unknown"); // returns >= 0
141
142
// Total items processed
143
long totalSize = cms.size(); // returns 17
144
```
145
146
### ConservativeAddSketch
147
148
Conservative update variant of Count-Min Sketch that reduces overestimation by using minimum update strategy.
149
150
```java { .api }
151
/**
152
* Conservative Count-Min Sketch with reduced overestimation
153
*/
154
public class ConservativeAddSketch extends CountMinSketch {
155
/**
156
* Create Conservative Count-Min Sketch with specified dimensions
157
* @param depth number of hash functions
158
* @param width width of each hash table
159
* @param seed random seed for hash functions
160
*/
161
public ConservativeAddSketch(int depth, int width, int seed);
162
163
/**
164
* Create Conservative Count-Min Sketch with error and confidence parameters
165
* @param epsOfTotalCount relative error as fraction of total count
166
* @param confidence confidence level
167
* @param seed random seed for hash functions
168
*/
169
public ConservativeAddSketch(double epsOfTotalCount, double confidence, int seed);
170
}
171
```
172
173
### FrequencyMergeException
174
175
Base exception class for frequency estimation merge errors.
176
177
```java { .api }
178
/**
179
* Base exception for frequency estimation merge errors
180
*/
181
public abstract class FrequencyMergeException extends Exception {
182
public FrequencyMergeException();
183
public FrequencyMergeException(String message);
184
}
185
```
186
187
## Usage Patterns
188
189
### Basic Frequency Tracking
190
191
```java
192
// Create sketch with desired accuracy
193
CountMinSketch cms = new CountMinSketch(0.01, 0.99, 42);
194
195
// Process stream data
196
cms.add("user123", 1);
197
cms.add("user456", 1);
198
cms.add("user123", 1); // user123 now has count 2
199
200
// Query frequencies
201
long user123Count = cms.estimateCount("user123"); // >= 2
202
long user456Count = cms.estimateCount("user456"); // >= 1
203
```
204
205
### Heavy Hitters Detection
206
207
```java
208
CountMinSketch cms = new CountMinSketch(0.001, 0.999, 1);
209
long threshold = 1000; // define heavy hitter threshold
210
211
// Process data stream
212
for (String item : streamData) {
213
cms.add(item, 1);
214
215
// Check if item is a heavy hitter
216
if (cms.estimateCount(item) >= threshold) {
217
System.out.println(item + " is a heavy hitter");
218
}
219
}
220
```
221
222
### Distributed Frequency Estimation
223
224
```java
225
// Create compatible sketches on different nodes
226
CountMinSketch cms1 = new CountMinSketch(0.01, 0.99, 123);
227
CountMinSketch cms2 = new CountMinSketch(0.01, 0.99, 123);
228
229
// Process data on each node
230
cms1.add("item1", 10);
231
cms1.add("item2", 5);
232
233
cms2.add("item1", 8);
234
cms2.add("item3", 3);
235
236
// Merge sketches
237
CountMinSketch merged = CountMinSketch.merge(cms1, cms2);
238
239
// Query combined frequencies
240
long item1Count = merged.estimateCount("item1"); // >= 18
241
long item2Count = merged.estimateCount("item2"); // >= 5
242
long item3Count = merged.estimateCount("item3"); // >= 3
243
```
244
245
### Serialization for Persistence
246
247
```java
248
CountMinSketch cms = new CountMinSketch(0.01, 0.99, 456);
249
250
// Add data
251
cms.add("data1", 100);
252
cms.add("data2", 200);
253
254
// Serialize to bytes
255
byte[] serialized = CountMinSketch.serialize(cms);
256
257
// Later, deserialize
258
CountMinSketch restored = CountMinSketch.deserialize(serialized);
259
260
// Verify data is preserved
261
long data1Count = restored.estimateCount("data1"); // >= 100
262
```
263
264
### Parameter Selection Guidelines
265
266
**Depth (d)**: Controls confidence level
267
- Higher depth = higher confidence but more memory usage
268
- d = ceil(ln(1/δ)) where δ is failure probability
269
- For 99% confidence: d = 5, for 99.9% confidence: d = 7
270
271
**Width (w)**: Controls accuracy
272
- Higher width = better accuracy but more memory usage
273
- w = ceil(e/ε) where ε is relative error
274
- For 1% error: w = 272, for 0.1% error: w = 2719
275
276
**Conservative vs Standard**:
277
- Use ConservativeAddSketch when overestimation must be minimized
278
- Standard CountMinSketch is faster but may overestimate more
279
- Conservative variant is better for applications requiring tight bounds
280
281
## Memory Usage
282
283
Count-Min Sketch memory usage: `depth × width × 8 bytes`
284
285
Examples:
286
- 1% error, 99% confidence: ~11 KB
287
- 0.1% error, 99.9% confidence: ~150 KB
288
- 0.01% error, 99.99% confidence: ~2.1 MB