or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

cardinality.mdfrequency.mdhash.mdindex.mdmembership.mdquantile.mdstream-summary.md

frequency.mddocs/

0

# Frequency Estimation

1

2

Data structures for estimating the frequency of items in a data stream with configurable error bounds and confidence levels. These algorithms provide memory-efficient approximations for tracking item counts when exact counting would require too much memory.

3

4

## Capabilities

5

6

### IFrequency Interface

7

8

Common interface implemented by all frequency estimation algorithms.

9

10

```java { .api }

11

/**

12

* Interface for frequency estimation algorithms

13

*/

14

public interface IFrequency {

15

/**

16

* Add count for a long item

17

* @param item the item to count

18

* @param count number to add to the item's count

19

*/

20

void add(long item, long count);

21

22

/**

23

* Add count for a string item

24

* @param item the item to count

25

* @param count number to add to the item's count

26

*/

27

void add(String item, long count);

28

29

/**

30

* Estimate count for a long item

31

* @param item the item to query

32

* @return estimated count (may overestimate, never underestimates)

33

*/

34

long estimateCount(long item);

35

36

/**

37

* Estimate count for a string item

38

* @param item the item to query

39

* @return estimated count (may overestimate, never underestimates)

40

*/

41

long estimateCount(String item);

42

43

/**

44

* Get total size/count processed

45

* @return total number of items processed

46

*/

47

long size();

48

}

49

```

50

51

### CountMinSketch

52

53

Count-Min Sketch data structure for frequency estimation with configurable error bounds and confidence levels.

54

55

```java { .api }

56

/**

57

* Count-Min Sketch frequency estimator

58

*/

59

public class CountMinSketch implements IFrequency, Serializable {

60

public static final long PRIME_MODULUS = (1L << 31) - 1;

61

62

/**

63

* Create Count-Min Sketch with specified dimensions

64

* @param depth number of hash functions (affects confidence)

65

* @param width width of each hash table (affects accuracy)

66

* @param seed random seed for hash functions

67

*/

68

public CountMinSketch(int depth, int width, int seed);

69

70

/**

71

* Create Count-Min Sketch with error and confidence parameters

72

* @param epsOfTotalCount relative error as fraction of total count

73

* @param confidence confidence level (e.g., 0.99 for 99%)

74

* @param seed random seed for hash functions

75

*/

76

public CountMinSketch(double epsOfTotalCount, double confidence, int seed);

77

78

/**

79

* Get relative error bound

80

* @return relative error as fraction

81

*/

82

public double getRelativeError();

83

84

/**

85

* Get confidence level

86

* @return confidence level

87

*/

88

public double getConfidence();

89

90

/**

91

* Merge multiple Count-Min Sketches

92

* @param estimators sketches to merge (must have same dimensions)

93

* @return merged sketch

94

* @throws CMSMergeException if sketches are incompatible

95

*/

96

public static CountMinSketch merge(CountMinSketch... estimators) throws CMSMergeException;

97

98

/**

99

* Serialize sketch to byte array

100

* @param sketch sketch to serialize

101

* @return serialized bytes

102

* @throws IOException if serialization fails

103

*/

104

public static byte[] serialize(CountMinSketch sketch) throws IOException;

105

106

/**

107

* Deserialize sketch from byte array

108

* @param data serialized sketch data

109

* @return deserialized sketch

110

* @throws IOException if deserialization fails

111

*/

112

public static CountMinSketch deserialize(byte[] data) throws IOException;

113

114

/**

115

* Exception for Count-Min Sketch merge errors

116

*/

117

public static class CMSMergeException extends FrequencyMergeException {

118

public CMSMergeException(String message);

119

}

120

}

121

```

122

123

**Usage Examples:**

124

125

```java

126

import com.clearspring.analytics.stream.frequency.CountMinSketch;

127

128

// Create with 1% error, 99% confidence

129

CountMinSketch cms = new CountMinSketch(0.01, 0.99, 1234);

130

131

// Add items

132

cms.add("apple", 5);

133

cms.add("banana", 3);

134

cms.add("cherry", 7);

135

cms.add("apple", 2); // apple now has count ~7

136

137

// Query frequencies

138

long appleCount = cms.estimateCount("apple"); // returns >= 7

139

long bananaCount = cms.estimateCount("banana"); // returns >= 3

140

long unknownCount = cms.estimateCount("unknown"); // returns >= 0

141

142

// Total items processed

143

long totalSize = cms.size(); // returns 17

144

```

145

146

### ConservativeAddSketch

147

148

Conservative update variant of Count-Min Sketch that reduces overestimation by using minimum update strategy.

149

150

```java { .api }

151

/**

152

* Conservative Count-Min Sketch with reduced overestimation

153

*/

154

public class ConservativeAddSketch extends CountMinSketch {

155

/**

156

* Create Conservative Count-Min Sketch with specified dimensions

157

* @param depth number of hash functions

158

* @param width width of each hash table

159

* @param seed random seed for hash functions

160

*/

161

public ConservativeAddSketch(int depth, int width, int seed);

162

163

/**

164

* Create Conservative Count-Min Sketch with error and confidence parameters

165

* @param epsOfTotalCount relative error as fraction of total count

166

* @param confidence confidence level

167

* @param seed random seed for hash functions

168

*/

169

public ConservativeAddSketch(double epsOfTotalCount, double confidence, int seed);

170

}

171

```

172

173

### FrequencyMergeException

174

175

Base exception class for frequency estimation merge errors.

176

177

```java { .api }

178

/**

179

* Base exception for frequency estimation merge errors

180

*/

181

public abstract class FrequencyMergeException extends Exception {

182

public FrequencyMergeException();

183

public FrequencyMergeException(String message);

184

}

185

```

186

187

## Usage Patterns

188

189

### Basic Frequency Tracking

190

191

```java

192

// Create sketch with desired accuracy

193

CountMinSketch cms = new CountMinSketch(0.01, 0.99, 42);

194

195

// Process stream data

196

cms.add("user123", 1);

197

cms.add("user456", 1);

198

cms.add("user123", 1); // user123 now has count 2

199

200

// Query frequencies

201

long user123Count = cms.estimateCount("user123"); // >= 2

202

long user456Count = cms.estimateCount("user456"); // >= 1

203

```

204

205

### Heavy Hitters Detection

206

207

```java

208

CountMinSketch cms = new CountMinSketch(0.001, 0.999, 1);

209

long threshold = 1000; // define heavy hitter threshold

210

211

// Process data stream

212

for (String item : streamData) {

213

cms.add(item, 1);

214

215

// Check if item is a heavy hitter

216

if (cms.estimateCount(item) >= threshold) {

217

System.out.println(item + " is a heavy hitter");

218

}

219

}

220

```

221

222

### Distributed Frequency Estimation

223

224

```java

225

// Create compatible sketches on different nodes

226

CountMinSketch cms1 = new CountMinSketch(0.01, 0.99, 123);

227

CountMinSketch cms2 = new CountMinSketch(0.01, 0.99, 123);

228

229

// Process data on each node

230

cms1.add("item1", 10);

231

cms1.add("item2", 5);

232

233

cms2.add("item1", 8);

234

cms2.add("item3", 3);

235

236

// Merge sketches

237

CountMinSketch merged = CountMinSketch.merge(cms1, cms2);

238

239

// Query combined frequencies

240

long item1Count = merged.estimateCount("item1"); // >= 18

241

long item2Count = merged.estimateCount("item2"); // >= 5

242

long item3Count = merged.estimateCount("item3"); // >= 3

243

```

244

245

### Serialization for Persistence

246

247

```java

248

CountMinSketch cms = new CountMinSketch(0.01, 0.99, 456);

249

250

// Add data

251

cms.add("data1", 100);

252

cms.add("data2", 200);

253

254

// Serialize to bytes

255

byte[] serialized = CountMinSketch.serialize(cms);

256

257

// Later, deserialize

258

CountMinSketch restored = CountMinSketch.deserialize(serialized);

259

260

// Verify data is preserved

261

long data1Count = restored.estimateCount("data1"); // >= 100

262

```

263

264

### Parameter Selection Guidelines

265

266

**Depth (d)**: Controls confidence level

267

- Higher depth = higher confidence but more memory usage

268

- d = ceil(ln(1/δ)) where δ is failure probability

269

- For 99% confidence: d = 5, for 99.9% confidence: d = 7

270

271

**Width (w)**: Controls accuracy

272

- Higher width = better accuracy but more memory usage

273

- w = ceil(e/ε) where ε is relative error

274

- For 1% error: w = 272, for 0.1% error: w = 2719

275

276

**Conservative vs Standard**:

277

- Use ConservativeAddSketch when overestimation must be minimized

278

- Standard CountMinSketch is faster but may overestimate more

279

- Conservative variant is better for applications requiring tight bounds

280

281

## Memory Usage

282

283

Count-Min Sketch memory usage: `depth × width × 8 bytes`

284

285

Examples:

286

- 1% error, 99% confidence: ~11 KB

287

- 0.1% error, 99.9% confidence: ~150 KB

288

- 0.01% error, 99.99% confidence: ~2.1 MB