A library for summarizing data in streams for which it is infeasible to store all events
npx @tessl/cli install tessl/maven-com-clearspring-analytics--stream@2.9.0Stream-lib is a Java library for summarizing data in streams where it is infeasible to store all events. It provides probabilistic data structures and algorithms for cardinality estimation (counting unique elements), frequency estimation (counting occurrences), quantile estimation (computing percentiles), set membership testing (Bloom filters), and top-K element tracking. A key feature is that cardinality estimators with compatible configurations can be safely merged, making it ideal for distributed computing scenarios.
<dependency>
<groupId>com.clearspring.analytics</groupId>
<artifactId>stream</artifactId>
<version>2.9.8</version>
</dependency>import com.clearspring.analytics.stream.cardinality.*;
import com.clearspring.analytics.stream.frequency.*;
import com.clearspring.analytics.stream.quantile.*;
import com.clearspring.analytics.stream.membership.*;
import com.clearspring.analytics.stream.*;import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.frequency.CountMinSketch;
import com.clearspring.analytics.stream.StreamSummary;
// Cardinality estimation
HyperLogLog hll = new HyperLogLog(0.1); // 10% relative standard deviation
hll.offer("item1");
hll.offer("item2");
hll.offer("item1"); // duplicate
long uniqueCount = hll.cardinality(); // approximately 2
// Frequency estimation
CountMinSketch cms = new CountMinSketch(0.01, 0.99, 1); // 1% error, 99% confidence
cms.add("apple", 5);
cms.add("banana", 3);
long appleCount = cms.estimateCount("apple"); // approximately 5
// Top-K tracking
StreamSummary<String> topK = new StreamSummary<>(10); // track top 10
topK.offer("apple");
topK.offer("banana");
topK.offer("apple");
List<String> top3 = topK.peek(3); // ["apple", "banana"]Stream-lib is organized into several key functional areas:
Probabilistic algorithms for estimating the number of unique elements in a stream. All estimators can be merged if they have compatible configurations.
public interface ICardinality {
boolean offer(Object o);
boolean offerHashed(long hashedLong);
boolean offerHashed(int hashedInt);
long cardinality();
int sizeof();
byte[] getBytes() throws IOException;
ICardinality merge(ICardinality... estimators) throws CardinalityMergeException;
}Data structures for estimating the frequency of items in a stream with configurable error bounds and confidence levels.
public interface IFrequency {
void add(long item, long count);
void add(String item, long count);
long estimateCount(long item);
long estimateCount(String item);
long size();
}Bloom filters for probabilistic set membership testing with configurable false positive rates but no false negatives.
public class BloomFilter extends Filter {
public BloomFilter(int numElements, double maxFalsePosProbability);
public boolean isPresent(String key);
public boolean isPresent(byte[] key);
public void add(String key);
public void add(byte[] key);
public void addAll(BloomFilter other);
}Algorithms for computing quantiles and percentiles in streaming data with memory-efficient approximation techniques.
public interface IQuantileEstimator {
void offer(long value);
long getQuantile(double q);
}Algorithms for tracking the most frequent items and maintaining stream summaries with error bounds.
public interface ITopK<T> {
boolean offer(T element);
boolean offer(T element, int incrementCount);
List<T> peek(int k);
}Fast, well-distributed hash functions optimized for use in probabilistic data structures and general hash-based lookup.
public class MurmurHash {
public static int hash(Object o);
public static int hash(byte[] data, int seed);
public static long hash64(Object o);
public static long hash64(byte[] data, int length, int seed);
}public abstract class CardinalityMergeException extends Exception {
public CardinalityMergeException(String message);
}
public abstract class FrequencyMergeException extends Exception {
}public interface IBuilder<T> {
T build();
int sizeof();
}
public class Pair<A, B> {
public final A left;
public final B right;
public Pair(A left, B right);
}
public class Counter<T> implements Externalizable {
public T getItem();
public long getCount();
public long getError();
}