Bloom Filter: A Probabilistic Data Structure
npx @tessl/cli install tessl/pypi-pybloom-live@3.1.0
# Pybloom Live
A Python implementation of Bloom filter probabilistic data structures, forked from pybloom with an improved tightening ratio of 0.9 for better average space usage across wide ranges of growth. Provides both standard Bloom Filters for known set sizes and Scalable Bloom Filters that can grow dynamically without knowing the original set size.
## Package Information

- **Package Name**: pybloom-live
- **Language**: Python
- **Installation**: `pip install pybloom-live`
- **Dependencies**: bitarray>=0.3.4

## Core Imports

```python
import pybloom_live
```
Standard pattern for accessing classes:

```python
from pybloom_live import BloomFilter, ScalableBloomFilter
```

## Basic Usage

```python
from pybloom_live import BloomFilter, ScalableBloomFilter

# Create a standard bloom filter for a known set size
bf = BloomFilter(capacity=1000, error_rate=0.001)

# Add items to the filter
bf.add("item1")
bf.add("item2")
bf.add("item3")

# Test membership (may have false positives, never false negatives)
print("item1" in bf)  # True
print("item4" in bf)  # False (or rarely True - a false positive)

# Create a scalable bloom filter for an unknown set size
sbf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001)
for i in range(10000):
    sbf.add(f"item{i}")

print("item5000" in sbf)  # True
print(len(sbf))  # 10000
```
## Architecture

Pybloom-live implements two main probabilistic data structures:

- **BloomFilter**: Space-efficient probabilistic data structure for fixed-size sets with a configurable false positive rate
- **ScalableBloomFilter**: Dynamic bloom filter that grows automatically by creating internal BloomFilter instances with exponentially increasing capacity

Both implementations select a hash function (MD5, SHA-1, SHA-256, SHA-384, or SHA-512) based on the number of hash bits required, and support set operations (union, intersection), serialization to files, and pickling for persistence.

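
As a rough sketch of these mechanics (illustrative only, not pybloom-live's actual implementation, which uses the digest selection and bitarray storage described above), a slice-per-hash-function Bloom filter can be modeled like this:

```python
import hashlib
import math

def make_sketch_filter(capacity: int, error_rate: float):
    # Standard Bloom filter sizing: k hash slices, each with its own bit range
    k = math.ceil(math.log2(1.0 / error_rate))
    bits_per_slice = math.ceil(
        capacity * abs(math.log(error_rate)) / (k * (math.log(2) ** 2)))
    bits = [False] * (k * bits_per_slice)

    def indexes(key: str):
        # Derive k slice offsets from a single SHA-256 digest
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(k):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield i * bits_per_slice + chunk % bits_per_slice

    def add(key: str) -> None:
        for idx in indexes(key):
            bits[idx] = True

    def contains(key: str) -> bool:
        # All k bits set -> "probably present"; any bit clear -> definitely absent
        return all(bits[idx] for idx in indexes(key))

    return add, contains

add, contains = make_sketch_filter(capacity=1000, error_rate=0.01)
add("hello")
print(contains("hello"))  # True (never a false negative)
```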
## Capabilities

### Standard Bloom Filter

Fixed-capacity probabilistic data structure for set membership testing with a configurable false positive probability. Ideal when the maximum set size is known in advance.

```python { .api }
class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.001):
        """
        Initialize a bloom filter with fixed capacity.

        Parameters:
        - capacity: Maximum number of elements before degradation
        - error_rate: Target false positive probability (0 < error_rate < 1)

        Raises:
        - ValueError: If error_rate is not between 0 and 1, or capacity <= 0
        """

    def add(self, key, skip_check: bool = False) -> bool:
        """
        Add a key to the bloom filter.

        Parameters:
        - key: Item to add (string or any hashable object)
        - skip_check: Skip the membership check for performance

        Returns:
        - bool: True if the key was already in the filter, False if newly added

        Raises:
        - IndexError: If the filter is at capacity
        """

    def __contains__(self, key) -> bool:
        """Test key membership in the filter (supports the 'in' operator)."""

    def __len__(self) -> int:
        """Return the number of keys stored in the filter."""

    def copy(self) -> 'BloomFilter':
        """Return a copy of this bloom filter."""

    def union(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate the union with another filter.

        Parameters:
        - other: Another BloomFilter with the same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing the union of both filters

        Raises:
        - ValueError: If the filters have different capacity or error_rate
        """

    def intersection(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate the intersection with another filter.

        Parameters:
        - other: Another BloomFilter with the same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing the intersection of both filters

        Raises:
        - ValueError: If the filters have different capacity or error_rate
        """

    def __or__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Union operator overload (| operator)."""

    def __and__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Intersection operator overload (& operator)."""

    def tofile(self, f):
        """
        Write the bloom filter to a file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f, n: int = -1) -> 'BloomFilter':
        """
        Read a bloom filter from a file object.

        Parameters:
        - f: File-like object for reading binary data
        - n: Maximum bytes to read (-1 for all)

        Returns:
        - BloomFilter: Deserialized bloom filter

        Raises:
        - ValueError: If n is too small or the bit length mismatches
        """
```
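
The constructor sizes the filter from `capacity` and `error_rate` using the standard Bloom filter formulas; the arithmetic below is a hedged sketch of that sizing (the library's internals may differ in detail):

```python
import math

def bloom_parameters(capacity: int, error_rate: float):
    # Number of hash slices: k = ceil(log2(1 / error_rate))
    num_slices = math.ceil(math.log2(1.0 / error_rate))
    # Bits per slice from m = n * |ln p| / (k * ln(2)^2)
    bits_per_slice = math.ceil(
        capacity * abs(math.log(error_rate)) /
        (num_slices * (math.log(2) ** 2)))
    return num_slices, bits_per_slice, num_slices * bits_per_slice

k, bps, m = bloom_parameters(capacity=1000, error_rate=0.001)
print(k)  # 10 hash functions
print(m)  # total bits = k * bits_per_slice
```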
#### Properties and Attributes

```python { .api }
# Instance attributes (read-only after initialization)
bf.error_rate: float       # False positive error rate
bf.num_slices: int         # Number of hash functions
bf.bits_per_slice: int     # Bits per hash function
bf.capacity: int           # Maximum elements before degradation
bf.num_bits: int           # Total bits in the filter
bf.count: int              # Current number of elements
bf.bitarray                # Underlying bitarray storage

# Class attributes
BloomFilter.FILE_FMT = b'<dQQQQ'  # Struct format for serialization
```
### Scalable Bloom Filter

Dynamic bloom filter that grows automatically when the final set size is unknown, maintaining a steady false positive rate by creating internal BloomFilter instances with exponentially increasing capacity.

```python { .api }
class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 100, error_rate: float = 0.001,
                 mode: int = 4):
        """
        Initialize a scalable bloom filter.

        Parameters:
        - initial_capacity: Capacity of the first internal filter
        - error_rate: Target false positive probability
        - mode: Growth mode (SMALL_SET_GROWTH=2 or LARGE_SET_GROWTH=4)
        """

    def add(self, key) -> bool:
        """
        Add a key to the scalable filter.

        Parameters:
        - key: Item to add (string or any hashable object)

        Returns:
        - bool: True if the key was already in the filter, False if newly added
        """

    def __contains__(self, key) -> bool:
        """Test key membership in the filter (supports the 'in' operator)."""

    def __len__(self) -> int:
        """Return the total number of elements stored across all internal filters."""

    def union(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """
        Calculate the union with another scalable filter.

        Parameters:
        - other: Another ScalableBloomFilter with the same parameters

        Returns:
        - ScalableBloomFilter: New filter containing the union

        Raises:
        - ValueError: If the filters have different mode, initial_capacity, or error_rate
        """

    def __or__(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """Union operator overload (| operator)."""

    def tofile(self, f):
        """
        Serialize the scalable bloom filter to a file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f) -> 'ScalableBloomFilter':
        """
        Deserialize a scalable bloom filter from a file object.

        Parameters:
        - f: File-like object for reading binary data

        Returns:
        - ScalableBloomFilter: Deserialized scalable filter
        """
```
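
A hedged sketch of the growth policy described above, with capacities scaling geometrically by the growth mode and error budgets tightening by the 0.9 ratio; this models the schedule, not the library's literal code:

```python
def growth_schedule(initial_capacity=100, error_rate=0.001,
                    scale=4, ratio=0.9, generations=5):
    # Each successive internal filter multiplies capacity by `scale`
    # and tightens its error budget by `ratio`, keeping the summed
    # error rate near the overall target
    return [(initial_capacity * scale ** i, error_rate * ratio ** i)
            for i in range(generations)]

for cap, err in growth_schedule():
    print(f"capacity={cap:>6}  error_rate={err:.7f}")
```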
#### Properties and Constants

```python { .api }
# Properties (computed from the internal filters)
@property
def capacity(self) -> int:
    """Total capacity of all internal filters."""

@property
def count(self) -> int:
    """Total number of elements (alias for __len__)."""

# Instance attributes
sbf.scale: int             # Growth mode constant
sbf.ratio: float           # Tightening ratio (0.9)
sbf.initial_capacity: int  # Initial capacity
sbf.error_rate: float      # Target error rate
sbf.filters: list          # List of internal BloomFilter objects

# Class constants
ScalableBloomFilter.SMALL_SET_GROWTH = 2  # Slower growth, less memory
ScalableBloomFilter.LARGE_SET_GROWTH = 4  # Faster growth, more memory (default)
ScalableBloomFilter.FILE_FMT = '<idQd'    # Struct format for serialization
```
### Set Operations

Both BloomFilter and ScalableBloomFilter support set operations for combining filters with identical parameters.

```python
# Union operations (combine two filters)
union_filter = bloom1.union(bloom2)
union_filter = bloom1 | bloom2

# Intersection operations (BloomFilter only)
intersection_filter = bloom1.intersection(bloom2)
intersection_filter = bloom1 & bloom2

# Copy operation (BloomFilter only)
copied_filter = bloom1.copy()
```
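
Union works because two filters built with identical parameters share the same hash layout, so OR-ing their bit arrays yields the same filter as adding both key sets to one filter. A minimal sketch using plain Python integers as bit arrays (a single hash stands in for the k slices a real filter uses):

```python
def bits_for(keys, num_bits=64):
    # Model a bit array as a Python int; set one bit per key
    bit_array = 0
    for key in keys:
        bit_array |= 1 << (hash(key) % num_bits)
    return bit_array

a = bits_for(["x", "y"])
b = bits_for(["y", "z"])
print((a | b) == bits_for(["x", "y", "z"]))  # True: OR-ing equals adding both sets
```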
### Serialization

Both filter types support file serialization and Python pickling for persistence.

```python
# File serialization
with open('filter.dat', 'wb') as f:
    bloom_filter.tofile(f)

with open('filter.dat', 'rb') as f:
    loaded_filter = BloomFilter.fromfile(f)

# Pickle support (automatic via __getstate__/__setstate__)
import pickle

pickled_data = pickle.dumps(bloom_filter)
loaded_filter = pickle.loads(pickled_data)
```
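
`FILE_FMT` constants like the one listed for BloomFilter are `struct` format strings describing the serialized header. As an illustration, with hypothetical field values, packing and unpacking such a header with the standard library:

```python
import struct

FILE_FMT = b'<dQQQQ'  # little-endian: one double + four unsigned 64-bit ints

# Hypothetical header values, for illustration only
header = struct.pack(FILE_FMT,
                     0.001,  # error_rate
                     10,     # num_slices
                     1438,   # bits_per_slice
                     1000,   # capacity
                     42)     # count

print(struct.calcsize(FILE_FMT))           # 40 header bytes
print(struct.unpack(FILE_FMT, header)[3])  # 1000 (capacity field)
```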
## Types

```python { .api }
# Type aliases for documentation
HashableKey = Union[str, int, float, bytes, Tuple[Hashable, ...]]
FileObject = Union[io.IOBase, typing.BinaryIO]
```
## Error Handling

The package raises these exceptions:

- **ValueError**: Invalid error_rate (BloomFilter: not between 0 and 1; ScalableBloomFilter: ≤ 0), invalid capacity (≤ 0), or mismatched filter parameters for set operations
- **IndexError**: Attempting to add items beyond BloomFilter capacity
- **ImportError**: Missing required bitarray dependency

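
The ValueError conditions can be expressed as a standalone check; this is an illustrative sketch mirroring the documented constraints, not the library's exact code or messages:

```python
def check_bloom_args(capacity: int, error_rate: float) -> None:
    # Mirrors the documented BloomFilter constructor constraints
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be between 0 and 1")
    if capacity <= 0:
        raise ValueError("capacity must be greater than 0")

for args in [(1000, 0.001), (1000, 1.5), (0, 0.001)]:
    try:
        check_bloom_args(*args)
        print(args, "ok")
    except ValueError as exc:
        print(args, "rejected:", exc)
```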
## Usage Examples

### Basic Membership Testing

```python
from pybloom_live import BloomFilter

# Create a filter for an expected 10,000 items with a 0.1% false positive rate
bf = BloomFilter(capacity=10000, error_rate=0.001)

# Add items
for i in range(5000):
    bf.add(f"user_{i}")

# Test membership
print("user_100" in bf)   # True
print("user_9999" in bf)  # False (not added; a false positive is rare)
print(f"Filter contains {len(bf)} items")  # Filter contains 5000 items
```

### Scalable Filter for Unknown Set Size

```python
from pybloom_live import ScalableBloomFilter

# Create a scalable filter that grows automatically
sbf = ScalableBloomFilter(
    initial_capacity=1000,
    error_rate=0.001,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

# Add a large number of items - the filter grows automatically
for i in range(100000):
    sbf.add(f"item_{i}")

print(f"Total capacity: {sbf.capacity}")
print(f"Total items: {len(sbf)}")
print(f"Number of internal filters: {len(sbf.filters)}")
```
### Combining Filters with Set Operations

```python
from pybloom_live import BloomFilter

# Create two filters with identical parameters
bf1 = BloomFilter(capacity=1000, error_rate=0.001)
bf2 = BloomFilter(capacity=1000, error_rate=0.001)

# Add different items to each
for i in range(500):
    bf1.add(f"set1_item_{i}")
    bf2.add(f"set2_item_{i}")

# The union contains items from both filters
union_bf = bf1 | bf2

# Check that items exist in the union
print("set1_item_100" in union_bf)  # True
print("set2_item_100" in union_bf)  # True

# The intersection keeps only bits set in both filters (no common items here)
intersection_bf = bf1 & bf2
print("set1_item_100" in intersection_bf)  # Usually False: the key is only in bf1
```