
# Pybloom Live

A Python implementation of Bloom filter probabilistic data structures, forked from pybloom with an improved tightening ratio of 0.9 for better average space usage across wide ranges of growth. Provides both standard Bloom Filters for known set sizes and Scalable Bloom Filters that can grow dynamically without knowing the original set size.

## Package Information

- **Package Name**: pybloom-live
- **Language**: Python
- **Installation**: `pip install pybloom-live`
- **Dependencies**: `bitarray>=0.3.4`

## Core Imports

```python
import pybloom_live
```

Standard pattern for accessing classes:

```python
from pybloom_live import BloomFilter, ScalableBloomFilter
```

## Basic Usage

```python
from pybloom_live import BloomFilter, ScalableBloomFilter

# Create a standard bloom filter for a known set size
bf = BloomFilter(capacity=1000, error_rate=0.001)

# Add items to the filter
bf.add("item1")
bf.add("item2")
bf.add("item3")

# Test membership (may have false positives, never false negatives)
print("item1" in bf)  # True
print("item4" in bf)  # False (or rarely True - a false positive)

# Create a scalable bloom filter for an unknown set size
sbf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001)
for i in range(10000):
    sbf.add(f"item{i}")

print("item5000" in sbf)  # True
print(len(sbf))           # 10000
```

## Architecture

Pybloom-live implements two main probabilistic data structures:

- **BloomFilter**: Space-efficient probabilistic data structure for fixed-size sets with a configurable false positive rate
- **ScalableBloomFilter**: Dynamic bloom filter that grows automatically by creating internal BloomFilter instances with exponentially increasing capacity

Both implementations choose a hash function (MD5, SHA1, SHA256, SHA384, or SHA512) based on the number of hash bits required, and support set operations (union, intersection), serialization to files, and pickling for persistence.
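The sizing logic behind this can be illustrated with a small, standard-library-only sketch. This is not pybloom-live's internal code, just the classic Bloom filter arithmetic it is based on: derive the number of hash slices and bits per slice from `capacity` and `error_rate`, then carve a cryptographic digest into per-slice bit positions.

```python
import hashlib
import math

def bloom_parameters(capacity: int, error_rate: float):
    # k = ceil(log2(1/p)) hash slices; bits split evenly across slices
    num_slices = math.ceil(math.log2(1.0 / error_rate))
    bits_per_slice = math.ceil(
        capacity * abs(math.log(error_rate)) /
        (num_slices * (math.log(2) ** 2)))
    return num_slices, bits_per_slice

def bit_positions(key: str, num_slices: int, bits_per_slice: int):
    # Carve one SHA256 digest into num_slices chunks and reduce each
    # chunk modulo the slice size to get that slice's bit index
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    chunk = len(digest) // num_slices
    return [int.from_bytes(digest[i * chunk:(i + 1) * chunk], "big") % bits_per_slice
            for i in range(num_slices)]

k, m = bloom_parameters(1000, 0.001)
print(k, m)  # 10 1438
print(bit_positions("item1", k, m))
```

A lower `error_rate` therefore costs more hash slices and more bits, which is why the filter's memory footprint is fixed at construction time.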

## Capabilities

### Standard Bloom Filter

Fixed-capacity probabilistic data structure for set membership testing with configurable false positive probability. Ideal when the maximum set size is known in advance.

```python { .api }
class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.001):
        """
        Initialize bloom filter with fixed capacity.

        Parameters:
        - capacity: Maximum number of elements before degradation
        - error_rate: Target false positive probability (0 < error_rate < 1)

        Raises:
        - ValueError: If error_rate not between 0 and 1, or capacity <= 0
        """

    def add(self, key, skip_check: bool = False) -> bool:
        """
        Add a key to the bloom filter.

        Parameters:
        - key: Item to add (string or any hashable object)
        - skip_check: Skip membership check for performance

        Returns:
        - bool: True if key was already in filter, False if newly added

        Raises:
        - IndexError: If filter is at capacity
        """

    def __contains__(self, key) -> bool:
        """Test key membership in filter (supports 'in' operator)."""

    def __len__(self) -> int:
        """Return number of keys stored in filter."""

    def copy(self) -> 'BloomFilter':
        """Return a copy of this bloom filter."""

    def union(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate union with another filter.

        Parameters:
        - other: Another BloomFilter with same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing union of both filters

        Raises:
        - ValueError: If filters have different capacity or error_rate
        """

    def intersection(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate intersection with another filter.

        Parameters:
        - other: Another BloomFilter with same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing intersection of both filters

        Raises:
        - ValueError: If filters have different capacity or error_rate
        """

    def __or__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Union operator overload (| operator)."""

    def __and__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Intersection operator overload (& operator)."""

    def tofile(self, f):
        """
        Write bloom filter to file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f, n: int = -1) -> 'BloomFilter':
        """
        Read bloom filter from file object.

        Parameters:
        - f: File-like object for reading binary data
        - n: Maximum bytes to read (-1 for all)

        Returns:
        - BloomFilter: Deserialized bloom filter

        Raises:
        - ValueError: If n too small or bit length mismatch
        """
```

#### Properties and Attributes

```python { .api }
# Instance attributes (read-only after initialization)
bf.error_rate: float       # False positive error rate
bf.num_slices: int         # Number of hash functions
bf.bits_per_slice: int     # Bits per hash function
bf.capacity: int           # Maximum elements before degradation
bf.num_bits: int           # Total bits in filter
bf.count: int              # Current number of elements
bf.bitarray                # Underlying bitarray storage

# Class attributes
BloomFilter.FILE_FMT = b'<dQQQQ'  # File format for serialization
```

### Scalable Bloom Filter

Dynamic bloom filter that grows automatically without knowing the original set size, maintaining steady false positive rate by creating internal BloomFilter instances with exponentially increasing capacity.
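The growth scheme can be sketched in a few lines of plain Python. This is an illustration of the documented defaults (growth factor 4, tightening ratio 0.9), not the package's internal code: each new internal filter multiplies the previous capacity by the growth mode and tightens its error rate by the ratio, so the overall false positive rate stays bounded as the filter scales.

```python
# Assumed parameters mirroring the documented defaults
initial_capacity = 100
error_rate = 0.001
mode = 4     # LARGE_SET_GROWTH
ratio = 0.9  # tightening ratio

# Capacity and error rate of each successive internal filter
filters = [(initial_capacity * mode ** i, error_rate * ratio ** i)
           for i in range(4)]

for cap, err in filters:
    print(f"capacity={cap:>6}  error_rate={err:.6f}")
```

Because each filter's error budget shrinks geometrically, the sum of the per-filter error rates converges, which is what keeps the compound false positive probability near the requested target.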

```python { .api }
class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 100, error_rate: float = 0.001,
                 mode: int = 4):
        """
        Initialize scalable bloom filter.

        Parameters:
        - initial_capacity: Initial capacity of first internal filter
        - error_rate: Target false positive probability
        - mode: Growth mode (SMALL_SET_GROWTH=2 or LARGE_SET_GROWTH=4)
        """

    def add(self, key) -> bool:
        """
        Add key to scalable filter.

        Parameters:
        - key: Item to add (string or any hashable object)

        Returns:
        - bool: True if key was already in filter, False if newly added
        """

    def __contains__(self, key) -> bool:
        """Test key membership in filter (supports 'in' operator)."""

    def __len__(self) -> int:
        """Return total number of elements stored across all internal filters."""

    def union(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """
        Calculate union with another scalable filter.

        Parameters:
        - other: Another ScalableBloomFilter with same parameters

        Returns:
        - ScalableBloomFilter: New filter containing union

        Raises:
        - ValueError: If filters have different mode, initial_capacity, or error_rate
        """

    def __or__(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """Union operator overload (| operator)."""

    def tofile(self, f):
        """
        Serialize scalable bloom filter to file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f) -> 'ScalableBloomFilter':
        """
        Deserialize scalable bloom filter from file object.

        Parameters:
        - f: File-like object for reading binary data

        Returns:
        - ScalableBloomFilter: Deserialized scalable filter
        """
```

#### Properties and Constants

```python { .api }
# Properties (computed from internal filters)
@property
def capacity(self) -> int:
    """Total capacity for all internal filters."""

@property
def count(self) -> int:
    """Total number of elements (alias for __len__)."""

# Instance attributes
sbf.scale: int             # Growth mode constant
sbf.ratio: float           # Tightening ratio (0.9)
sbf.initial_capacity: int  # Initial capacity
sbf.error_rate: float      # Target error rate
sbf.filters: list          # List of internal BloomFilter objects

# Class constants
ScalableBloomFilter.SMALL_SET_GROWTH = 2  # Slower growth, less memory
ScalableBloomFilter.LARGE_SET_GROWTH = 4  # Faster growth, more memory (default)
ScalableBloomFilter.FILE_FMT = '<idQd'    # File format for serialization
```

### Set Operations

Both BloomFilter and ScalableBloomFilter support set operations for combining filters with identical parameters.

```python
# Union operations (combine two filters)
union_filter = bloom1.union(bloom2)
union_filter = bloom1 | bloom2

# Intersection operations (BloomFilter only)
intersection_filter = bloom1.intersection(bloom2)
intersection_filter = bloom1 & bloom2

# Copy operations (BloomFilter only)
copied_filter = bloom1.copy()
```

### Serialization

Both filter types support file serialization and Python pickling for persistence.

```python
# File serialization
with open('filter.dat', 'wb') as f:
    bloom_filter.tofile(f)

with open('filter.dat', 'rb') as f:
    loaded_filter = BloomFilter.fromfile(f)

# Pickle support (automatic via __getstate__/__setstate__)
import pickle

pickled_data = pickle.dumps(bloom_filter)
loaded_filter = pickle.loads(pickled_data)
```

## Types

```python { .api }
import io
import typing
from typing import Hashable, Tuple, Union

# Type aliases for documentation
HashableKey = Union[str, int, float, bytes, Tuple[Hashable, ...]]
FileObject = Union[io.IOBase, typing.BinaryIO]
```

## Error Handling

The package raises these exceptions:

- **ValueError**: Invalid error_rate (BloomFilter: not between 0 and 1; ScalableBloomFilter: ≤ 0), invalid capacity (≤ 0), or mismatched filter parameters for set operations
- **IndexError**: Attempting to add items beyond BloomFilter capacity
- **ImportError**: Missing required bitarray dependency

## Usage Examples

### Basic Membership Testing

```python
from pybloom_live import BloomFilter

# Create filter for expected 10,000 items with 0.1% false positive rate
bf = BloomFilter(capacity=10000, error_rate=0.001)

# Add items
for i in range(5000):
    bf.add(f"user_{i}")

# Test membership
print("user_100" in bf)   # True
print("user_9999" in bf)  # False
print(f"Filter contains {len(bf)} items")  # Filter contains 5000 items
```

### Scalable Filter for Unknown Set Size

```python
from pybloom_live import ScalableBloomFilter

# Create scalable filter that grows automatically
sbf = ScalableBloomFilter(
    initial_capacity=1000,
    error_rate=0.001,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

# Add a large number of items - the filter grows automatically
for i in range(100000):
    sbf.add(f"item_{i}")

print(f"Total capacity: {sbf.capacity}")
print(f"Total items: {len(sbf)}")
print(f"Number of internal filters: {len(sbf.filters)}")
```

### Combining Filters with Set Operations

```python
from pybloom_live import BloomFilter

# Create two filters with identical parameters
bf1 = BloomFilter(capacity=1000, error_rate=0.001)
bf2 = BloomFilter(capacity=1000, error_rate=0.001)

# Add different items to each
for i in range(500):
    bf1.add(f"set1_item_{i}")
    bf2.add(f"set2_item_{i}")

# Union contains items from both filters
union_bf = bf1 | bf2

# Check items exist in union
print("set1_item_100" in union_bf)  # True
print("set2_item_100" in union_bf)  # True

# Intersection contains items present in both (none in this case)
intersection_bf = bf1 & bf2
print(len(intersection_bf))  # 0 (no common items)
```