
# Pybloom Live

A Python implementation of Bloom filter probabilistic data structures, forked from pybloom with an improved tightening ratio of 0.9 for better average space usage across wide ranges of growth. Provides both standard Bloom Filters for known set sizes and Scalable Bloom Filters that can grow dynamically without knowing the original set size.

## Package Information

- **Package Name**: pybloom-live
- **Language**: Python
- **Installation**: `pip install pybloom-live`
- **Dependencies**: `bitarray>=0.3.4`

## Core Imports

```python
import pybloom_live
```

Standard pattern for accessing classes:

```python
from pybloom_live import BloomFilter, ScalableBloomFilter
```

## Basic Usage

```python
from pybloom_live import BloomFilter, ScalableBloomFilter

# Create a standard bloom filter for a known set size
bf = BloomFilter(capacity=1000, error_rate=0.001)

# Add items to the filter
bf.add("item1")
bf.add("item2")
bf.add("item3")

# Test membership (may have false positives, never false negatives)
print("item1" in bf)  # True
print("item4" in bf)  # False (or rarely True - a false positive)

# Create a scalable bloom filter for an unknown set size
sbf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001)
for i in range(10000):
    sbf.add(f"item{i}")

print("item5000" in sbf)  # True
print(len(sbf))           # 10000
```

## Architecture

Pybloom-live implements two main probabilistic data structures:

- **BloomFilter**: Space-efficient probabilistic data structure for fixed-size sets with a configurable false positive rate
- **ScalableBloomFilter**: Dynamic bloom filter that grows automatically by creating internal BloomFilter instances with exponentially increasing capacity

Both implementations choose a hash function (MD5, SHA1, SHA256, SHA384, or SHA512) based on the number of hash bits required, and support set operations (union, intersection), serialization to files, and pickling for persistence.
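The sizing logic behind this can be illustrated with a small, standard-library-only sketch. This is not pybloom-live's internal code, just the classic Bloom filter arithmetic it is based on: derive the number of hash slices and bits per slice from `capacity` and `error_rate`, then carve a cryptographic digest into per-slice bit positions.

```python
import hashlib
import math

def bloom_parameters(capacity: int, error_rate: float):
    # k = ceil(log2(1/p)) hash slices; bits split evenly across slices
    num_slices = math.ceil(math.log2(1.0 / error_rate))
    bits_per_slice = math.ceil(
        capacity * abs(math.log(error_rate)) /
        (num_slices * (math.log(2) ** 2)))
    return num_slices, bits_per_slice

def bit_positions(key: str, num_slices: int, bits_per_slice: int):
    # Carve one SHA256 digest into num_slices chunks and reduce each
    # chunk modulo the slice size to get that slice's bit index
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    chunk = len(digest) // num_slices
    return [int.from_bytes(digest[i * chunk:(i + 1) * chunk], "big") % bits_per_slice
            for i in range(num_slices)]

k, m = bloom_parameters(1000, 0.001)
print(k, m)  # 10 1438
print(bit_positions("item1", k, m))
```

A lower `error_rate` therefore costs more hash slices and more bits, which is why the filter's memory footprint is fixed at construction time.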

## Capabilities

### Standard Bloom Filter

Fixed-capacity probabilistic data structure for set membership testing with configurable false positive probability. Ideal when the maximum set size is known in advance.

```python { .api }
class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.001):
        """
        Initialize bloom filter with fixed capacity.

        Parameters:
        - capacity: Maximum number of elements before degradation
        - error_rate: Target false positive probability (0 < error_rate < 1)

        Raises:
        - ValueError: If error_rate not between 0 and 1, or capacity <= 0
        """

    def add(self, key, skip_check: bool = False) -> bool:
        """
        Add a key to the bloom filter.

        Parameters:
        - key: Item to add (string or any hashable object)
        - skip_check: Skip membership check for performance

        Returns:
        - bool: True if key was already in filter, False if newly added

        Raises:
        - IndexError: If filter is at capacity
        """

    def __contains__(self, key) -> bool:
        """Test key membership in filter (supports 'in' operator)."""

    def __len__(self) -> int:
        """Return number of keys stored in filter."""

    def copy(self) -> 'BloomFilter':
        """Return a copy of this bloom filter."""

    def union(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate union with another filter.

        Parameters:
        - other: Another BloomFilter with same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing union of both filters

        Raises:
        - ValueError: If filters have different capacity or error_rate
        """

    def intersection(self, other: 'BloomFilter') -> 'BloomFilter':
        """
        Calculate intersection with another filter.

        Parameters:
        - other: Another BloomFilter with same capacity and error_rate

        Returns:
        - BloomFilter: New filter containing intersection of both filters

        Raises:
        - ValueError: If filters have different capacity or error_rate
        """

    def __or__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Union operator overload (| operator)."""

    def __and__(self, other: 'BloomFilter') -> 'BloomFilter':
        """Intersection operator overload (& operator)."""

    def tofile(self, f):
        """
        Write bloom filter to file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f, n: int = -1) -> 'BloomFilter':
        """
        Read bloom filter from file object.

        Parameters:
        - f: File-like object for reading binary data
        - n: Maximum bytes to read (-1 for all)

        Returns:
        - BloomFilter: Deserialized bloom filter

        Raises:
        - ValueError: If n too small or bit length mismatch
        """
```

#### Properties and Attributes

```python { .api }
# Instance attributes (read-only after initialization)
bf.error_rate: float       # False positive error rate
bf.num_slices: int         # Number of hash functions
bf.bits_per_slice: int     # Bits per hash function
bf.capacity: int           # Maximum elements before degradation
bf.num_bits: int           # Total bits in filter
bf.count: int              # Current number of elements
bf.bitarray                # Underlying bitarray storage

# Class attributes
BloomFilter.FILE_FMT = b'<dQQQQ'  # File format for serialization
```

### Scalable Bloom Filter

Dynamic bloom filter that grows automatically without knowing the original set size, maintaining steady false positive rate by creating internal BloomFilter instances with exponentially increasing capacity.
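The growth scheme can be sketched in a few lines of plain Python. This is an illustration of the documented defaults (growth factor 4, tightening ratio 0.9), not the package's internal code: each new internal filter multiplies the previous capacity by the growth mode and tightens its error rate by the ratio, so the overall false positive rate stays bounded as the filter scales.

```python
# Assumed parameters mirroring the documented defaults
initial_capacity = 100
error_rate = 0.001
mode = 4     # LARGE_SET_GROWTH
ratio = 0.9  # tightening ratio

# Capacity and error rate of each successive internal filter
filters = [(initial_capacity * mode ** i, error_rate * ratio ** i)
           for i in range(4)]

for cap, err in filters:
    print(f"capacity={cap:>6}  error_rate={err:.6f}")
```

Because each filter's error budget shrinks geometrically, the sum of the per-filter error rates converges, which is what keeps the compound false positive probability near the requested target.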

```python { .api }
class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 100, error_rate: float = 0.001,
                 mode: int = 4):
        """
        Initialize scalable bloom filter.

        Parameters:
        - initial_capacity: Initial capacity of first internal filter
        - error_rate: Target false positive probability
        - mode: Growth mode (SMALL_SET_GROWTH=2 or LARGE_SET_GROWTH=4)
        """

    def add(self, key) -> bool:
        """
        Add key to scalable filter.

        Parameters:
        - key: Item to add (string or any hashable object)

        Returns:
        - bool: True if key was already in filter, False if newly added
        """

    def __contains__(self, key) -> bool:
        """Test key membership in filter (supports 'in' operator)."""

    def __len__(self) -> int:
        """Return total number of elements stored across all internal filters."""

    def union(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """
        Calculate union with another scalable filter.

        Parameters:
        - other: Another ScalableBloomFilter with same parameters

        Returns:
        - ScalableBloomFilter: New filter containing union

        Raises:
        - ValueError: If filters have different mode, initial_capacity, or error_rate
        """

    def __or__(self, other: 'ScalableBloomFilter') -> 'ScalableBloomFilter':
        """Union operator overload (| operator)."""

    def tofile(self, f):
        """
        Serialize scalable bloom filter to file object.

        Parameters:
        - f: File-like object for writing binary data
        """

    @classmethod
    def fromfile(cls, f) -> 'ScalableBloomFilter':
        """
        Deserialize scalable bloom filter from file object.

        Parameters:
        - f: File-like object for reading binary data

        Returns:
        - ScalableBloomFilter: Deserialized scalable filter
        """
```

#### Properties and Constants

```python { .api }
# Properties (computed from internal filters)
@property
def capacity(self) -> int:
    """Total capacity for all internal filters."""

@property
def count(self) -> int:
    """Total number of elements (alias for __len__)."""

# Instance attributes
sbf.scale: int             # Growth mode constant
sbf.ratio: float           # Tightening ratio (0.9)
sbf.initial_capacity: int  # Initial capacity
sbf.error_rate: float      # Target error rate
sbf.filters: list          # List of internal BloomFilter objects

# Class constants
ScalableBloomFilter.SMALL_SET_GROWTH = 2  # Slower growth, less memory
ScalableBloomFilter.LARGE_SET_GROWTH = 4  # Faster growth, more memory (default)
ScalableBloomFilter.FILE_FMT = '<idQd'    # File format for serialization
```

### Set Operations

Both BloomFilter and ScalableBloomFilter support set operations for combining filters with identical parameters.

```python
# Union operations (combine two filters)
union_filter = bloom1.union(bloom2)
union_filter = bloom1 | bloom2

# Intersection operations (BloomFilter only)
intersection_filter = bloom1.intersection(bloom2)
intersection_filter = bloom1 & bloom2

# Copy operations (BloomFilter only)
copied_filter = bloom1.copy()
```

### Serialization

Both filter types support file serialization and Python pickling for persistence.

```python
# File serialization
with open('filter.dat', 'wb') as f:
    bloom_filter.tofile(f)

with open('filter.dat', 'rb') as f:
    loaded_filter = BloomFilter.fromfile(f)

# Pickle support (automatic via __getstate__/__setstate__)
import pickle

pickled_data = pickle.dumps(bloom_filter)
loaded_filter = pickle.loads(pickled_data)
```

## Types

```python { .api }
import io
import typing
from typing import Hashable, Tuple, Union

# Type aliases for documentation
HashableKey = Union[str, int, float, bytes, Tuple[Hashable, ...]]
FileObject = Union[io.IOBase, typing.BinaryIO]
```

## Error Handling

The package raises these exceptions:

- **ValueError**: Invalid error_rate (BloomFilter: not between 0 and 1; ScalableBloomFilter: ≤ 0), invalid capacity (≤ 0), or mismatched filter parameters for set operations
- **IndexError**: Attempting to add items beyond BloomFilter capacity
- **ImportError**: Missing required bitarray dependency

## Usage Examples

### Basic Membership Testing

```python
from pybloom_live import BloomFilter

# Create filter for expected 10,000 items with 0.1% false positive rate
bf = BloomFilter(capacity=10000, error_rate=0.001)

# Add items
for i in range(5000):
    bf.add(f"user_{i}")

# Test membership
print("user_100" in bf)   # True
print("user_9999" in bf)  # False
print(f"Filter contains {len(bf)} items")  # Filter contains 5000 items
```

### Scalable Filter for Unknown Set Size

```python
from pybloom_live import ScalableBloomFilter

# Create scalable filter that grows automatically
sbf = ScalableBloomFilter(
    initial_capacity=1000,
    error_rate=0.001,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

# Add a large number of items - the filter grows automatically
for i in range(100000):
    sbf.add(f"item_{i}")

print(f"Total capacity: {sbf.capacity}")
print(f"Total items: {len(sbf)}")
print(f"Number of internal filters: {len(sbf.filters)}")
```

### Combining Filters with Set Operations

```python
from pybloom_live import BloomFilter

# Create two filters with identical parameters
bf1 = BloomFilter(capacity=1000, error_rate=0.001)
bf2 = BloomFilter(capacity=1000, error_rate=0.001)

# Add different items to each
for i in range(500):
    bf1.add(f"set1_item_{i}")
    bf2.add(f"set2_item_{i}")

# Union contains items from both filters
union_bf = bf1 | bf2

# Check items exist in union
print("set1_item_100" in union_bf)  # True
print("set2_item_100" in union_bf)  # True

# Intersection contains items present in both (none in this case)
intersection_bf = bf1 & bf2
print(len(intersection_bf))  # 0 (no common items)
```