# Streaming Hashers

hashlib-compatible hasher classes for incremental hashing of large datasets and streaming operations. These classes allow you to hash data incrementally, making them ideal for processing large files, network streams, or situations where you don't have all data available at once.

## Capabilities

### Base Hasher Interface

Abstract base class defining the common interface for all streaming hashers.

```python { .api }
class Hasher:
    """Base class for streaming MurmurHash3 hashers."""

    def __init__(self, seed: int = 0) -> None:
        """
        Initialize hasher with optional seed.

        Args:
            seed: Seed value for hash computation (default: 0)
        """

    def update(self, input: Hashable) -> None:
        """
        Update hasher with new data.

        Args:
            input: Data to add to hash (bytes, bytearray, memoryview, or array-like)

        Raises:
            TypeError: If input is a string (strings must be encoded first)
        """

    def digest(self) -> bytes:
        """
        Get hash digest as bytes.

        Returns:
            Hash digest as bytes
        """

    def sintdigest(self) -> int:
        """
        Get hash as signed integer.

        Returns:
            Hash value as signed integer
        """

    def uintdigest(self) -> int:
        """
        Get hash as unsigned integer.

        Returns:
            Hash value as unsigned integer
        """

    def copy(self) -> "Hasher":
        """
        Create a copy of the hasher's current state.

        Returns:
            New hasher instance with identical state
        """

    @property
    def digest_size(self) -> int:
        """
        Get digest size in bytes.

        Returns:
            Number of bytes in digest output
        """

    @property
    def block_size(self) -> int:
        """
        Get internal block size in bytes.

        Returns:
            Number of bytes processed in each internal block
        """

    @property
    def name(self) -> str:
        """
        Get hasher algorithm name.

        Returns:
            String identifying the hash algorithm
        """
```
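
Because this interface mirrors hashlib, the same call pattern works with any hashlib hasher. A quick sketch of the shared protocol using the standard library's `sha256` (used here purely for illustration; the mmh3 hashers expose the same methods and properties):

```python
import hashlib

# hashlib hashers follow the same protocol the Hasher base class defines:
# update(), digest(), copy(), plus digest_size, block_size, and name.
h = hashlib.sha256()
h.update(b"foo")
h.update(b"bar")            # incremental updates accumulate state

snapshot = h.copy()         # independent copy of the current state
assert snapshot.digest() == h.digest()

print(h.name)               # "sha256"
print(h.digest_size)        # 32 bytes in the digest
print(len(h.digest()))      # matches digest_size
```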

### 32-bit Hasher

Streaming hasher for 32-bit MurmurHash3 computation.

```python { .api }
class mmh3_32(Hasher):
    """32-bit MurmurHash3 streaming hasher."""
```

Properties:

- `digest_size`: 4 bytes
- `block_size`: 12 bytes
- `name`: "mmh3_32"

Example usage:

```python
import mmh3

# Basic streaming hashing
hasher = mmh3.mmh3_32()
hasher.update(b"foo")
hasher.update(b"bar")

# Get results in different formats
digest = hasher.digest()            # b'\x8d\x8f\xe7\xfd'
signed_int = hasher.sintdigest()    # -156908512
unsigned_int = hasher.uintdigest()  # 4138058784

# With custom seed
hasher = mmh3.mmh3_32(seed=42)
hasher.update(b"Hello, world!")
result = hasher.sintdigest()

# Copy hasher state
hasher1 = mmh3.mmh3_32()
hasher1.update(b"partial data")
hasher2 = hasher1.copy()        # Copy current state
hasher2.update(b" more data")   # Continue from the copied state
```

### 128-bit x64 Hasher

Streaming hasher for 128-bit MurmurHash3 optimized for x64 architectures.

```python { .api }
class mmh3_x64_128(Hasher):
    """128-bit MurmurHash3 streaming hasher optimized for x64 architectures."""

    def stupledigest(self) -> tuple[int, int]:
        """
        Get hash as tuple of two signed 64-bit integers.

        Returns:
            Tuple of two signed 64-bit integers representing the 128-bit hash
        """

    def utupledigest(self) -> tuple[int, int]:
        """
        Get hash as tuple of two unsigned 64-bit integers.

        Returns:
            Tuple of two unsigned 64-bit integers representing the 128-bit hash
        """
```

Properties:

- `digest_size`: 16 bytes
- `block_size`: 32 bytes
- `name`: "mmh3_x64_128"

Example usage:

```python
import mmh3

# 128-bit streaming hashing (x64 optimized)
hasher = mmh3.mmh3_x64_128(seed=42)
hasher.update(b"foo")
hasher.update(b"bar")

# Get results in various formats
digest = hasher.digest()            # 16-byte digest
signed_int = hasher.sintdigest()    # 128-bit signed integer
unsigned_int = hasher.uintdigest()  # 128-bit unsigned integer

# Get as tuples of 64-bit integers
signed_tuple = hasher.stupledigest()    # (7689522670935629698, -159584473158936081)
unsigned_tuple = hasher.utupledigest()  # (7689522670935629698, 18287159600550615535)

# Process large streaming data (data_stream: any iterable of bytes chunks)
hasher = mmh3.mmh3_x64_128()
for chunk in data_stream:
    hasher.update(chunk)
final_hash = hasher.digest()
```

### 128-bit x86 Hasher

Streaming hasher for 128-bit MurmurHash3 optimized for x86 architectures.

```python { .api }
class mmh3_x86_128(Hasher):
    """128-bit MurmurHash3 streaming hasher optimized for x86 architectures."""

    def stupledigest(self) -> tuple[int, int]:
        """
        Get hash as tuple of two signed 64-bit integers.

        Returns:
            Tuple of two signed 64-bit integers representing the 128-bit hash
        """

    def utupledigest(self) -> tuple[int, int]:
        """
        Get hash as tuple of two unsigned 64-bit integers.

        Returns:
            Tuple of two unsigned 64-bit integers representing the 128-bit hash
        """
```

Properties:

- `digest_size`: 16 bytes
- `block_size`: 32 bytes
- `name`: "mmh3_x86_128"

Example usage:

```python
import mmh3

# 128-bit streaming hashing (x86 optimized)
hasher = mmh3.mmh3_x86_128(seed=123)
hasher.update(b"data chunk 1")
hasher.update(b"data chunk 2")

# Get results (same interface as the x64 version)
digest = hasher.digest()
signed_tuple = hasher.stupledigest()
unsigned_tuple = hasher.utupledigest()

# Architecture-specific optimization:
# use mmh3_x86_128 on 32-bit systems and mmh3_x64_128 on 64-bit systems
# for the best performance.
```

## Hasher Usage Patterns

### Incremental File Hashing

```python
import mmh3

def hash_large_file(filename):
    hasher = mmh3.mmh3_x64_128()
    with open(filename, 'rb') as f:
        while chunk := f.read(8192):  # 8 KB chunks
            hasher.update(chunk)
    return hasher.digest()

file_hash = hash_large_file('large_dataset.bin')
```
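
The same chunked-read pattern works with any hashlib-style hasher. The self-contained sketch below uses the standard library's `sha256` and a temporary file (the sample data and chunk size are arbitrary demo choices) to verify that chunked hashing matches hashing the whole file at once:

```python
import hashlib
import os
import tempfile

# Write some sample data to a temporary file for the demo.
data = b"x" * 100_000
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

def hash_file_chunked(filename, chunk_size=8192):
    hasher = hashlib.sha256()
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.digest()

# Chunked hashing is equivalent to hashing all the data at once.
assert hash_file_chunked(path) == hashlib.sha256(data).digest()
os.remove(path)
```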

### Stream Processing

```python
import mmh3

def process_data_stream(data_stream):
    hasher = mmh3.mmh3_32(seed=42)
    for data_chunk in data_stream:
        # Transform the chunk (process_chunk is a placeholder for your logic)
        processed = process_chunk(data_chunk)
        # Update hash incrementally
        hasher.update(processed)
    return hasher.uintdigest()
```

### State Copying for Parallel Processing

```python
import mmh3

# Process the common prefix once
base_hasher = mmh3.mmh3_x64_128()
base_hasher.update(b"common prefix data")

# Branch processing from the shared state
hasher1 = base_hasher.copy()
hasher1.update(b"branch 1 data")
result1 = hasher1.digest()

hasher2 = base_hasher.copy()
hasher2.update(b"branch 2 data")
result2 = hasher2.digest()
```
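
The invariant `copy()` guarantees is that a branched hasher produces the same digest as hashing the concatenated data from scratch. A runnable check of that invariant using the standard library's `sha256`, whose copy semantics the mmh3 hashers mirror:

```python
import hashlib

prefix = b"common prefix data"
branch1 = b"branch 1 data"
branch2 = b"branch 2 data"

base = hashlib.sha256()
base.update(prefix)        # hash the shared prefix once

h1 = base.copy()           # branch without re-hashing the prefix
h1.update(branch1)
h2 = base.copy()
h2.update(branch2)

# Each branch matches a from-scratch hash of prefix + branch data.
assert h1.digest() == hashlib.sha256(prefix + branch1).digest()
assert h2.digest() == hashlib.sha256(prefix + branch2).digest()
```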

## Error Handling and Constraints

### Input Type Restrictions

Hashers only accept binary data types:

- ✅ `bytes`, `bytearray`, `memoryview`, array-like objects
- ❌ `str` - strings must be encoded first

```python
import mmh3

hasher = mmh3.mmh3_32()

# Correct usage
hasher.update(b"binary data")
hasher.update("text data".encode('utf-8'))  # Encode strings first

# This will raise TypeError
# hasher.update("raw string")  # TypeError: Strings must be encoded before hashing
```
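
The same restriction applies across the hashlib ecosystem, so the behavior can be demonstrated with the standard library's `sha256`: binary buffer types are accepted, while raw strings raise `TypeError`:

```python
import hashlib

h = hashlib.sha256()
h.update(b"bytes ok")
h.update(bytearray(b"bytearray ok"))
h.update(memoryview(b"memoryview ok"))  # zero-copy buffer view is fine

try:
    h.update("raw string")              # not encoded: rejected
except TypeError as e:
    print("rejected:", e)
```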

### Thread Safety

Hasher instances are independent, so different threads can safely use different instances. A single hasher object, however, is not safe for concurrent use: do not share one instance between threads without external synchronization.
312
313
### Performance Considerations
314
315
- **x64 vs x86**: Choose the appropriate hasher for your target architecture
316
- **Memory efficiency**: Hashers maintain minimal internal state
317
- **Chunk size**: Optimal chunk sizes are typically 4KB-64KB for file processing
318
- **State copying**: `copy()` is lightweight and creates minimal overhead
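
Chunk size affects throughput but never the result: any chunking of the same byte stream yields an identical digest. A small check of that invariant using the standard library's `sha256` as a stand-in (timings omitted; measure your own workload with `timeit`):

```python
import hashlib

data = b"payload " * 50_000  # ~400 KB of sample data

def digest_with_chunks(data, chunk_size):
    h = hashlib.sha256()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.digest()

# 4 KB, 8 KB, and 64 KB chunking all agree with the one-shot digest.
one_shot = hashlib.sha256(data).digest()
assert digest_with_chunks(data, 4096) == one_shot
assert digest_with_chunks(data, 8192) == one_shot
assert digest_with_chunks(data, 65536) == one_shot
```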

### Hasher Lifecycle

1. **Initialize**: Create a hasher with an optional seed
2. **Update**: Add data incrementally with `update()`
3. **Finalize**: Get results with `digest()`, `sintdigest()`, or `uintdigest()`
4. **Copy**: Branch processing with `copy()` if needed
5. **Reuse**: Hashers can keep accepting updates after a digest is read
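
Step 5 deserves a demonstration: reading a digest does not finalize or reset the hasher, so updates can continue afterwards. A sketch with the standard library's `sha256`, which follows the same lifecycle:

```python
import hashlib

h = hashlib.sha256()
h.update(b"first part")
d1 = h.digest()                # intermediate digest; state is untouched

h.update(b", second part")     # keep feeding data after reading a digest
d2 = h.digest()

# d2 reflects all data seen so far, as if hashed in one shot.
assert d2 == hashlib.sha256(b"first part, second part").digest()
assert d1 != d2
```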