
# Streaming Hashers

hashlib-compatible hasher classes for incremental hashing of large datasets and streaming operations. These classes let you feed data in chunks, making them ideal for processing large files, network streams, or situations where the full input is not available at once.

## Capabilities

### Base Hasher Interface

Abstract base class defining the common interface for all streaming hashers.

```python { .api }
class Hasher:
    """Base class for streaming MurmurHash3 hashers."""

    def __init__(self, seed: int = 0) -> None:
        """
        Initialize hasher with optional seed.

        Args:
            seed: Seed value for hash computation (default: 0)
        """

    def update(self, input: Hashable) -> None:
        """
        Update hasher with new data.

        Args:
            input: Data to add to hash (bytes, bytearray, memoryview, or array-like)

        Raises:
            TypeError: If input is a string (strings must be encoded first)
        """

    def digest(self) -> bytes:
        """
        Get hash digest as bytes.

        Returns:
            Hash digest as bytes
        """

    def sintdigest(self) -> int:
        """
        Get hash as a signed integer.

        Returns:
            Hash value as a signed integer
        """

    def uintdigest(self) -> int:
        """
        Get hash as an unsigned integer.

        Returns:
            Hash value as an unsigned integer
        """

    def copy(self) -> "Hasher":
        """
        Create a copy of the hasher's current state.

        Returns:
            New hasher instance with identical state
        """

    @property
    def digest_size(self) -> int:
        """
        Get digest size in bytes.

        Returns:
            Number of bytes in the digest output
        """

    @property
    def block_size(self) -> int:
        """
        Get internal block size in bytes.

        Returns:
            Number of bytes processed in each internal block
        """

    @property
    def name(self) -> str:
        """
        Get hasher algorithm name.

        Returns:
            String identifying the hash algorithm
        """
```

### 32-bit Hasher

Streaming hasher for 32-bit MurmurHash3 computation.

```python { .api }
class mmh3_32(Hasher):
    """32-bit MurmurHash3 streaming hasher."""
```

Properties:

- `digest_size`: 4 bytes
- `block_size`: 12 bytes
- `name`: "mmh3_32"

Example usage:

```python
import mmh3

# Basic streaming hashing
hasher = mmh3.mmh3_32()
hasher.update(b"foo")
hasher.update(b"bar")

# Get results in different formats
digest = hasher.digest()            # b'\x8d\x8f\xe7\xfd'
signed_int = hasher.sintdigest()    # -156908512
unsigned_int = hasher.uintdigest()  # 4138058784

# With custom seed
hasher = mmh3.mmh3_32(seed=42)
hasher.update(b"Hello, world!")
result = hasher.sintdigest()        # Hash with seed

# Copy hasher state
hasher1 = mmh3.mmh3_32()
hasher1.update(b"partial data")
hasher2 = hasher1.copy()            # Copy current state
hasher2.update(b" more data")       # Continue from the copy
```

### 128-bit x64 Hasher

Streaming hasher for 128-bit MurmurHash3 optimized for x64 architectures.

```python { .api }
class mmh3_x64_128(Hasher):
    """128-bit MurmurHash3 streaming hasher optimized for x64 architectures."""

    def stupledigest(self) -> tuple[int, int]:
        """
        Get hash as a tuple of two signed 64-bit integers.

        Returns:
            Tuple of two signed 64-bit integers representing the 128-bit hash
        """

    def utupledigest(self) -> tuple[int, int]:
        """
        Get hash as a tuple of two unsigned 64-bit integers.

        Returns:
            Tuple of two unsigned 64-bit integers representing the 128-bit hash
        """
```

Properties:

- `digest_size`: 16 bytes
- `block_size`: 32 bytes
- `name`: "mmh3_x64_128"

Example usage:

```python
import mmh3

# 128-bit streaming hashing (x64 optimized)
hasher = mmh3.mmh3_x64_128(seed=42)
hasher.update(b"foo")
hasher.update(b"bar")

# Get results in various formats
digest = hasher.digest()            # 16-byte digest
signed_int = hasher.sintdigest()    # 128-bit signed integer
unsigned_int = hasher.uintdigest()  # 128-bit unsigned integer

# Get as tuple of 64-bit integers
signed_tuple = hasher.stupledigest()    # (7689522670935629698, -159584473158936081)
unsigned_tuple = hasher.utupledigest()  # (7689522670935629698, 18287159600550615535)

# Process large streaming data
hasher = mmh3.mmh3_x64_128()
for chunk in data_stream:
    hasher.update(chunk)
final_hash = hasher.digest()
```

### 128-bit x86 Hasher

Streaming hasher for 128-bit MurmurHash3 optimized for x86 architectures.

```python { .api }
class mmh3_x86_128(Hasher):
    """128-bit MurmurHash3 streaming hasher optimized for x86 architectures."""

    def stupledigest(self) -> tuple[int, int]:
        """
        Get hash as a tuple of two signed 64-bit integers.

        Returns:
            Tuple of two signed 64-bit integers representing the 128-bit hash
        """

    def utupledigest(self) -> tuple[int, int]:
        """
        Get hash as a tuple of two unsigned 64-bit integers.

        Returns:
            Tuple of two unsigned 64-bit integers representing the 128-bit hash
        """
```

Properties:

- `digest_size`: 16 bytes
- `block_size`: 32 bytes
- `name`: "mmh3_x86_128"

Example usage:

```python
import mmh3

# 128-bit streaming hashing (x86 optimized)
hasher = mmh3.mmh3_x86_128(seed=123)
hasher.update(b"data chunk 1")
hasher.update(b"data chunk 2")

# Get results (same interface as the x64 version)
digest = hasher.digest()
signed_tuple = hasher.stupledigest()
unsigned_tuple = hasher.utupledigest()

# Architecture-specific optimization:
# use mmh3_x86_128 on 32-bit systems and mmh3_x64_128 on
# 64-bit systems for the best performance on each.
```

## Hasher Usage Patterns

### Incremental File Hashing

```python
import mmh3

def hash_large_file(filename):
    hasher = mmh3.mmh3_x64_128()
    with open(filename, 'rb') as f:
        while chunk := f.read(8192):  # 8 KB chunks
            hasher.update(chunk)
    return hasher.digest()

file_hash = hash_large_file('large_dataset.bin')
```

### Stream Processing

```python
import mmh3

def process_data_stream(data_stream):
    hasher = mmh3.mmh3_32(seed=42)
    for data_chunk in data_stream:
        # Process data
        processed = process_chunk(data_chunk)
        # Update hash incrementally
        hasher.update(processed)
    return hasher.uintdigest()
```

### State Copying for Parallel Processing

```python
import mmh3

# Process the common prefix once
base_hasher = mmh3.mmh3_x64_128()
base_hasher.update(b"common prefix data")

# Branch processing from the common state
hasher1 = base_hasher.copy()
hasher1.update(b"branch 1 data")
result1 = hasher1.digest()

hasher2 = base_hasher.copy()
hasher2.update(b"branch 2 data")
result2 = hasher2.digest()
```

## Error Handling and Constraints

### Input Type Restrictions

Hashers only accept binary data types:

- Accepted: `bytes`, `bytearray`, `memoryview`, and array-like objects
- Rejected: `str` (strings must be encoded first)

```python
import mmh3

hasher = mmh3.mmh3_32()

# Correct usage
hasher.update(b"binary data")
hasher.update("text data".encode('utf-8'))  # Encode strings first

# This will raise TypeError
# hasher.update("raw string")  # TypeError: Strings must be encoded before hashing
```

### Thread Safety

Hasher instances are fully independent of one another, so concurrent code can safely use a separate hasher per thread. An individual hasher object, however, should not be shared between threads without proper synchronization.
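The safe pattern is one hasher per thread. Because these hashers are hashlib-compatible, the sketch below uses the standard library's `hashlib.sha256` as a stand-in so it runs without mmh3 installed; `mmh3.mmh3_32()` or `mmh3.mmh3_x64_128()` can be dropped in the same way.

```python
import hashlib
import threading

# Hash several shards concurrently, one independent hasher per thread.
# hashlib.sha256 stands in for an mmh3 hasher; the interface is the same.
def hash_shard(shard: bytes, out: list, i: int) -> None:
    h = hashlib.sha256()  # created and used only inside this thread
    h.update(shard)
    out[i] = h.digest()

shards = [b"shard-0", b"shard-1", b"shard-2"]
digests = [b""] * len(shards)
threads = [
    threading.Thread(target=hash_shard, args=(s, digests, i))
    for i, s in enumerate(shards)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No locking was needed: each thread owned its hasher outright.
```

Sharing one hasher across threads would interleave `update()` calls nondeterministically; giving each thread its own instance avoids both the race and the lock.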

### Performance Considerations

- **x64 vs x86**: Choose the hasher that matches your target architecture
- **Memory efficiency**: Hashers maintain minimal internal state
- **Chunk size**: Chunks of roughly 4 KB to 64 KB typically work well for file processing
- **State copying**: `copy()` is lightweight and adds minimal overhead

### Hasher Lifecycle

1. **Initialize**: Create a hasher with an optional seed
2. **Update**: Add data incrementally with `update()`
3. **Finalize**: Get results with `digest()`, `sintdigest()`, or `uintdigest()`
4. **Copy**: Branch processing with `copy()` if needed
5. **Reuse**: Hashers can continue to be updated after results are read
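The whole lifecycle can be walked through in a few lines. Since these hashers follow the hashlib interface, this sketch uses `hashlib.sha256` as a stand-in so it runs without mmh3 installed; an mmh3 hasher behaves the same way (step 1 would additionally accept a `seed` argument).

```python
import hashlib

# 1. Initialize (an mmh3 hasher would also take an optional seed here)
h = hashlib.sha256()

# 2. Update incrementally
h.update(b"first chunk")

# 3. Finalize: digest() reads the current state without consuming it
first = h.digest()

# 4. Copy: branch from the current state
branch = h.copy()

# 5. Reuse: keep updating the original after reading a digest
h.update(b"second chunk")
assert h.digest() != first       # the state advanced past the first digest
assert branch.digest() == first  # the copy froze the earlier state
```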