# High-Performance Writing

Low-level writer classes for streaming large datasets to HDF5 format with optimal memory usage, parallel writing support, and specialized column writers for different data types.

## Capabilities

### Main Writer Class

The primary interface for high-performance HDF5 writing with memory mapping and parallel processing support.

```python { .api }
class Writer:
    """
    High-level HDF5 writer optimized for large DataFrame export.

    Provides streaming write capabilities using memory mapping for optimal
    performance and supports parallel column writing.
    """
    def __init__(self, path, group="/table", mode="w", byteorder="="):
        """
        Initialize HDF5 writer.

        Parameters:
        - path: Output file path
        - group: HDF5 group path for table data (default: "/table")
        - mode: File open mode ("w" for write, "a" for append)
        - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
        """

    def layout(self, df, progress=None):
        """
        Set up file layout and allocate space for DataFrame.

        This must be called before write(). Analyzes the DataFrame schema,
        calculates storage requirements, and pre-allocates HDF5 datasets.

        Parameters:
        - df: DataFrame to analyze and prepare for writing
        - progress: Progress callback for layout operations

        Raises:
        - AssertionError: If layout() is called twice
        - ValueError: If the DataFrame is empty
        """

    def write(self, df, chunk_size=100000, parallel=True, progress=None,
              column_count=1, export_threads=0):
        """
        Write DataFrame data to the HDF5 file.

        Streams data in chunks using memory mapping for optimal performance.

        Parameters:
        - df: DataFrame to write (must match the DataFrame passed to layout())
        - chunk_size: Number of rows to process per chunk (rounded to a multiple of 8)
        - parallel: Enable parallel processing within vaex
        - progress: Progress callback for write operations
        - column_count: Number of columns to process simultaneously
        - export_threads: Number of threads for column writing (0 for single-threaded)

        Raises:
        - AssertionError: If layout() was not called first
        - ValueError: If the DataFrame is empty
        """

    def close(self):
        """
        Close the writer and clean up resources.

        Closes memory maps, file handles, and the HDF5 file.
        Must be called when finished writing.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, *args):
        """Context manager exit - automatically calls close()."""
```

### Column Writers

Specialized writers for different data types with optimized storage strategies.

#### Dictionary-Encoded Column Writer

```python { .api }
class ColumnWriterDictionaryEncoded:
    """
    Writer for dictionary-encoded (categorical) columns.

    Stores unique values in a dictionary and indices separately
    for efficient storage of categorical data.
    """
    def __init__(self, h5parent, name, dtype, values, shape, has_null, byteorder="=", df=None):
        """
        Initialize dictionary-encoded column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Dictionary-encoded data type
        - values: Array of unique category values
        - shape: Column shape (rows, ...)
        - has_null: Whether the column contains null values
        - byteorder: Byte order for numeric data
        - df: Source DataFrame for extracting index values

        Raises:
        - ValueError: If the encoded index contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write index values for a chunk of data."""

    def write_extra(self):
        """Write the dictionary values."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
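The storage scheme this writer implements — unique category values in a dictionary, per-row indices stored separately — can be sketched in plain Python. The function below is illustrative only, not part of vaex:

```python
def dictionary_encode(values):
    # Build the dictionary of unique values and the per-row indices
    # into it: conceptually, indices stream out chunk by chunk via
    # write(), while the dictionary is written once via write_extra().
    dictionary, lookup, indices = [], {}, []
    for v in values:
        if v not in lookup:
            lookup[v] = len(dictionary)
            dictionary.append(v)
        indices.append(lookup[v])
    return dictionary, indices

print(dictionary_encode(["red", "blue", "red", "red"]))
# → (['red', 'blue'], [0, 1, 0, 0])
```

For low-cardinality columns this is why dictionary encoding saves space: four strings collapse to two stored values plus small integer indices.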

#### Primitive Column Writer

```python { .api }
class ColumnWriterPrimitive:
    """
    Writer for primitive data types (numeric, datetime, boolean).

    Handles standard data types with optional null bitmaps and masks.
    """
    def __init__(self, h5parent, name, dtype, shape, has_mask, has_null, byteorder="="):
        """
        Initialize primitive column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Data type
        - shape: Column shape (rows, ...)
        - has_mask: Whether the column has a mask array
        - has_null: Whether the column has null values (Arrow format)
        - byteorder: Byte order

        Raises:
        - ValueError: If both has_mask and has_null are True
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of values."""

    def write_extra(self):
        """Write any extra data (no-op for primitives)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
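The `has_null` path stores validity as an Arrow-style bitmap rather than a full boolean mask array. Packing one can be sketched as follows (LSB-first bit order is an assumption here, matching Arrow's convention; this helper is not part of vaex):

```python
def pack_null_bitmap(valid_flags):
    # One bit per row: bit i set means row i is valid (non-null).
    # Eight flags pack into each byte, least-significant bit first.
    bitmap = bytearray((len(valid_flags) + 7) // 8)
    for i, valid in enumerate(valid_flags):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

print(pack_null_bitmap([True, False, True, True]).hex())  # → '0d'
```

A bitmap costs one bit per row versus one byte per row for a mask array, which is why the two representations are mutually exclusive here.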

#### String Column Writer

```python { .api }
class ColumnWriterString:
    """
    Writer for variable-length string columns.

    Uses efficient Arrow-style storage with separate data and index arrays.
    """
    def __init__(self, h5parent, name, dtype, shape, byte_length, has_null):
        """
        Initialize string column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: String data type
        - shape: Column shape (rows,)
        - byte_length: Total bytes needed for all strings
        - has_null: Whether the column contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of string values."""

    def write_extra(self):
        """Write any extra data (no-op for strings)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
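The Arrow-style layout this writer uses — one flat byte array plus an offsets array, where string `i` spans `data[offsets[i]:offsets[i+1]]` — can be sketched in plain Python (illustrative only, not the writer's internals):

```python
def encode_strings(strings):
    # Concatenate UTF-8 bytes and record cumulative end offsets;
    # n strings produce n + 1 offsets. Empty strings cost zero data
    # bytes, only a repeated offset.
    data, offsets = bytearray(), [0]
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return bytes(data), offsets

data, offsets = encode_strings(["hdf5", "vaex", ""])
print(data, offsets)  # → b'hdf5vaex' [0, 4, 8, 8]
```

This also explains the `byte_length` constructor parameter: the total data-array size must be known up front so the space can be pre-allocated during layout.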

## Usage Examples

### Basic High-Performance Writing

```python
from vaex.hdf5.writer import Writer
import vaex

# Load large DataFrame
df = vaex.open('large_dataset.parquet')

# Context manager ensures proper cleanup
with Writer('output.hdf5') as writer:
    writer.layout(df)
    writer.write(df)
```

### Advanced Writing Configuration

```python
with Writer('output.hdf5', group='/data', byteorder='<') as writer:
    # Set up layout with progress tracking
    def layout_progress(fraction):
        print(f"Layout progress: {fraction*100:.1f}%")

    writer.layout(df, progress=layout_progress)

    # Write with custom settings
    def write_progress(fraction):
        print(f"Write progress: {fraction*100:.1f}%")

    writer.write(df,
                 chunk_size=50000,         # Smaller chunks
                 parallel=True,            # Enable parallel processing
                 progress=write_progress,
                 column_count=2,           # Process 2 columns at once
                 export_threads=4)         # Use 4 writer threads
```

### Memory-Constrained Writing

```python
# For very large datasets with limited memory
with Writer('huge_output.hdf5') as writer:
    writer.layout(df)

    # Use smaller chunks and disable threading
    writer.write(df,
                 chunk_size=10000,
                 parallel=False,
                 export_threads=0)
```

### Writing Subsets

```python
# Write a filtered DataFrame
df_filtered = df[df.score > 0.8]

with Writer('filtered_output.hdf5') as writer:
    writer.layout(df_filtered)
    writer.write(df_filtered)
```

### Manual Resource Management

```python
# Without a context manager (not recommended)
writer = Writer('output.hdf5')
try:
    writer.layout(df)
    writer.write(df)
finally:
    writer.close()  # Always close to prevent resource leaks
```

## Performance Optimization

### Chunk Size Selection

Choose a chunk size based on available memory and data characteristics:

```python
# For large datasets with sufficient memory
writer.write(df, chunk_size=1000000)  # 1M rows per chunk

# For memory-constrained environments
writer.write(df, chunk_size=10000)    # 10K rows per chunk

# Chunk size is automatically rounded to a multiple of 8
```
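Rounding to a multiple of 8 keeps per-chunk null bitmaps aligned on whole-byte boundaries. The arithmetic can be illustrated as follows (whether vaex rounds up or down is an assumption; this sketch rounds up):

```python
def round_up_to_multiple_of_8(n):
    # Round n up to the next multiple of 8, so each chunk's validity
    # bits fill complete bytes without straddling chunk boundaries.
    return ((n + 7) // 8) * 8

print(round_up_to_multiple_of_8(100000))  # → 100000 (already aligned)
print(round_up_to_multiple_of_8(50001))   # → 50008
```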

### Parallel Processing

Configure parallelism based on system resources:

```python
# CPU-intensive workloads
writer.write(df, parallel=True, export_threads=4)

# I/O-intensive workloads
writer.write(df, parallel=True, export_threads=1)

# Single-threaded for debugging
writer.write(df, parallel=False, export_threads=0)
```
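One reasonable heuristic for choosing `export_threads` dynamically (illustrative only — vaex does not prescribe this) is to leave a core free for the main evaluation threads and cap the count, since HDF5 writing is usually I/O-bound beyond a few threads:

```python
import os

# Leave one core for vaex's own evaluation threads, cap at 8 to
# avoid oversubscribing disk I/O with too many concurrent writers.
cpu = os.cpu_count() or 1
export_threads = min(max(cpu - 1, 0), 8)
print(export_threads)
```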

### Memory Mapping

The writer automatically uses memory mapping when possible:

- Enables zero-copy operations for maximum performance
- Controlled by the `VAEX_USE_MMAP` environment variable
- Falls back to regular I/O for incompatible storage systems
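Because the flag is read from the environment when the module is imported, set `VAEX_USE_MMAP` before importing vaex. The snippet below stays self-contained by not importing vaex itself, and the accepted value `"0"` to disable is an assumption:

```python
import os

# Disable memory mapping, e.g. when writing to network storage that
# handles mmap poorly. Must be set before `import vaex` runs.
os.environ["VAEX_USE_MMAP"] = "0"

print(os.environ["VAEX_USE_MMAP"])  # → 0
```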

## Constants

```python { .api }
USE_MMAP = True         # Read from the VAEX_USE_MMAP environment variable (default: True)
max_int32 = 2147483647  # Maximum 32-bit signed integer value
```

## Data Type Handling

The writer automatically selects the appropriate column writer for each column:

- **Primitive types** → `ColumnWriterPrimitive`
- **String types** → `ColumnWriterString`
- **Dictionary-encoded types** → `ColumnWriterDictionaryEncoded`
- **Unsupported types** → Raises `TypeError`

## Storage Layout

The writer creates HDF5 files with this structure:

```
/table/                          # Main table group
├── columns/                     # Column data group
│   ├── column1/                 # Individual column group
│   │   └── data                 # Column data array
│   ├── column2/                 # String column example
│   │   ├── data                 # String bytes
│   │   ├── indices              # String offsets
│   │   └── null_bitmap          # Null value bitmap (if needed)
│   └── column3/                 # Dictionary-encoded example
│       ├── indices/             # Category indices
│       │   └── data
│       └── dictionary/          # Category values
│           └── data
└── @attrs
    ├── type: "table"
    └── column_order: "column1,column2,column3"
```

## Error Handling

Writer classes may raise:

- `AssertionError`: If methods are called in the wrong order
- `ValueError`: For invalid parameters or empty DataFrames
- `TypeError`: For unsupported data types
- `OSError`: For file system errors
- `h5py.H5Error`: For HDF5 writing errors
- `MemoryError`: If there is insufficient memory for operations