# High-Performance Writing

Low-level writer classes for streaming large datasets to HDF5 format with optimal memory usage, parallel writing support, and specialized column writers for different data types.

## Capabilities

### Main Writer Class

The primary interface for high-performance HDF5 writing, with memory mapping and parallel processing support.

```python { .api }
class Writer:
    """
    High-level HDF5 writer optimized for large DataFrame export.

    Provides streaming write capabilities using memory mapping for optimal
    performance and supports parallel column writing.
    """
    def __init__(self, path, group="/table", mode="w", byteorder="="):
        """
        Initialize the HDF5 writer.

        Parameters:
        - path: Output file path
        - group: HDF5 group path for table data (default: "/table")
        - mode: File open mode ("w" for write, "a" for append)
        - byteorder: Byte order ("=" for native, "<" for little-endian, ">" for big-endian)
        """

    def layout(self, df, progress=None):
        """
        Set up the file layout and allocate space for the DataFrame.

        This must be called before write(). Analyzes the DataFrame schema,
        calculates storage requirements, and pre-allocates HDF5 datasets.

        Parameters:
        - df: DataFrame to analyze and prepare for writing
        - progress: Progress callback for layout operations

        Raises:
        AssertionError: If layout() is called twice
        ValueError: If the DataFrame is empty
        """

    def write(self, df, chunk_size=100000, parallel=True, progress=None,
              column_count=1, export_threads=0):
        """
        Write DataFrame data to the HDF5 file.

        Streams data in chunks using memory mapping for optimal performance.

        Parameters:
        - df: DataFrame to write (must match the DataFrame passed to layout())
        - chunk_size: Number of rows to process per chunk (rounded to a multiple of 8)
        - parallel: Enable parallel processing within vaex
        - progress: Progress callback for write operations
        - column_count: Number of columns to process simultaneously
        - export_threads: Number of threads for column writing (0 for single-threaded)

        Raises:
        AssertionError: If layout() was not called first
        ValueError: If the DataFrame is empty
        """

    def close(self):
        """
        Close the writer and clean up resources.

        Closes memory maps, file handles, and the HDF5 file.
        Must be called when finished writing.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, *args):
        """Context manager exit - automatically calls close()."""
```

### Column Writers

Specialized writers for different data types with optimized storage strategies.

#### Dictionary-Encoded Column Writer

```python { .api }
class ColumnWriterDictionaryEncoded:
    """
    Writer for dictionary-encoded (categorical) columns.

    Stores unique values in a dictionary and indices separately
    for efficient storage of categorical data.
    """
    def __init__(self, h5parent, name, dtype, values, shape, has_null, byteorder="=", df=None):
        """
        Initialize a dictionary-encoded column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Dictionary-encoded data type
        - values: Array of unique category values
        - shape: Column shape (rows, ...)
        - has_null: Whether the column contains null values
        - byteorder: Byte order for numeric data
        - df: Source DataFrame for extracting index values

        Raises:
        ValueError: If the encoded index contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write index values for a chunk of data."""

    def write_extra(self):
        """Write the dictionary values."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
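
The dictionary/indices split that this writer stores can be illustrated with NumPy; this is a conceptual sketch of the encoding, not the writer's internal code:

```python
import numpy as np

# A categorical column: a few unique values repeated many times.
colors = np.array(["red", "blue", "red", "green", "blue", "red"])

# Split into a sorted dictionary of unique values plus integer indices
# into it - the two arrays a dictionary-encoded column stores on disk.
dictionary, indices = np.unique(colors, return_inverse=True)

assert list(dictionary) == ["blue", "green", "red"]
assert list(indices) == [2, 0, 2, 1, 0, 2]

# The original column is recoverable from the two arrays.
assert (dictionary[indices] == colors).all()
```

For low-cardinality columns this trades the full value array for small integer indices plus a dictionary written separately (via `write_extra()` in the class above).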

#### Primitive Column Writer

```python { .api }
class ColumnWriterPrimitive:
    """
    Writer for primitive data types (numeric, datetime, boolean).

    Handles standard data types with optional null bitmaps and masks.
    """
    def __init__(self, h5parent, name, dtype, shape, has_mask, has_null, byteorder="="):
        """
        Initialize a primitive column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Data type
        - shape: Column shape (rows, ...)
        - has_mask: Whether the column has a mask array
        - has_null: Whether the column has null values (Arrow format)
        - byteorder: Byte order

        Raises:
        ValueError: If both has_mask and has_null are True
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of values."""

    def write_extra(self):
        """Write any extra data (no-op for primitives)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
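
The Arrow-format null handling mentioned above stores one validity bit per row, least-significant bit first. A minimal NumPy sketch of such a bitmap (illustrative, not vaex's internals):

```python
import numpy as np

# Validity per row: True = value present, False = null.
valid = np.array([True, True, False, True, False, True, True, True, True])

# Arrow-style bitmaps pack 8 rows per byte, least-significant bit first,
# so 9 rows occupy 2 bytes.
bitmap = np.packbits(valid, bitorder="little")
assert list(bitmap) == [0b11101011, 0b00000001]  # i.e. [235, 1]

# Unpacking recovers the per-row validity (trailing pad bits dropped).
assert list(np.unpackbits(bitmap, bitorder="little")[:9].astype(bool)) == list(valid)
```

A masked (`has_mask`) column stores a separate boolean mask array instead; the constructor rejects combining the two, since they are alternative null representations.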

#### String Column Writer

```python { .api }
class ColumnWriterString:
    """
    Writer for variable-length string columns.

    Uses efficient Arrow-style storage with separate data and index arrays.
    """
    def __init__(self, h5parent, name, dtype, shape, byte_length, has_null):
        """
        Initialize a string column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: String data type
        - shape: Column shape (rows,)
        - byte_length: Total bytes needed for all strings
        - has_null: Whether the column contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of string values."""

    def write_extra(self):
        """Write any extra data (no-op for strings)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
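
The separate data and index arrays can be sketched with NumPy - an illustration of Arrow-style variable-length storage, not vaex's internal code:

```python
import numpy as np

strings = ["hello", "", "vaex", "hdf5"]

# One flat buffer of UTF-8 bytes for all strings...
data = b"".join(s.encode("utf-8") for s in strings)
assert data == b"hellovaexhdf5"

# ...plus n+1 offsets: string i spans data[offsets[i]:offsets[i+1]].
offsets = np.zeros(len(strings) + 1, dtype=np.int64)
np.cumsum([len(s.encode("utf-8")) for s in strings], out=offsets[1:])
assert list(offsets) == [0, 5, 5, 9, 13]

# Reconstruct the third string; empty strings cost only a repeated offset.
assert data[offsets[2] : offsets[3]].decode("utf-8") == "vaex"
```

The `byte_length` constructor argument corresponds to the total size of this data buffer, which is presumably why `layout()` must analyze all strings before writing begins.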

## Usage Examples

### Basic High-Performance Writing

```python
from vaex.hdf5.writer import Writer
import vaex

# Load large DataFrame
df = vaex.open('large_dataset.parquet')

# Context manager ensures proper cleanup
with Writer('output.hdf5') as writer:
    writer.layout(df)
    writer.write(df)
```

### Advanced Writing Configuration

```python
with Writer('output.hdf5', group='/data', byteorder='<') as writer:
    # Set up layout with progress tracking
    def layout_progress(fraction):
        print(f"Layout progress: {fraction*100:.1f}%")

    writer.layout(df, progress=layout_progress)

    # Write with custom settings
    def write_progress(fraction):
        print(f"Write progress: {fraction*100:.1f}%")

    writer.write(df,
                 chunk_size=50000,        # Smaller chunks
                 parallel=True,           # Enable parallel processing
                 progress=write_progress,
                 column_count=2,          # Process 2 columns at once
                 export_threads=4)        # Use 4 writer threads
```

### Memory-Constrained Writing

```python
# For very large datasets with limited memory
with Writer('huge_output.hdf5') as writer:
    writer.layout(df)

    # Use smaller chunks and disable threading
    writer.write(df,
                 chunk_size=10000,
                 parallel=False,
                 export_threads=0)
```

### Writing Subsets

```python
# Write filtered DataFrame
df_filtered = df[df.score > 0.8]

with Writer('filtered_output.hdf5') as writer:
    writer.layout(df_filtered)
    writer.write(df_filtered)
```

### Manual Resource Management

```python
# Without context manager (not recommended)
writer = Writer('output.hdf5')
try:
    writer.layout(df)
    writer.write(df)
finally:
    writer.close()  # Always close to prevent resource leaks
```

## Performance Optimization

### Chunk Size Selection

Choose chunk size based on available memory and data characteristics:

```python
# For large datasets with sufficient memory
writer.write(df, chunk_size=1000000)  # 1M rows per chunk

# For memory-constrained environments
writer.write(df, chunk_size=10000)    # 10K rows per chunk

# Chunk size is automatically rounded to a multiple of 8
```
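
The rounding to a multiple of 8 keeps chunk boundaries aligned with whole null-bitmap bytes (8 rows per byte). A sketch of such rounding; the direction vaex rounds in is not specified here, so this version rounds up:

```python
def round_chunk_size(chunk_size, multiple=8):
    # Round up to the next multiple of 8 using ceiling division.
    return -(-chunk_size // multiple) * multiple

assert round_chunk_size(100000) == 100000  # already a multiple of 8
assert round_chunk_size(50001) == 50008
assert round_chunk_size(10000) == 10000
```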

### Parallel Processing

Configure parallelism based on system resources:

```python
# CPU-intensive workloads
writer.write(df, parallel=True, export_threads=4)

# I/O-intensive workloads
writer.write(df, parallel=True, export_threads=1)

# Single-threaded for debugging
writer.write(df, parallel=False, export_threads=0)
```

### Memory Mapping

The writer automatically uses memory mapping when possible:

- Enables zero-copy operations for maximum performance
- Controlled by the `VAEX_USE_MMAP` environment variable
- Falls back to regular I/O for incompatible storage systems
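
An environment toggle of this kind is typically parsed to a boolean once at import time. A hedged sketch; the exact strings vaex accepts may differ:

```python
import os

def _use_mmap(env=os.environ):
    # Treat "0", "false", "no" and the empty string (case-insensitive) as off;
    # anything else, including an unset variable, as on.
    return env.get("VAEX_USE_MMAP", "1").lower() not in ("0", "false", "no", "")

USE_MMAP = _use_mmap()
```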

## Constants

```python { .api }
USE_MMAP = True         # Whether to memory-map output; read from the VAEX_USE_MMAP environment variable (default: True)
max_int32 = 2147483647  # Maximum signed 32-bit integer value (2**31 - 1)
```

## Data Type Handling

The writer automatically selects the appropriate column writer:

- **Primitive types** → `ColumnWriterPrimitive`
- **String types** → `ColumnWriterString`
- **Dictionary-encoded types** → `ColumnWriterDictionaryEncoded`
- **Unsupported types** → raises `TypeError`
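
The selection can be pictured as a dispatch table keyed on the column's type category; `select_writer` below is a hypothetical helper, not vaex's actual code:

```python
def select_writer(dtype_kind):
    # Hypothetical mapping from a dtype category to the writer class name.
    writers = {
        "primitive": "ColumnWriterPrimitive",
        "string": "ColumnWriterString",
        "dictionary": "ColumnWriterDictionaryEncoded",
    }
    try:
        return writers[dtype_kind]
    except KeyError:
        raise TypeError(f"no HDF5 column writer for dtype kind {dtype_kind!r}")

assert select_writer("string") == "ColumnWriterString"
```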

## Storage Layout

The writer creates HDF5 files with this structure:

```
/table/                      # Main table group
├── columns/                 # Column data group
│   ├── column1/             # Individual column group
│   │   └── data             # Column data array
│   ├── column2/             # String column example
│   │   ├── data             # String bytes
│   │   ├── indices          # String offsets
│   │   └── null_bitmap      # Null value bitmap (if needed)
│   └── column3/             # Dictionary-encoded example
│       ├── indices/         # Category indices
│       │   └── data
│       └── dictionary/      # Category values
│           └── data
└── @attrs
    ├── type: "table"
    └── column_order: "column1,column2,column3"
```

## Error Handling

Writer classes may raise:

- `AssertionError`: If methods are called in the wrong order
- `ValueError`: For invalid parameters or empty DataFrames
- `TypeError`: For unsupported data types
- `OSError`: For file system errors and HDF5 I/O failures (h5py surfaces HDF5 errors as `OSError`)
- `MemoryError`: If there is insufficient memory for the operation
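
A sketch of guarding an export against these errors; `StubWriter` is a stand-in for the real `Writer`, so the pattern is runnable without vaex or an HDF5 file:

```python
class StubWriter:
    """Stand-in mimicking the Writer call order described above."""
    def __init__(self):
        self._laid_out = False

    def layout(self, n_rows):
        if n_rows == 0:
            raise ValueError("DataFrame is empty")
        self._laid_out = True

    def write(self, n_rows):
        assert self._laid_out, "layout() must be called before write()"

    def close(self):
        pass

def export(n_rows):
    writer = StubWriter()
    try:
        writer.layout(n_rows)
        writer.write(n_rows)
        return "ok"
    except ValueError as exc:   # invalid input, e.g. an empty DataFrame
        return f"bad input: {exc}"
    except OSError as exc:      # file system or HDF5 I/O failure
        return f"I/O failure: {exc}"
    finally:
        writer.close()          # always release resources

assert export(100) == "ok"
assert export(0) == "bad input: DataFrame is empty"
```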