# High-Performance Writing

Low-level writer classes for streaming large datasets to HDF5 format with optimal memory usage, parallel writing support, and specialized column writers for different data types.

## Capabilities

### Main Writer Class

The primary interface for high-performance HDF5 writing, with memory mapping and parallel processing support.

```python { .api }
class Writer:
    """
    High-level HDF5 writer optimized for large DataFrame export.

    Provides streaming write capabilities using memory mapping for optimal
    performance and supports parallel column writing.
    """
    def __init__(self, path, group="/table", mode="w", byteorder="="):
        """
        Initialize the HDF5 writer.

        Parameters:
        - path: Output file path
        - group: HDF5 group path for table data (default: "/table")
        - mode: File open mode ("w" for write, "a" for append)
        - byteorder: Byte order ("=" for native, "<" for little-endian, ">" for big-endian)
        """

    def layout(self, df, progress=None):
        """
        Set up the file layout and allocate space for the DataFrame.

        This must be called before write(). Analyzes the DataFrame schema,
        calculates storage requirements, and pre-allocates HDF5 datasets.

        Parameters:
        - df: DataFrame to analyze and prepare for writing
        - progress: Progress callback for layout operations

        Raises:
        AssertionError: If layout() is called twice
        ValueError: If the DataFrame is empty
        """

    def write(self, df, chunk_size=100000, parallel=True, progress=None,
              column_count=1, export_threads=0):
        """
        Write DataFrame data to the HDF5 file.

        Streams data in chunks using memory mapping for optimal performance.

        Parameters:
        - df: DataFrame to write (must match the DataFrame passed to layout())
        - chunk_size: Number of rows to process per chunk (rounded to a multiple of 8)
        - parallel: Enable parallel processing within vaex
        - progress: Progress callback for write operations
        - column_count: Number of columns to process simultaneously
        - export_threads: Number of threads for column writing (0 for single-threaded)

        Raises:
        AssertionError: If layout() was not called first
        ValueError: If the DataFrame is empty
        """

    def close(self):
        """
        Close the writer and clean up resources.

        Closes memory maps, file handles, and the HDF5 file.
        Must be called when finished writing.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, *args):
        """Context manager exit - automatically calls close()."""
```

### Column Writers

Specialized writers for different data types with optimized storage strategies.

#### Dictionary-Encoded Column Writer

```python { .api }
class ColumnWriterDictionaryEncoded:
    """
    Writer for dictionary-encoded (categorical) columns.

    Stores unique values in a dictionary and indices separately
    for efficient storage of categorical data.
    """
    def __init__(self, h5parent, name, dtype, values, shape, has_null, byteorder="=", df=None):
        """
        Initialize a dictionary-encoded column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Dictionary-encoded data type
        - values: Array of unique category values
        - shape: Column shape (rows, ...)
        - has_null: Whether the column contains null values
        - byteorder: Byte order for numeric data
        - df: Source DataFrame for extracting index values

        Raises:
        ValueError: If the encoded index contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write index values for a chunk of data."""

    def write_extra(self):
        """Write the dictionary values."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
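
The dictionary/indices split that this writer stores can be illustrated with NumPy; this is a conceptual sketch of the encoding, not the writer's internal code:

```python
import numpy as np

# A categorical column: a few unique values repeated many times.
colors = np.array(["red", "blue", "red", "green", "blue", "red"])

# Split into a sorted dictionary of unique values plus integer indices
# into it - the two arrays a dictionary-encoded column stores on disk.
dictionary, indices = np.unique(colors, return_inverse=True)

assert list(dictionary) == ["blue", "green", "red"]
assert list(indices) == [2, 0, 2, 1, 0, 2]

# The original column is recoverable from the two arrays.
assert (dictionary[indices] == colors).all()
```

For low-cardinality columns this trades the full value array for small integer indices plus a dictionary written separately (via `write_extra()` in the class above).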

#### Primitive Column Writer

```python { .api }
class ColumnWriterPrimitive:
    """
    Writer for primitive data types (numeric, datetime, boolean).

    Handles standard data types with optional null bitmaps and masks.
    """
    def __init__(self, h5parent, name, dtype, shape, has_mask, has_null, byteorder="="):
        """
        Initialize a primitive column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: Data type
        - shape: Column shape (rows, ...)
        - has_mask: Whether the column has a mask array
        - has_null: Whether the column has null values (Arrow format)
        - byteorder: Byte order

        Raises:
        ValueError: If both has_mask and has_null are True
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of values."""

    def write_extra(self):
        """Write any extra data (no-op for primitives)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
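
The Arrow-format null handling mentioned above stores one validity bit per row, least-significant bit first. A minimal NumPy sketch of such a bitmap (illustrative, not vaex's internals):

```python
import numpy as np

# Validity per row: True = value present, False = null.
valid = np.array([True, True, False, True, False, True, True, True, True])

# Arrow-style bitmaps pack 8 rows per byte, least-significant bit first,
# so 9 rows occupy 2 bytes.
bitmap = np.packbits(valid, bitorder="little")
assert list(bitmap) == [0b11101011, 0b00000001]  # i.e. [235, 1]

# Unpacking recovers the per-row validity (trailing pad bits dropped).
assert list(np.unpackbits(bitmap, bitorder="little")[:9].astype(bool)) == list(valid)
```

A masked (`has_mask`) column stores a separate boolean mask array instead; the constructor rejects combining the two, since they are alternative null representations.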

#### String Column Writer

```python { .api }
class ColumnWriterString:
    """
    Writer for variable-length string columns.

    Uses efficient Arrow-style storage with separate data and index arrays.
    """
    def __init__(self, h5parent, name, dtype, shape, byte_length, has_null):
        """
        Initialize a string column writer.

        Parameters:
        - h5parent: Parent HDF5 group
        - name: Column name
        - dtype: String data type
        - shape: Column shape (rows,)
        - byte_length: Total bytes needed for all strings
        - has_null: Whether the column contains null values
        """

    def mmap(self, mmap, file):
        """Set up memory mapping for writing."""

    def write(self, values):
        """Write a chunk of string values."""

    def write_extra(self):
        """Write any extra data (no-op for strings)."""

    @property
    def progress(self):
        """Get writing progress as a fraction (0-1)."""
```
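
The separate data and index arrays can be sketched with NumPy - an illustration of Arrow-style variable-length storage, not vaex's internal code:

```python
import numpy as np

strings = ["hello", "", "vaex", "hdf5"]

# One flat buffer of UTF-8 bytes for all strings...
data = b"".join(s.encode("utf-8") for s in strings)
assert data == b"hellovaexhdf5"

# ...plus n+1 offsets: string i spans data[offsets[i]:offsets[i+1]].
offsets = np.zeros(len(strings) + 1, dtype=np.int64)
np.cumsum([len(s.encode("utf-8")) for s in strings], out=offsets[1:])
assert list(offsets) == [0, 5, 5, 9, 13]

# Reconstruct the third string; empty strings cost only a repeated offset.
assert data[offsets[2] : offsets[3]].decode("utf-8") == "vaex"
```

The `byte_length` constructor argument corresponds to the total size of this data buffer, which is presumably why `layout()` must analyze all strings before writing begins.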

## Usage Examples

### Basic High-Performance Writing

```python
from vaex.hdf5.writer import Writer
import vaex

# Load large DataFrame
df = vaex.open('large_dataset.parquet')

# Context manager ensures proper cleanup
with Writer('output.hdf5') as writer:
    writer.layout(df)
    writer.write(df)
```

### Advanced Writing Configuration

```python
with Writer('output.hdf5', group='/data', byteorder='<') as writer:
    # Set up layout with progress tracking
    def layout_progress(fraction):
        print(f"Layout progress: {fraction*100:.1f}%")

    writer.layout(df, progress=layout_progress)

    # Write with custom settings
    def write_progress(fraction):
        print(f"Write progress: {fraction*100:.1f}%")

    writer.write(df,
                 chunk_size=50000,        # Smaller chunks
                 parallel=True,           # Enable parallel processing
                 progress=write_progress,
                 column_count=2,          # Process 2 columns at once
                 export_threads=4)        # Use 4 writer threads
```

### Memory-Constrained Writing

```python
# For very large datasets with limited memory
with Writer('huge_output.hdf5') as writer:
    writer.layout(df)

    # Use smaller chunks and disable threading
    writer.write(df,
                 chunk_size=10000,
                 parallel=False,
                 export_threads=0)
```

### Writing Subsets

```python
# Write filtered DataFrame
df_filtered = df[df.score > 0.8]

with Writer('filtered_output.hdf5') as writer:
    writer.layout(df_filtered)
    writer.write(df_filtered)
```

### Manual Resource Management

```python
# Without context manager (not recommended)
writer = Writer('output.hdf5')
try:
    writer.layout(df)
    writer.write(df)
finally:
    writer.close()  # Always close to prevent resource leaks
```

## Performance Optimization

### Chunk Size Selection

Choose chunk size based on available memory and data characteristics:

```python
# For large datasets with sufficient memory
writer.write(df, chunk_size=1000000)  # 1M rows per chunk

# For memory-constrained environments
writer.write(df, chunk_size=10000)    # 10K rows per chunk

# Chunk size is automatically rounded to a multiple of 8
```
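
The rounding to a multiple of 8 keeps chunk boundaries aligned with whole null-bitmap bytes (8 rows per byte). A sketch of such rounding; the direction vaex rounds in is not specified here, so this version rounds up:

```python
def round_chunk_size(chunk_size, multiple=8):
    # Round up to the next multiple of 8 using ceiling division.
    return -(-chunk_size // multiple) * multiple

assert round_chunk_size(100000) == 100000  # already a multiple of 8
assert round_chunk_size(50001) == 50008
assert round_chunk_size(10000) == 10000
```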

### Parallel Processing

Configure parallelism based on system resources:

```python
# CPU-intensive workloads
writer.write(df, parallel=True, export_threads=4)

# I/O-intensive workloads
writer.write(df, parallel=True, export_threads=1)

# Single-threaded for debugging
writer.write(df, parallel=False, export_threads=0)
```

### Memory Mapping

The writer automatically uses memory mapping when possible:

- Enables zero-copy operations for maximum performance
- Controlled by the `VAEX_USE_MMAP` environment variable
- Falls back to regular I/O for incompatible storage systems
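
An environment toggle of this kind is typically parsed to a boolean once at import time. A hedged sketch; the exact strings vaex accepts may differ:

```python
import os

def _use_mmap(env=os.environ):
    # Treat "0", "false", "no" and the empty string (case-insensitive) as off;
    # anything else, including an unset variable, as on.
    return env.get("VAEX_USE_MMAP", "1").lower() not in ("0", "false", "no", "")

USE_MMAP = _use_mmap()
```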

## Constants

```python { .api }
USE_MMAP = True         # Whether to memory-map output; read from the VAEX_USE_MMAP environment variable (default: True)
max_int32 = 2147483647  # Maximum signed 32-bit integer value (2**31 - 1)
```

## Data Type Handling

The writer automatically selects the appropriate column writer:

- **Primitive types** → `ColumnWriterPrimitive`
- **String types** → `ColumnWriterString`
- **Dictionary-encoded types** → `ColumnWriterDictionaryEncoded`
- **Unsupported types** → raises `TypeError`
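
The selection can be pictured as a dispatch table keyed on the column's type category; `select_writer` below is a hypothetical helper, not vaex's actual code:

```python
def select_writer(dtype_kind):
    # Hypothetical mapping from a dtype category to the writer class name.
    writers = {
        "primitive": "ColumnWriterPrimitive",
        "string": "ColumnWriterString",
        "dictionary": "ColumnWriterDictionaryEncoded",
    }
    try:
        return writers[dtype_kind]
    except KeyError:
        raise TypeError(f"no HDF5 column writer for dtype kind {dtype_kind!r}")

assert select_writer("string") == "ColumnWriterString"
```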

## Storage Layout

The writer creates HDF5 files with this structure:

```
/table/                      # Main table group
├── columns/                 # Column data group
│   ├── column1/             # Individual column group
│   │   └── data             # Column data array
│   ├── column2/             # String column example
│   │   ├── data             # String bytes
│   │   ├── indices          # String offsets
│   │   └── null_bitmap      # Null value bitmap (if needed)
│   └── column3/             # Dictionary-encoded example
│       ├── indices/         # Category indices
│       │   └── data
│       └── dictionary/      # Category values
│           └── data
└── @attrs
    ├── type: "table"
    └── column_order: "column1,column2,column3"
```

## Error Handling

Writer classes may raise:

- `AssertionError`: If methods are called in the wrong order
- `ValueError`: For invalid parameters or empty DataFrames
- `TypeError`: For unsupported data types
- `OSError`: For file system errors and HDF5 I/O failures (h5py surfaces HDF5 errors as `OSError`)
- `MemoryError`: If there is insufficient memory for the operation
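
A sketch of guarding an export against these errors; `StubWriter` is a stand-in for the real `Writer`, so the pattern is runnable without vaex or an HDF5 file:

```python
class StubWriter:
    """Stand-in mimicking the Writer call order described above."""
    def __init__(self):
        self._laid_out = False

    def layout(self, n_rows):
        if n_rows == 0:
            raise ValueError("DataFrame is empty")
        self._laid_out = True

    def write(self, n_rows):
        assert self._laid_out, "layout() must be called before write()"

    def close(self):
        pass

def export(n_rows):
    writer = StubWriter()
    try:
        writer.layout(n_rows)
        writer.write(n_rows)
        return "ok"
    except ValueError as exc:   # invalid input, e.g. an empty DataFrame
        return f"bad input: {exc}"
    except OSError as exc:      # file system or HDF5 I/O failure
        return f"I/O failure: {exc}"
    finally:
        writer.close()          # always release resources

assert export(100) == "ok"
assert export(0) == "bad input: DataFrame is empty"
```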