# vaex-hdf5

HDF5 file support for Vaex, the high-performance Python library for lazy, out-of-core DataFrame operations on large datasets. It provides memory-mapped HDF5 reading with zero-copy access patterns, supports specialized HDF5 formats including scientific data from Gadget simulations and the AMUSE astrophysics framework, and offers efficient export of DataFrames to HDF5.

## Package Information

- **Package Name**: vaex-hdf5
- **Language**: Python
- **Installation**: `pip install vaex-hdf5`
- **Dependencies**: `h5py>=2.9`, `vaex-core>=4.0.0,<5`

## Core Imports

```python
import vaex.hdf5.dataset
import vaex.hdf5.export
import vaex.hdf5.writer
import vaex.hdf5.utils
```

For direct dataset access:

```python
from vaex.hdf5.dataset import Hdf5MemoryMapped, AmuseHdf5MemoryMapped, Hdf5MemoryMappedGadget
```

## Basic Usage

```python
import vaex

# Reading HDF5 files (automatic detection via vaex.open)
df = vaex.open('data.hdf5')

# Reading specialized formats
df_amuse = vaex.open('simulation.hdf5')   # AMUSE format auto-detected
df_gadget = vaex.open('snapshot.hdf5#0')  # Gadget format with particle type

# Exporting to HDF5
df = vaex.from_csv('data.csv')
df.export('output.hdf5')

# Manual dataset creation
from vaex.hdf5.dataset import Hdf5MemoryMapped
dataset = Hdf5MemoryMapped.create('new_file.hdf5', N=1000,
                                  column_names=['x', 'y', 'z'])

# High-performance writing with Writer
from vaex.hdf5.writer import Writer
with Writer('output.hdf5') as writer:
    writer.layout(df)
    writer.write(df)
```

## Architecture

The vaex-hdf5 package is built around several key components:

- **Dataset Readers**: Memory-mapped HDF5 dataset classes that provide zero-copy access to data
- **Export Functions**: High-level functions for exporting vaex DataFrames to HDF5 format
- **Writer Classes**: Low-level writers for efficient streaming data export
- **Entry Points**: Automatic format detection and registration with vaex core

The package integrates with the broader Vaex ecosystem through entry points that register HDF5 dataset openers with vaex core, enabling automatic format detection. Lazy evaluation and memory mapping keep performance high even on billion-row datasets.
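
The opener registration and dispatch described above can be sketched with a toy registry. This is a schematic of the pattern only; `FakeHdf5Opener`, `openers`, and `open_any` are illustrative names, not the actual vaex API:

```python
# Minimal sketch of opener registration and dispatch, analogous to how
# vaex-hdf5 registers its HDF5 openers with vaex core via entry points.
# All names here are illustrative, not real vaex identifiers.
openers = []

class FakeHdf5Opener:
    @staticmethod
    def can_open(path):
        return path.endswith(('.hdf5', '.h5'))

    @staticmethod
    def open(path):
        return f"dataset from {path}"

openers.append(FakeHdf5Opener)

def open_any(path):
    # Ask each registered opener whether it recognizes the path.
    for opener in openers:
        if opener.can_open(path):
            return opener.open(path)
    raise IOError(f"no opener found for {path!r}")

print(open_any('data.hdf5'))  # dataset from data.hdf5
```

In the real package this dispatch happens inside `vaex.open`, which consults the openers registered through the package's entry points.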

## Capabilities

### HDF5 Dataset Reading

Memory-mapped reading of HDF5 files with support for the standard vaex format, the AMUSE scientific data format, and the Gadget2 simulation format. Provides zero-copy access patterns and automatic format detection.

```python { .api }
class Hdf5MemoryMapped:
    def __init__(self, path, write=False, fs_options={}, fs=None, nommap=None, group=None, _fingerprint=None): ...
    @classmethod
    def create(cls, path, N, column_names, dtypes=None, write=True): ...
    @classmethod
    def can_open(cls, path, fs_options={}, fs=None, group=None, **kwargs): ...
    def write_meta(self): ...
    def close(self): ...

class AmuseHdf5MemoryMapped(Hdf5MemoryMapped):
    def __init__(self, path, write=False, fs_options={}, fs=None): ...

class Hdf5MemoryMappedGadget(DatasetMemoryMapped):
    def __init__(self, path, particle_name=None, particle_type=None, fs_options={}, fs=None): ...
```

[HDF5 Dataset Reading](./dataset-reading.md)
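
The `#<n>` suffix used with Gadget files (as in `snapshot.hdf5#0` from Basic Usage) selects a particle type. A standard-library sketch of how such a path can be split; the helper name is illustrative and the real parsing lives inside vaex-hdf5:

```python
# Split a 'path#fragment' string into the file path and an optional
# integer particle type. split_gadget_path is an illustrative name.
def split_gadget_path(path):
    base, sep, fragment = path.partition('#')
    particle_type = int(fragment) if sep else None
    return base, particle_type

print(split_gadget_path('snapshot.hdf5#0'))  # ('snapshot.hdf5', 0)
print(split_gadget_path('data.hdf5'))        # ('data.hdf5', None)
```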

### Data Export

High-level functions for exporting vaex DataFrames to HDF5 format, with support for both version 1 and version 2 formats, compression options, and streaming export for large datasets.

```python { .api }
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True): ...

def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True): ...
```

[Data Export](./data-export.md)
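
The `byteorder` parameter uses the same convention as `struct` and numpy: `"="` native, `"<"` little-endian, `">"` big-endian. A quick standard-library illustration of what the choice affects on disk:

```python
import struct

# The same float64 value encoded in both byte orders: the two
# encodings are byte-reversed images of each other.
value = 1.0
little = struct.pack("<d", value)  # 8 bytes, little-endian
big = struct.pack(">d", value)     # 8 bytes, big-endian

print(little == big[::-1])                    # True
print(struct.unpack("<d", little)[0] == 1.0)  # True
```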

### High-Performance Writing

Low-level writer classes for streaming large datasets to HDF5 format with optimal memory usage, parallel writing support, and specialized column writers for different data types.

```python { .api }
class Writer:
    def __init__(self, path, group="/table", mode="w", byteorder="="): ...
    def layout(self, df, progress=None): ...
    def write(self, df, chunk_size=int(1e5), parallel=True, progress=None,
              column_count=1, export_threads=0): ...
    def close(self): ...
    def __enter__(self): ...
    def __exit__(self, *args): ...
```

[High-Performance Writing](./high-performance-writing.md)
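
The two-phase layout-then-write protocol of `Writer` can be sketched without any HDF5 dependency. This is a schematic of the pattern only, with illustrative names, not the real implementation:

```python
# Phase 1 (layout): reserve per-column storage for the full length.
def layout(columns):
    n = len(next(iter(columns.values())))
    return {name: [None] * n for name in columns}

# Phase 2 (write): fill the reserved storage chunk by chunk, so peak
# memory is bounded by the chunk size rather than the dataset size.
def write(columns, out, chunk_size=2):
    n = len(next(iter(columns.values())))
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        for name, data in columns.items():
            out[name][start:stop] = data[start:stop]
    return out

data = {'x': [1, 2, 3, 4, 5], 'y': [10, 20, 30, 40, 50]}
result = write(data, layout(data))
print(result == data)  # True
```

Separating layout from write is what lets the real `Writer` preallocate the HDF5 file up front and then stream chunks into it, optionally in parallel.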

### Memory Mapping Utilities

Low-level utilities for memory mapping HDF5 datasets and arrays, with support for masked arrays and different storage layouts.

```python { .api }
def mmap_array(mmap, file, offset, dtype, shape): ...
def h5mmap(mmap, file, data, mask=None): ...
```

[Memory Mapping Utilities](./memory-mapping-utils.md)
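
The zero-copy idea behind `mmap_array` can be illustrated with the standard library alone: map a file and reinterpret its bytes as an array of doubles without copying. This is a sketch of the concept, not the vaex implementation:

```python
import mmap
import os
import struct
import tempfile

# Write five float64 values (native byte order) to a scratch file.
values = [0.0, 1.5, 3.0, 4.5, 6.0]
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(struct.pack('5d', *values))

# Map the file and view its bytes as doubles without copying: this
# mirrors what mmap_array does given an offset, dtype, and shape.
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast('d')  # zero-copy reinterpretation
    data = list(view)
    view.release()
    mm.close()

os.unlink(path)
print(data)  # [0.0, 1.5, 3.0, 4.5, 6.0]
```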

## Types

```python { .api }
from pathlib import Path
from typing import Any, Callable, Dict, Literal, Union

# Common type aliases used throughout the API
PathLike = Union[str, Path]
FileSystemOptions = Dict[str, Any]
FileSystem = Any  # fsspec filesystem
ProgressCallback = Callable[[float], bool]
ByteOrder = Literal["=", "<", ">"]
```
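
As an illustration of the `ProgressCallback` alias: a callback receives the fraction of completed work and returns a boolean, where `False` requests cancellation. The 0.8 threshold below is purely illustrative:

```python
from typing import Callable

ProgressCallback = Callable[[float], bool]

# A progress callback receives a fraction in [0, 1] and returns True
# to continue or False to cancel; the 0.8 cutoff is illustrative.
def progress(fraction: float) -> bool:
    return fraction < 0.8

steps = [0.25, 0.5, 0.75, 1.0]
status = [progress(f) for f in steps]
print(status)  # [True, True, True, False]
```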