# vaex-hdf5

HDF5 file support for Vaex, the high-performance Python library for lazy, out-of-core DataFrame operations on large datasets. The package provides memory-mapped HDF5 reading with zero-copy access patterns, supports several HDF5 dataset formats, including scientific data from Gadget simulations and the AMUSE astrophysics framework, and offers efficient export of DataFrames to HDF5.

## Package Information

- **Package Name**: vaex-hdf5
- **Language**: Python
- **Installation**: `pip install vaex-hdf5`
- **Dependencies**: h5py>=2.9, vaex-core>=4.0.0,<5

## Core Imports

```python
import vaex.hdf5.dataset
import vaex.hdf5.export
import vaex.hdf5.writer
import vaex.hdf5.utils
```

For direct dataset access:

```python
from vaex.hdf5.dataset import Hdf5MemoryMapped, AmuseHdf5MemoryMapped, Hdf5MemoryMappedGadget
```

## Basic Usage

```python
import vaex

# Reading HDF5 files (automatic detection via vaex.open)
df = vaex.open('data.hdf5')

# Reading specialized formats
df_amuse = vaex.open('simulation.hdf5')   # AMUSE format auto-detected
df_gadget = vaex.open('snapshot.hdf5#0')  # Gadget format with particle type

# Exporting to HDF5
df = vaex.from_csv('data.csv')
df.export('output.hdf5')

# Manual dataset creation
from vaex.hdf5.dataset import Hdf5MemoryMapped
dataset = Hdf5MemoryMapped.create('new_file.hdf5', N=1000,
                                  column_names=['x', 'y', 'z'])

# High-performance writing with Writer
from vaex.hdf5.writer import Writer
with Writer('output.hdf5') as writer:
    writer.layout(df)
    writer.write(df)
```

## Architecture

The vaex-hdf5 package is built around several key components:

- **Dataset Readers**: Memory-mapped HDF5 dataset classes that provide zero-copy access to data
- **Export Functions**: High-level functions for exporting vaex DataFrames to HDF5 format
- **Writer Classes**: Low-level writers for efficient streaming data export
- **Entry Points**: Automatic format detection and registration with vaex core

The package integrates with the broader Vaex ecosystem through entry points that register HDF5 dataset openers, enabling automatic format detection. Lazy evaluation and memory mapping keep performance high even on billion-row datasets.

## Capabilities

### HDF5 Dataset Reading

Memory-mapped reading of HDF5 files with support for standard vaex format, AMUSE scientific data format, and Gadget2 simulation format. Provides zero-copy access patterns and automatic format detection.

```python { .api }
class Hdf5MemoryMapped:
    def __init__(self, path, write=False, fs_options={}, fs=None, nommap=None, group=None, _fingerprint=None): ...
    @classmethod
    def create(cls, path, N, column_names, dtypes=None, write=True): ...
    @classmethod
    def can_open(cls, path, fs_options={}, fs=None, group=None, **kwargs): ...
    def write_meta(self): ...
    def close(self): ...

class AmuseHdf5MemoryMapped(Hdf5MemoryMapped):
    def __init__(self, path, write=False, fs_options={}, fs=None): ...

class Hdf5MemoryMappedGadget(DatasetMemoryMapped):
    def __init__(self, path, particle_name=None, particle_type=None, fs_options={}, fs=None): ...
```

[HDF5 Dataset Reading](./dataset-reading.md)
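As the Basic Usage section shows, Gadget snapshots are addressed with a `#` fragment that selects the particle type (e.g. `snapshot.hdf5#0`). Here is a minimal sketch of splitting such a path, assuming the fragment is an integer particle type; `split_gadget_path` is a hypothetical helper, not part of the vaex API:

```python
from typing import Optional, Tuple

def split_gadget_path(path: str) -> Tuple[str, Optional[int]]:
    """Split 'snapshot.hdf5#0' into ('snapshot.hdf5', 0).

    Returns (path, None) when no particle-type fragment is present.
    """
    base, sep, fragment = path.partition('#')
    return (base, int(fragment)) if sep else (base, None)

print(split_gadget_path('snapshot.hdf5#0'))  # ('snapshot.hdf5', 0)
print(split_gadget_path('data.hdf5'))        # ('data.hdf5', None)
```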

### Data Export

High-level functions for exporting vaex DataFrames to HDF5 format with support for both version 1 and version 2 formats, compression options, and streaming export for large datasets.

```python { .api }
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True): ...

def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True): ...
```

[Data Export](./data-export.md)
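The `shuffle` flag randomizes row order during export. As an illustration of what that implies, every row is written exactly once, in random order, here is a hedged sketch of generating such a permutation; it is not vaex's implementation:

```python
import random

def shuffled_row_order(n_rows, seed=42):
    """Produce a random permutation of row indices -- the kind of
    reordering that shuffle=True implies during export."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return order

order = shuffled_row_order(10)
print(sorted(order) == list(range(10)))  # True: every row appears once
```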

### High-Performance Writing

Low-level writer classes for streaming large datasets to HDF5 format with optimal memory usage, parallel writing support, and specialized column writers for different data types.

```python { .api }
class Writer:
    def __init__(self, path, group="/table", mode="w", byteorder="="): ...
    def layout(self, df, progress=None): ...
    def write(self, df, chunk_size=int(1e5), parallel=True, progress=None,
              column_count=1, export_threads=0): ...
    def close(self): ...
    def __enter__(self): ...
    def __exit__(self, *args): ...
```

[High-Performance Writing](./high-performance-writing.md)
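`Writer.write` streams data in chunks (`chunk_size=int(1e5)` by default). A minimal sketch of the chunk-boundary arithmetic such streaming relies on, not the actual vaex code:

```python
def chunk_ranges(n_rows, chunk_size=100_000):
    """Yield (start, stop) row slices covering n_rows in order;
    the final chunk may be shorter than chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

print(list(chunk_ranges(250_000)))
# [(0, 100000), (100000, 200000), (200000, 250000)]
```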

### Memory Mapping Utilities

Low-level utilities for memory mapping HDF5 datasets and arrays with support for masked arrays and different storage layouts.

```python { .api }
def mmap_array(mmap, file, offset, dtype, shape): ...
def h5mmap(mmap, file, data, mask=None): ...
```

[Memory Mapping Utilities](./memory-mapping-utils.md)
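`mmap_array` and `h5mmap` build on OS-level memory mapping. Below is a conceptual stdlib-only sketch of reading values at a byte offset through a memory map: the mapping itself is zero-copy and pages are faulted in on demand, while the final unpack copies only the few values inspected. This illustrates the general technique, not vaex's internals:

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file: an 8-byte header followed by three float64
# values, mimicking a fixed-offset dataset layout.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(struct.pack('<8x3d', 1.0, 2.5, 4.0))

# Map the file and read the floats at offset 8 without first loading the
# whole file into a Python-level buffer.
with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    values = struct.unpack_from('<3d', mm, offset=8)

os.remove(path)
print(values)  # (1.0, 2.5, 4.0)
```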

## Types

```python { .api }
# Common type aliases used throughout the API
from pathlib import Path
from typing import Any, Callable, Dict, Literal, Union

PathLike = Union[str, Path]
FileSystemOptions = Dict[str, Any]
FileSystem = Any  # fsspec filesystem
ProgressCallback = Callable[[float], bool]
ByteOrder = Literal["=", "<", ">"]
```
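A `ProgressCallback` receives the completed fraction and returns a bool; in vaex's convention, returning `False` typically requests cancellation. A small sketch of a compatible callback (the helper name is ours, not part of the API):

```python
def make_progress_printer(width=20):
    """Build a ProgressCallback-compatible function: it receives the
    completed fraction (0.0-1.0) and returns True to keep going."""
    def on_progress(fraction: float) -> bool:
        filled = int(round(fraction * width))
        print(f"[{'#' * filled}{'.' * (width - filled)}] {fraction:5.1%}")
        return True  # returning False would request cancellation
    return on_progress

cb = make_progress_printer()
for frac in (0.0, 0.5, 1.0):
    cb(frac)
```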