# Data Export

High-level functions for exporting vaex DataFrames to HDF5 format with support for both version 1 and version 2 formats, various export options, and progress tracking.

## Capabilities

### HDF5 Version 2 Export

The main export function supporting the latest HDF5 format with advanced features.

```python { .api }
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.

    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """
```

### HDF5 Version 1 Export

Legacy export function for compatibility with older vaex versions.

```python { .api }
def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.

    Legacy export function for compatibility. Use export_hdf5() for new projects.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """
```

## Usage Examples

### Basic Export

```python
import vaex
import vaex.hdf5.export  # Import the submodule explicitly so vaex.hdf5.export resolves

# Load DataFrame
df = vaex.from_csv('input.csv')

# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')

# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             column_names=['col1', 'col2', 'col3'])
```

### Export with Options

```python
# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             progress=progress_callback)

# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)

# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)

# Export selection only
df.select(df.score > 0.5)  # Create the default selection on df
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)
```
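
A callback that prints on every update can flood the console on large exports. The following is a minimal sketch of a throttled callback, assuming (as in the example above) that the callback receives a fraction between 0 and 1 and returns `True` to continue. The helper name `make_throttled_progress` and its `min_interval` and `clock` parameters are hypothetical and not part of vaex.

```python
import time


def make_throttled_progress(min_interval=0.5, clock=time.monotonic):
    """Return a progress callback that prints at most once per min_interval seconds.

    Hypothetical helper for illustration; it is not part of vaex.
    """
    last_print = None

    def callback(fraction):
        nonlocal last_print
        now = clock()
        # Always report the final update; otherwise respect the interval.
        if fraction >= 1.0 or last_print is None or now - last_print >= min_interval:
            print(f"Export progress: {fraction * 100:.1f}%")
            last_print = now
        return True  # Continue processing

    return callback
```

The returned function can be passed wherever `progress_callback` is used above, for example as `progress=make_throttled_progress()`.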

### Export with Sorting

```python
# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='timestamp', ascending=True)

# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='score', ascending=False)
```
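
As a conceptual model only, sorting during export can be thought of as computing a row order from the sort column and then writing every column in that order. The sketch below illustrates that idea in plain Python; `sorted_row_order` is a hypothetical name, and this is not how vaex implements the sort internally.

```python
def sorted_row_order(values, ascending=True):
    """Return row indices in the order the rows would be written.

    Conceptual illustration only; not vaex's implementation.
    """
    indices = list(range(len(values)))
    # Sort the indices by the value of the sort column at each row.
    indices.sort(key=lambda i: values[i], reverse=not ascending)
    return indices


scores = [0.2, 0.9, 0.5]
print(sorted_row_order(scores, ascending=True))   # [0, 2, 1]
print(sorted_row_order(scores, ascending=False))  # [1, 2, 0]
```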

### Export Configuration

```python
# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')

# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)

# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)

# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)
```
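
The `byteorder` codes (`"="`, `"<"`, `">"`) are the same characters that Python's standard `struct` module uses for byte order. The sketch below uses `struct` to show what each code means for a single 32-bit integer; the helper `encode_int32` is hypothetical, and how the export functions apply the byte order to whole columns is not shown here.

```python
import struct
import sys


def encode_int32(value, byteorder="="):
    """Pack a 32-bit integer using one of the byte order codes '=', '<' or '>'."""
    if byteorder not in ("=", "<", ">"):
        raise ValueError(f"unsupported byte order: {byteorder!r}")
    return struct.pack(byteorder + "i", value)


print(encode_int32(1, "<"))  # b'\x01\x00\x00\x00' (least significant byte first)
print(encode_int32(1, ">"))  # b'\x00\x00\x00\x01' (most significant byte first)

# "=" follows the machine's native order, reported by sys.byteorder.
native_code = "<" if sys.byteorder == "little" else ">"
print(encode_int32(1, "=") == encode_int32(1, native_code))  # True
```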

### Using DataFrame Export Method

```python
# DataFrames have export method that calls these functions
df.export('output.hdf5')  # Uses export_hdf5 internally

# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)
```

### Legacy Format Export

```python
# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')

# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5',
                                shuffle=True, progress=True)
```

## Constants

```python { .api }
max_length = 100000  # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value
```
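
To make the chunk size concrete, the sketch below splits a row count into ranges of at most `max_length` rows. It illustrates chunked processing in general; the helper `chunk_ranges` is hypothetical, and the assumption that the export code iterates over rows in exactly this way is not confirmed by the documentation above.

```python
max_length = 100000  # Same value as the constant above
max_int32 = 2147483647  # Equal to 2**31 - 1


def chunk_ranges(row_count, chunk_size=max_length):
    """Yield (start, stop) row ranges that together cover row_count rows."""
    for start in range(0, row_count, chunk_size):
        yield start, min(start + chunk_size, row_count)


print(list(chunk_ranges(250000)))
# [(0, 100000), (100000, 200000), (200000, 250000)]
```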

## Data Type Support

The export functions support all vaex data types:

- **Numeric types**: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
- **String types**: Variable-length strings with efficient storage
- **Date/time types**: datetime64 with nanosecond precision
- **Boolean types**: Stored as uint8
- **Categorical types**: Dictionary-encoded strings
- **Sparse matrices**: CSR format sparse data
- **Masked arrays**: Arrays with missing value support

## Export Behavior

### Column Order Preservation

The export functions preserve column order and store it as metadata in the HDF5 file:

```python
# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved
```

### Memory Efficiency

Both export functions use streaming processing to handle datasets larger than available memory:

- Data is processed in chunks to minimize memory usage
- Memory mapping is used when possible for optimal performance
- Temporary files are avoided through direct HDF5 writing

### Metadata Preservation

Export functions preserve DataFrame metadata:

- Column descriptions and units
- Custom metadata and properties
- Data provenance information (user, timestamp, source)

## Error Handling

Export functions may raise:

- `ValueError`: If the dataset is empty or a parameter is invalid
- `OSError`: For file system errors (permissions, disk space)
- `h5py.H5Error`: For HDF5 format or writing errors (legacy h5py only; h5py 2.0 and later generally raise built-in exceptions such as `OSError` instead)
- `MemoryError`: If there is insufficient memory for processing
- `KeyboardInterrupt`: If the user cancels during a progress callback
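
A caller may want to handle the recoverable cases in one place. The following is a minimal sketch, assuming the export function is passed in as a callable (for example `vaex.hdf5.export.export_hdf5`); the wrapper name `safe_export` is hypothetical. `MemoryError` is deliberately left to propagate.

```python
def safe_export(export_fn, df, path, **options):
    """Call export_fn(df, path, **options).

    Returns True on success and False when a handled error occurs.
    Hypothetical wrapper for illustration; it is not part of vaex.
    """
    try:
        export_fn(df, path, **options)
        return True
    except ValueError as exc:
        # For example, an empty dataset.
        print(f"Nothing exported: {exc}")
    except OSError as exc:
        # Permissions, disk space and other file system errors.
        print(f"File system error while writing {path}: {exc}")
    except KeyboardInterrupt:
        print("Export cancelled by user")
    return False
```

Passing the export function as an argument keeps the wrapper usable with both `export_hdf5` and `export_hdf5_v1`.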