# Data Export

High-level functions for exporting vaex DataFrames to HDF5, covering both the version 1 and version 2 file formats, with options for column selection, byte order, shuffling, sorting, and progress tracking.

## Capabilities

### HDF5 Version 2 Export

The main export function. It writes the version 2 format and is the only one of the two functions that accepts the `sort`, `ascending`, and `parallel` options.

```python { .api }
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.

    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)

    Raises:
        ValueError: If dataset is empty (cannot export empty table)
    """
```
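
The `byteorder` codes are the same characters used by NumPy dtype strings and by Python's standard `struct` module. As a small illustration that does not involve vaex at all, the sketch below packs one 32-bit integer with each of the three codes. The helper name `pack_int32` is made up for this example.

```python
import struct
import sys

def pack_int32(value, byteorder="="):
    """Pack one 32-bit integer using a byte-order code ("=", "<" or ">")."""
    if byteorder not in ("=", "<", ">"):
        raise ValueError(f"unsupported byteorder: {byteorder!r}")
    return struct.pack(byteorder + "i", value)

little = pack_int32(1, "<")   # least significant byte first
big = pack_int32(1, ">")      # most significant byte first
native = pack_int32(1, "=")   # whichever order the host CPU uses

print(little.hex(), big.hex(), native.hex())
print("host byte order:", sys.byteorder)
```

On a little-endian host the native result matches the `"<"` result, which is why `"="` is usually the right default unless the file must be read on a machine with a different architecture.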
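
### HDF5 Version 1 Export

Legacy export function that writes the version 1 format for compatibility with older vaex versions. It takes the same arguments as `export_hdf5` except `sort`, `ascending`, and `parallel`.
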
```python { .api }
def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.

    Legacy export function for compatibility. Use export_hdf5() for new projects.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)

    Raises:
        ValueError: If dataset is empty (cannot export empty table)
    """
```

## Usage Examples

### Basic Export

```python
import vaex
import vaex.hdf5.export  # make the export submodule available explicitly

# Load DataFrame
df = vaex.from_csv('input.csv')

# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')

# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             column_names=['col1', 'col2', 'col3'])
```

### Export with Options

```python
# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             progress=progress_callback)

# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)

# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)

# Export selection only
df.select(df.score > 0.5)  # Create the default selection
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)
```
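
The callback contract used above (called with a fraction between 0 and 1, returning a truthy value to continue) can be exercised without vaex. The sketch below is a stand-in loop, not vaex's implementation: `run_with_progress` is a hypothetical helper, and stopping on a falsy return value is an assumption about how cancellation would behave.

```python
def run_with_progress(total_steps, progress=None):
    """Run `total_steps` units of work, reporting fractions to `progress`.

    Returns the number of steps completed. Stops early if the callback
    returns a falsy value (assumed cancellation convention).
    """
    done = 0
    for step in range(total_steps):
        done = step + 1  # pretend one chunk was written here
        if progress is not None and not progress(done / total_steps):
            break
    return done

seen = []

def record(fraction):
    seen.append(fraction)
    return True  # keep going

def stop_halfway(fraction):
    return fraction < 0.5  # falsy once half the work is done

completed_all = run_with_progress(4, progress=record)
completed_partial = run_with_progress(4, progress=stop_halfway)
print(completed_all, seen)
print(completed_partial)
```

Keeping the callback cheap matters in practice, since it may be invoked many times during a large export.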

### Export with Sorting

```python
# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='timestamp', ascending=True)

# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='score', ascending=False)
```
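
Conceptually, both `sort` and `shuffle` decide the order in which row indices are written to the file. The following is a pure-Python sketch of that idea, not a description of vaex's internals. `export_order` is a hypothetical helper, treating the two options as alternatives is an assumption, and the seed exists only to make the shuffle reproducible.

```python
import random

def export_order(values, sort=False, ascending=True, shuffle=False, seed=None):
    """Return row indices in the order they would be written."""
    indices = list(range(len(values)))
    if sort:
        # Order the indices by the value each one points to
        indices.sort(key=lambda i: values[i], reverse=not ascending)
    elif shuffle:
        # Random permutation of the row indices
        random.Random(seed).shuffle(indices)
    return indices

scores = [0.7, 0.1, 0.9, 0.4]
ascending_order = export_order(scores, sort=True)
descending_order = export_order(scores, sort=True, ascending=False)
shuffled_order = export_order(scores, shuffle=True, seed=0)
print(ascending_order, descending_order, shuffled_order)
```

Every column is then written using the same index order, so rows stay aligned across columns.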

### Export Configuration

```python
# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')

# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)

# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)

# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)
```

### Using DataFrame Export Method

```python
# DataFrames have an export method that calls these functions
df.export('output.hdf5')  # Uses export_hdf5 internally

# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)
```

### Legacy Format Export

```python
# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')

# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5',
                                shuffle=True, progress=True)
```

## Constants

```python { .api }
max_length = 100000     # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value
```

## Data Type Support

The export functions support all vaex data types:

- **Numeric types**: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
- **String types**: Variable-length strings with efficient storage
- **Date/time types**: datetime64 with nanosecond precision
- **Boolean types**: Stored as uint8
- **Categorical types**: Dictionary-encoded strings
- **Sparse matrices**: CSR format sparse data
- **Masked arrays**: Arrays with missing value support

## Export Behavior

### Column Order Preservation

The export functions preserve column order and store it as metadata in the HDF5 file:

```python
import vaex
import vaex.hdf5.export

# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved
```

### Memory Efficiency

Both export functions use streaming processing to handle datasets larger than available memory:

- Data is processed in chunks to minimize memory usage
- Memory mapping is used when possible for optimal performance
- Temporary files are avoided through direct HDF5 writing
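
Chunked processing amounts to walking the rows in fixed-size ranges so that only one slice is held in memory at a time. A minimal sketch, assuming the chunk size is bounded by the documented `max_length` constant (how vaex actually applies that constant is not confirmed here). `chunk_ranges` is a hypothetical helper.

```python
MAX_LENGTH = 100000  # mirrors the documented max_length value

def chunk_ranges(n_rows, chunk_size=MAX_LENGTH):
    """Yield (start, stop) pairs that cover range(n_rows) without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        # The last chunk is shortened so it never runs past n_rows
        yield start, min(start + chunk_size, n_rows)

ranges = list(chunk_ranges(250000))
print(ranges)
```

Each `(start, stop)` pair would be used to read one slice of a source column and write it to the matching region of the output array.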

### Metadata Preservation

Export functions preserve DataFrame metadata:

- Column descriptions and units
- Custom metadata and properties
- Data provenance information (user, timestamp, source)
202
203
## Error Handling
204
205
Export functions may raise:
206
207
- `ValueError`: If the dataset is empty or has invalid parameters
208
- `OSError`: For file system errors (permissions, disk space)
209
- `h5py.H5Error`: For HDF5 format or writing errors
210
- `MemoryError`: If insufficient memory for processing
211
- `KeyboardInterrupt`: If user cancels during progress callback
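
One way to handle these errors is to wrap the export call and turn failures into a result the caller can inspect. The sketch below takes the export function as an argument so that it runs without vaex installed. `try_export` and `fake_export` are hypothetical names; in real code you would pass `vaex.hdf5.export.export_hdf5` instead of the stand-in.

```python
def try_export(export_fn, dataset, path, **options):
    """Call an export function and return (ok, message) instead of raising."""
    try:
        export_fn(dataset, path, **options)
    except ValueError as exc:
        # For example, an empty dataset
        return False, f"invalid input: {exc}"
    except OSError as exc:
        # Permissions, missing directory, disk full
        return False, f"file system error: {exc}"
    return True, "exported"

def fake_export(dataset, path, **options):
    """Stand-in export function that rejects empty input."""
    if len(dataset) == 0:
        raise ValueError("Cannot export empty table")

empty_result = try_export(fake_export, [], "output.hdf5")
ok_result = try_export(fake_export, [1, 2, 3], "output.hdf5", shuffle=True)
print(empty_result, ok_result)
```

`KeyboardInterrupt` and `MemoryError` are deliberately left uncaught here so that a user cancellation or an out-of-memory condition still propagates to the caller.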