Tessl Tile for pypi/vaex-hdf5@0.14.0

# Data Export

High-level functions for exporting vaex DataFrames to HDF5, with support for both the version 1 and version 2 formats, several export options, and progress tracking.

## Capabilities

### HDF5 Version 2 Export

The main export function. It writes the version 2 format and adds sorting and parallel processing on top of the options shared with the version 1 exporter.

```python { .api }
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.

    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """
```

### HDF5 Version 1 Export

Legacy export function for compatibility with older vaex versions.

```python { .api }
def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.

    Legacy export function for compatibility. Use export_hdf5() for new projects.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """
```

## Usage Examples

### Basic Export

```python
import vaex
import vaex.hdf5.export  # import the export submodule explicitly

# Load DataFrame
df = vaex.from_csv('input.csv')

# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')

# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             column_names=['col1', 'col2', 'col3'])
```

### Export with Options

```python
# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             progress=progress_callback)

# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)

# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)

# Export selection only
df.select(df.score > 0.5)  # Create the default selection on df
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)
```
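The `progress` argument accepts a callable that receives the completed fraction and returns `True` to continue. As a minimal sketch under that assumption, the hypothetical helper `make_progress_logger` builds a callback that reports only when progress crosses a new step, which keeps the output short on large exports. The helper name, the `step` parameter, and the `sink` list are illustrative and not part of vaex.

```python
def make_progress_logger(step=0.25, sink=None):
    """Build a progress callback that reports once per `step` of progress.

    Hypothetical helper, not part of vaex. Reports are appended to `sink`
    (a list) if given, otherwise printed.
    """
    state = {"next": step}  # next fraction at which to report

    def callback(fraction):
        if fraction >= state["next"] or fraction >= 1.0:
            message = f"Export progress: {fraction * 100:.0f}%"
            if sink is None:
                print(message)
            else:
                sink.append(message)
            # Skip past every threshold that has already been crossed
            while state["next"] <= fraction:
                state["next"] += step
        return True  # returning True asks the exporter to continue

    return callback


# Simulate the fractions an exporter might report
reports = []
cb = make_progress_logger(step=0.25, sink=reports)
results = [cb(f) for f in (0.1, 0.2, 0.3, 0.6, 0.65, 1.0)]
```

A callback built this way could then be passed as `progress=make_progress_logger(0.1)`.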

### Export with Sorting

```python
# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='timestamp', ascending=True)

# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='score', ascending=False)
```
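One way to picture `sort` and `ascending` is as a permutation of row indices computed from the sort column, with every column then written in that order. The pure-Python sketch below illustrates the idea only. `row_order` is a hypothetical helper, and nothing here is taken from the vaex implementation.

```python
def row_order(values, ascending=True):
    """Return the row indices that visit `values` in sorted order.

    Hypothetical helper for illustration; not the vaex implementation.
    """
    return sorted(range(len(values)), key=values.__getitem__,
                  reverse=not ascending)


timestamps = [30, 10, 20]
scores = [0.9, 0.1, 0.5]

order = row_order(timestamps)                # row indices sorted by timestamp
sorted_scores = [scores[i] for i in order]   # other columns follow the same order
descending = row_order(timestamps, ascending=False)
```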

### Export Configuration

```python
# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')

# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)

# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)

# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)
```
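The `byteorder` characters `"="`, `"<"`, and `">"` also appear in the format strings of Python's `struct` module, which makes it easy to see what each one means for the bytes on disk. The following standard-library sketch is independent of vaex and only illustrates the byte layouts.

```python
import struct
import sys

value = 1

little = struct.pack("<i", value)  # little endian: least significant byte first
big = struct.pack(">i", value)     # big endian: most significant byte first
native = struct.pack("=i", value)  # native byte order of this machine

# "=" resolves to whichever order the current platform uses
expected_native = little if sys.byteorder == "little" else big
```

Native order is usually the right choice unless the file must be read efficiently on a machine with a different byte order.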

### Using DataFrame Export Method

```python
# DataFrames have an export method that calls these functions
df.export('output.hdf5')  # Uses export_hdf5 internally

# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)
```

### Legacy Format Export

```python
# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')

# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5',
                                shuffle=True, progress=True)
```

## Constants

```python { .api }
max_length = 100000  # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value
```

## Data Type Support

The export functions support all vaex data types:

- **Numeric types**: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
- **String types**: Variable-length strings with efficient storage
- **Date/time types**: datetime64 with nanosecond precision
- **Boolean types**: Stored as uint8
- **Categorical types**: Dictionary-encoded strings
- **Sparse matrices**: CSR format sparse data
- **Masked arrays**: Arrays with missing value support

## Export Behavior

### Column Order Preservation

The export functions preserve column order and store it as metadata in the HDF5 file:

```python
# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved
```

### Memory Efficiency

Both export functions use streaming processing to handle datasets larger than available memory:

- Data is processed in chunks to minimize memory usage
- Memory mapping is used when possible for optimal performance
- Temporary files are avoided through direct HDF5 writing
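Chunked processing can be sketched as walking the rows in fixed-size ranges, so that only one range needs to be held in memory at a time. The generator below is a hypothetical illustration, not the vaex implementation. Its default `chunk_size` reuses the value of the `max_length` constant purely as an assumption about how such a limit might be applied.

```python
def chunk_ranges(total_rows, chunk_size=100_000):
    """Yield (start, stop) row ranges that cover `total_rows` in order.

    Hypothetical helper for illustration; the default chunk size mirrors
    the documented max_length constant only as an assumption.
    """
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


ranges = list(chunk_ranges(250_000))
largest = max(stop - start for start, stop in ranges)  # peak rows held at once
```

With ranges like these, each slice of a column can be read, converted, and written before the next one is touched.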

### Metadata Preservation

Export functions preserve DataFrame metadata:

- Column descriptions and units
- Custom metadata and properties
- Data provenance information (user, timestamp, source)

## Error Handling

Export functions may raise:

- `ValueError`: If the dataset is empty or a parameter is invalid
- `OSError`: For file system errors (permissions, disk space)
- h5py errors: For HDF5 format or writing errors (current h5py releases report these as built-in exceptions such as `OSError` or `RuntimeError` and do not expose an `h5py.H5Error` class)
- `MemoryError`: If there is insufficient memory for processing
- `KeyboardInterrupt`: If the user cancels during a progress callback
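A caller might want to turn these failures into a simple status instead of letting them propagate. The wrapper below is a minimal sketch of that pattern. `try_export` and `fake_export` are hypothetical names introduced for illustration, and `export_fn` stands in for a function such as `export_hdf5`. `KeyboardInterrupt` is deliberately not caught so that a user cancellation still stops the program.

```python
def try_export(export_fn, dataset, path, **options):
    """Call `export_fn` and return a (success, message) pair.

    Hypothetical wrapper for illustration; `export_fn` stands in for a
    function such as export_hdf5.
    """
    try:
        export_fn(dataset, path, **options)
    except ValueError as exc:   # e.g. empty dataset or invalid parameter
        return False, f"invalid input: {exc}"
    except OSError as exc:      # e.g. permissions or disk space
        return False, f"file system error: {exc}"
    except MemoryError:
        return False, "not enough memory to export"
    return True, f"wrote {path}"


# Stand-in exporter used only to exercise the wrapper
def fake_export(dataset, path, **options):
    if len(dataset) == 0:
        raise ValueError("cannot export empty table")


ok = try_export(fake_export, [1, 2, 3], "output.hdf5")
failed = try_export(fake_export, [], "output.hdf5")
```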
