HDF5 file support for vaex DataFrame library with memory-mapped access and specialized format readers
High-level functions for exporting vaex DataFrames to HDF5 format with support for both version 1 and version 2 formats, various export options, and progress tracking.
The main export function. It writes the latest HDF5 format (version 2) and supports sorting and parallel processing.
def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False,
                selection=False, progress=None, virtual=True, sort=None,
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.

    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

Legacy export function for compatibility with older vaex versions.
def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False,
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.

    Legacy export function for compatibility. Use export_hdf5() for new projects.

    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)

    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

import vaex
import vaex.hdf5.export  # make sure the export submodule is loaded
# Load DataFrame
df = vaex.from_csv('input.csv')
# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')
# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             column_names=['col1', 'col2', 'col3'])

# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5',
                             progress=progress_callback)
# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)
# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)
# Export selection only
df.select(df.score > 0.5)  # Create the default selection
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)

# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='timestamp', ascending=True)
# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5',
                             sort='score', ascending=False)

# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')
# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)
# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)
# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)

# DataFrames have an export method that calls these functions
df.export('output.hdf5') # Uses export_hdf5 internally
# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)

# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')
# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5',
                                shuffle=True, progress=True)

max_length = 100000  # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value

The export functions support all vaex data types.
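The byteorder characters accepted by the export functions ("=", "<", ">") are the same characters Python's standard struct module uses. As a rough illustration that does not involve vaex at all, the sketch below packs the max_int32 value as a 4-byte integer in each order so the difference in byte layout is visible. The variable names little, big, and native are illustrative only.

```python
import struct
import sys

max_int32 = 2147483647  # same value as the constant listed above

# Pack the value as a 4-byte signed integer in each byte order.
little = struct.pack("<i", max_int32)
big = struct.pack(">i", max_int32)
native = struct.pack("=i", max_int32)

print(little.hex())  # ffffff7f
print(big.hex())     # 7fffffff

# "=" resolves to whichever order the current machine uses.
print(native == (little if sys.byteorder == "little" else big))  # True
```

Native order ("=") avoids byte swapping on the machine that wrote the file, which is presumably why it is the default.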
The export functions preserve column order and store it as metadata in the HDF5 file:
# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved

Both export functions use streaming processing to handle datasets larger than available memory.
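To make the idea of streaming concrete, here is a minimal sketch of block-wise processing. It is not vaex's actual implementation. It assumes rows are handled in blocks of at most max_length rows and that a progress callback returning False requests cancellation, as the "Continue processing" comment in the callback example suggests. The helpers chunk_ranges and process_in_chunks are hypothetical names introduced for illustration.

```python
def chunk_ranges(n_rows, max_length=100_000):
    """Yield (start, stop) index pairs covering n_rows in blocks of at most max_length."""
    for start in range(0, n_rows, max_length):
        yield start, min(start + max_length, n_rows)


def process_in_chunks(n_rows, handle_chunk, progress=None, max_length=100_000):
    """Call handle_chunk(start, stop) per block; stop early if progress() returns False.

    Returns the number of rows processed.
    """
    done = 0
    for start, stop in chunk_ranges(n_rows, max_length):
        handle_chunk(start, stop)  # only this block needs to be in memory
        done = stop
        if progress is not None and progress(done / n_rows) is False:
            break  # the callback asked to cancel
    return done


# Usage: record which blocks are visited for a 250,000-row table.
visited = []
fractions = []
rows_done = process_in_chunks(
    250_000,
    handle_chunk=lambda start, stop: visited.append((start, stop)),
    progress=lambda f: fractions.append(f) or True,  # always continue
)
print(visited)    # [(0, 100000), (100000, 200000), (200000, 250000)]
print(fractions)  # [0.4, 0.8, 1.0]
```

Because each block is handled independently, peak memory use depends on max_length rather than on the total number of rows.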
Export functions preserve DataFrame metadata.
Export functions may raise:
- ValueError: If the dataset is empty or has invalid parameters
- OSError: For file system errors (permissions, disk space)
- h5py.H5Error: For HDF5 format or writing errors
- MemoryError: If insufficient memory for processing
- KeyboardInterrupt: If user cancels during progress callback

Install with Tessl CLI
npx tessl i tessl/pypi-vaex-hdf5
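Returning to the exceptions listed under "Export functions may raise", one way to handle them is a small wrapper around the export call. The sketch below is an assumption-laden illustration, not part of vaex: try_export is a hypothetical helper, and fake_export is a stand-in so the example runs without vaex installed. The h5py-specific error is left out to keep the sketch free of third-party imports.

```python
def try_export(export_fn, *args, **kwargs):
    """Run export_fn and translate common failures into an (ok, message) pair."""
    try:
        export_fn(*args, **kwargs)
    except ValueError as exc:
        return False, f"invalid input: {exc}"
    except OSError as exc:
        return False, f"file system error: {exc}"
    except MemoryError:
        return False, "not enough memory"
    except KeyboardInterrupt:
        return False, "cancelled by user"
    return True, "ok"


# Stand-in for an export function (hypothetical, for demonstration only).
def fake_export(rows, path):
    if not rows:
        raise ValueError("Cannot export empty table")


print(try_export(fake_export, [], "output.hdf5"))
print(try_export(fake_export, [1, 2], "output.hdf5"))
```

With vaex installed, the same wrapper could be called as try_export(vaex.hdf5.export.export_hdf5, df, 'output.hdf5').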