
tessl/pypi-vaex-hdf5

HDF5 file support for vaex DataFrame library with memory-mapped access and specialized format readers


Data Export

High-level functions for exporting vaex DataFrames to HDF5 format with support for both version 1 and version 2 formats, various export options, and progress tracking.

Capabilities

HDF5 Version 2 Export

The main export function. It writes the version 2 layout and, unlike the version 1 exporter, also accepts the sort, ascending, and parallel parameters.

def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False, 
                selection=False, progress=None, virtual=True, sort=None, 
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.
    
    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.
    
    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)
    
    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

HDF5 Version 1 Export

Legacy export function for compatibility with older vaex versions.

def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False, 
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.
    
    Legacy export function for compatibility. Use export_hdf5() for new projects.
    
    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    
    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

Usage Examples

Basic Export

import vaex
import vaex.hdf5.export  # explicitly load the export submodule used below

# Load DataFrame
df = vaex.from_csv('input.csv')

# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')

# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', 
                             column_names=['col1', 'col2', 'col3'])

Export with Options

# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5', 
                             progress=progress_callback)

# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)

# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)

# Export selection only
df.select(df.score > 0.5)  # Create the default selection on df
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)
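
The callback above prints on every update, which can flood the console on large exports. A minimal sketch of a throttled variant follows. The names make_progress_callback, min_interval, clock, and sink are hypothetical helpers for illustration and are not part of vaex. It keeps the convention from the example above of returning True to continue.

```python
import time


def make_progress_callback(min_interval=0.5, clock=time.monotonic, sink=print):
    """Build a progress callback that reports at most once per min_interval seconds.

    All names here are illustrative assumptions, not vaex API.
    """
    last_report = [None]  # mutable cell so the closure can update it

    def callback(fraction):
        now = clock()
        finished = fraction >= 1.0
        due = last_report[0] is None or now - last_report[0] >= min_interval
        if due or finished:
            sink(f"Export progress: {fraction * 100:.1f}%")
            last_report[0] = now
        return True  # continue processing, as in the example above

    return callback
```

Under these assumptions it would be passed like any other callback, for example progress=make_progress_callback(min_interval=1.0). The clock and sink parameters exist only to make the helper easy to test without waiting or printing.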

Export with Sorting

# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5', 
                             sort='timestamp', ascending=True)

# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5', 
                             sort='score', ascending=False)

Export Configuration

# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')

# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)

# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)

# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)
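
The byteorder characters ("=", "<", ">") are the same ones used by NumPy dtypes and Python's struct module. As an illustration of what they mean for stored bytes, here is a small stdlib-only sketch. It does not call vaex and only shows the byte layout of one 32-bit integer.

```python
import struct
import sys

value = 1

little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first
native = struct.pack("=i", value)  # byte order of the current machine

print(little.hex())  # 01000000
print(big.hex())     # 00000001

# "=" matches whichever of the two layouts this machine uses
expected_native = little if sys.byteorder == "little" else big
print(native == expected_native)  # True
```

A file written with a non-native byte order stays readable, but readers on the other architecture may need to swap bytes when accessing the data.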

Using DataFrame Export Method

# DataFrames have an export method that calls these functions
df.export('output.hdf5')  # Uses export_hdf5 internally

# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)

Legacy Format Export

# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')

# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5', 
                                shuffle=True, progress=True)

Constants

max_length = 100000  # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value
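
To illustrate what a maximum chunk size implies, the sketch below splits a row count into consecutive ranges of at most 100000 rows. The function chunk_ranges is a hypothetical helper written for this illustration. It is not the loop the library uses internally.

```python
def chunk_ranges(total_rows, chunk_size=100_000):
    """Yield (start, stop) row ranges that cover total_rows in order.

    chunk_size defaults to the max_length value listed above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


print(list(chunk_ranges(250_000)))
# [(0, 100000), (100000, 200000), (200000, 250000)]
```

Only the last range is shorter than the chunk size, so peak memory per step is bounded by the chunk size rather than by the total row count.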

Data Type Support

The export functions support all vaex data types:

  • Numeric types: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
  • String types: Variable-length strings with efficient storage
  • Date/time types: datetime64 with nanosecond precision
  • Boolean types: Stored as uint8
  • Categorical types: Dictionary-encoded strings
  • Sparse matrices: CSR format sparse data
  • Masked arrays: Arrays with missing value support

Export Behavior

Column Order Preservation

The export functions preserve column order and store it as metadata in the HDF5 file:

# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved

Memory Efficiency

Both export functions use streaming processing to handle datasets larger than available memory:

  • Data is processed in chunks to minimize memory usage
  • Memory mapping is used when possible for optimal performance
  • Temporary files are avoided through direct HDF5 writing

Metadata Preservation

Export functions preserve DataFrame metadata:

  • Column descriptions and units
  • Custom metadata and properties
  • Data provenance information (user, timestamp, source)

Error Handling

Export functions may raise:

  • ValueError: If the dataset is empty or a parameter is invalid
  • OSError: For file system errors (permissions, disk space)
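  • h5py errors: For HDF5 format or writing errors (recent h5py releases generally raise built-in exceptions such as OSError rather than a dedicated h5py.H5Error class)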
  • MemoryError: If there is insufficient memory for processing
  • KeyboardInterrupt: If the user cancels during a progress callback
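
One way to handle the recoverable cases in the list above is a thin wrapper around the export call. The sketch below is an assumption-laden illustration: try_export is a hypothetical helper, and it takes the export function as an argument so it works with either exporter. It catches only ValueError and OSError and lets MemoryError and KeyboardInterrupt propagate.

```python
def try_export(export_fn, df, path, **options):
    """Call export_fn(df, path, **options) and return (ok, message).

    try_export is an illustrative helper, not part of vaex.
    """
    try:
        export_fn(df, path, **options)
    except ValueError as exc:
        # e.g. an empty dataset
        return False, f"invalid input: {exc}"
    except OSError as exc:
        # e.g. permissions or disk space
        return False, f"file system error: {exc}"
    return True, f"exported to {path}"
```

With vaex installed, a call would look like try_export(vaex.hdf5.export.export_hdf5, df, 'output.hdf5', progress=True). Returning a status tuple instead of re-raising is a design choice for batch jobs. Interactive code may prefer to let the exceptions surface.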

Install with Tessl CLI

npx tessl i tessl/pypi-vaex-hdf5
