tessl/pypi-vaex-hdf5

HDF5 file support for vaex DataFrame library with memory-mapped access and specialized format readers


Data Export

Name: tessl/pypi-vaex-hdf5
Author: tessl

High-level functions for exporting vaex DataFrames to HDF5 format with support for both version 1 and version 2 formats, various export options, and progress tracking.

Capabilities

HDF5 Version 2 Export

The main export function. It writes the version 2 HDF5 layout and, unlike the version 1 exporter, supports sorting and parallel processing.

def export_hdf5(dataset, path, column_names=None, byteorder="=", shuffle=False, 
                selection=False, progress=None, virtual=True, sort=None, 
                ascending=True, parallel=True):
    """
    Export dataset to HDF5 version 2 format.
    
    This is the recommended export function supporting all modern features
    including parallel processing, sorting, and advanced data types.
    
    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    - sort: Column name to sort by (str)
    - ascending: Sort in ascending order (bool)
    - parallel: Use parallel processing (bool)
    
    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

HDF5 Version 1 Export

Legacy export function for compatibility with older vaex versions.

def export_hdf5_v1(dataset, path, column_names=None, byteorder="=", shuffle=False, 
                   selection=False, progress=None, virtual=True):
    """
    Export dataset to HDF5 version 1 format.
    
    Legacy export function for compatibility. Use export_hdf5() for new projects.
    
    Parameters:
    - dataset: DatasetLocal instance to export
    - path: Output file path (str)
    - column_names: List of column names to export (None for all columns)
    - byteorder: Byte order ("=" for native, "<" for little endian, ">" for big endian)
    - shuffle: Export rows in random order (bool)
    - selection: Export selection or all data (bool or selection name)
    - progress: Progress callback function or True for default progress bar
    - virtual: Export virtual columns (bool)
    
    Raises:
    ValueError: If dataset is empty (cannot export empty table)
    """

Usage Examples

Basic Export

import vaex
import vaex.hdf5.export  # explicit import of the export module used below

# Load DataFrame
df = vaex.from_csv('input.csv')

# Simple export
vaex.hdf5.export.export_hdf5(df, 'output.hdf5')

# Export specific columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', 
                             column_names=['col1', 'col2', 'col3'])

Export with Options

# Export with progress tracking
def progress_callback(fraction):
    print(f"Export progress: {fraction*100:.1f}%")
    return True  # Continue processing

vaex.hdf5.export.export_hdf5(df, 'output.hdf5', 
                             progress=progress_callback)

# Export with built-in progress bar
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', progress=True)

# Export with shuffled rows
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', shuffle=True)

# Export selection only
df.select(df.score > 0.5)  # Create the default selection on df
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', selection=True)
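
The progress callback shown above receives a fraction between 0 and 1 and returns True to continue. The following is a minimal sketch of that contract in plain Python, not vaex's implementation. The helper `run_with_progress` is hypothetical, and it assumes (as the "Continue processing" comment suggests) that a falsy return value requests cancellation.

```python
from typing import Callable, List


def run_with_progress(total_rows: int, chunk_size: int,
                      progress: Callable[[float], bool]) -> int:
    """Process rows chunk by chunk and return how many rows were handled.

    Hypothetical helper, not part of vaex. It reports the completed
    fraction after each chunk and stops early if the callback returns
    a falsy value.
    """
    done = 0
    for start in range(0, total_rows, chunk_size):
        done = min(start + chunk_size, total_rows)  # stand-in for writing a chunk
        if not progress(done / total_rows):
            break  # assumption: a falsy return value means "cancel"
    return done


seen: List[float] = []


def record(fraction: float) -> bool:
    """Example callback: remember each fraction, cancel at 50% or more."""
    seen.append(fraction)
    return fraction < 0.5


rows_written = run_with_progress(total_rows=10, chunk_size=2, progress=record)
print(rows_written, seen)
```

A callback that always returns True lets the loop run to completion.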

Export with Sorting

# Sort by column during export
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5', 
                             sort='timestamp', ascending=True)

# Sort in descending order
vaex.hdf5.export.export_hdf5(df, 'sorted_output.hdf5', 
                             sort='score', ascending=False)
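
Conceptually, both `sort` and `shuffle` determine the order in which rows are written. The sketch below illustrates that idea with plain Python lists. The helpers `row_order` and `reorder` are hypothetical and do not reflect how vaex computes the order internally. Giving `sort` precedence over `shuffle` is an arbitrary choice made for this illustration.

```python
import random
from typing import Dict, List, Optional


def row_order(columns: Dict[str, list], sort: Optional[str] = None,
              ascending: bool = True, shuffle: bool = False,
              seed: Optional[int] = None) -> List[int]:
    """Return the row indices in the order they would be written.

    Hypothetical helper for illustration only.
    """
    n_rows = len(next(iter(columns.values())))
    order = list(range(n_rows))
    if sort is not None:
        # Sort row indices by the values of the chosen column.
        order.sort(key=lambda i: columns[sort][i], reverse=not ascending)
    elif shuffle:
        # Random permutation of the row indices.
        random.Random(seed).shuffle(order)
    return order


def reorder(columns: Dict[str, list], order: List[int]) -> Dict[str, list]:
    """Apply the same row order to every column so rows stay aligned."""
    return {name: [values[i] for i in order] for name, values in columns.items()}


data = {"score": [0.3, 0.9, 0.5], "name": ["a", "b", "c"]}
order = row_order(data, sort="score", ascending=False)
sorted_data = reorder(data, order)
print(sorted_data)
```

Applying one index order to all columns is what keeps each row's values together after sorting or shuffling.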

Export Configuration

# Big endian byte order
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', byteorder='>')

# Disable parallel processing
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', parallel=False)

# Include virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=True)

# Exclude virtual columns
vaex.hdf5.export.export_hdf5(df, 'output.hdf5', virtual=False)
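
The `byteorder` codes `"="`, `"<"` and `">"` follow the same convention as Python's `struct` module. As a small illustration of what the codes mean (independent of vaex), the snippet below packs the same 32-bit integer in each byte order:

```python
import struct
import sys

value = 1

little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first
native = struct.pack("=i", value)  # whatever this machine uses

# On a little-endian machine "=" matches "<"; on a big-endian one it matches ">".
expected_native = little if sys.byteorder == "little" else big

print(little, big, native == expected_native)
```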

Using DataFrame Export Method

# DataFrames have an export method that calls these functions
df.export('output.hdf5')  # Uses export_hdf5 internally

# Export with options via DataFrame
df.export('output.hdf5', shuffle=True, progress=True)

Legacy Format Export

# Export to version 1 format for compatibility
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5')

# Version 1 with options
vaex.hdf5.export.export_hdf5_v1(df, 'legacy_output.hdf5', 
                                shuffle=True, progress=True)

Constants

max_length = 100000  # Maximum processing chunk size
max_int32 = 2147483647  # Maximum 32-bit integer value
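
This page does not describe how the library uses these constants internally. As a rough sketch of what such limits are typically for, the hypothetical helpers below split a row range into chunks of at most `max_length` rows and pick an offset width depending on whether a byte count still fits in a signed 32-bit integer. Both helpers are assumptions made for illustration, not vaex functions.

```python
from typing import Iterator, Tuple

max_length = 100000     # maximum processing chunk size (from the constants above)
max_int32 = 2147483647  # maximum 32-bit integer value, i.e. 2**31 - 1


def chunk_ranges(total_rows: int,
                 chunk_size: int = max_length) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges of at most chunk_size rows (hypothetical)."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


def offset_dtype(total_bytes: int) -> str:
    """Pick an offset width: 32-bit while it fits, 64-bit otherwise (hypothetical)."""
    return "int32" if total_bytes <= max_int32 else "int64"


ranges = list(chunk_ranges(250000))
print(ranges)
print(offset_dtype(max_int32), offset_dtype(max_int32 + 1))
```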

Data Type Support

The export functions support all vaex data types:

- Numeric types: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
- String types: Variable-length strings with efficient storage
- Date/time types: datetime64 with nanosecond precision
- Boolean types: Stored as uint8
- Categorical types: Dictionary-encoded strings
- Sparse matrices: CSR format sparse data
- Masked arrays: Arrays with missing value support
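
Two of the storage choices listed above can be illustrated in plain Python. The sketch below shows the general idea of dictionary encoding (each distinct string is stored once and rows hold integer codes) and of storing booleans as one unsigned byte each. The function names are hypothetical and this is not the on-disk layout vaex uses.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[int], List[str]]:
    """Return (codes, labels): labels holds each distinct string once,
    codes holds one integer per row pointing into labels."""
    labels: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(labels)
            labels.append(value)
        codes.append(index[value])
    return codes, labels


def dictionary_decode(codes: List[int], labels: List[str]) -> List[str]:
    """Rebuild the original strings from codes and labels."""
    return [labels[code] for code in codes]


colors = ["red", "blue", "red", "green", "blue"]
codes, labels = dictionary_encode(colors)

# Booleans stored as one unsigned byte per value (0 or 1).
flags = [True, False, True]
flag_bytes = bytes(int(flag) for flag in flags)

print(codes, labels, flag_bytes)
```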

Export Behavior

Column Order Preservation

The export functions preserve column order and store it as metadata in the HDF5 file:

# Original column order is maintained
df = vaex.from_dict({'z': [3], 'a': [1], 'm': [2]})
vaex.hdf5.export.export_hdf5(df, 'ordered.hdf5')
df2 = vaex.open('ordered.hdf5')
print(df2.column_names)  # ['z', 'a', 'm'] - order preserved

Memory Efficiency

Both export functions use streaming processing to handle datasets larger than available memory:

- Data is processed in chunks to minimize memory usage
- Memory mapping is used when possible for optimal performance
- Temporary files are avoided through direct HDF5 writing
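
To make the chunking and memory-mapping ideas concrete, here is a minimal sketch using only the standard library. It writes a column of 64-bit integers to a raw binary file in small chunks, then reads a single value back through `mmap` without loading the whole file. The raw file layout and the names used are assumptions made for this illustration. Real exports go through HDF5 rather than a bare binary file.

```python
import mmap
import os
import struct
import tempfile

CHUNK = 4       # rows written per chunk (tiny, for demonstration)
ITEM_SIZE = 8   # bytes per little-endian int64

values = list(range(10))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "column.bin")

    # Write the column chunk by chunk, so only CHUNK values are packed at once.
    with open(path, "wb") as f:
        for start in range(0, len(values), CHUNK):
            chunk = values[start:start + CHUNK]
            f.write(struct.pack(f"<{len(chunk)}q", *chunk))

    # Memory-map the file and read one value directly from its byte offset.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_rows = len(mm) // ITEM_SIZE
            (value_at_7,) = struct.unpack_from("<q", mm, 7 * ITEM_SIZE)

print(n_rows, value_at_7)
```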

Metadata Preservation

Export functions preserve DataFrame metadata:

- Column descriptions and units
- Custom metadata and properties
- Data provenance information (user, timestamp, source)

Error Handling

Export functions may raise:

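- ValueError: If the dataset is empty or a parameter is invalid
- OSError: For file system errors (permissions, disk space)
- h5py errors: For HDF5 format or writing errors; current h5py releases generally raise these as built-in exceptions such as OSError or RuntimeError rather than a dedicated h5py.H5Error class
- MemoryError: If there is insufficient memory for processing
- KeyboardInterrupt: If the user cancels during a progress callback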
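
One way to handle the exceptions listed above is a small wrapper around the export call. The sketch below is hypothetical: `try_export` is not part of vaex, and it accepts any zero-argument callable so that it can be demonstrated without an actual DataFrame. KeyboardInterrupt is deliberately not caught, so a user cancellation still propagates.

```python
from typing import Callable, Optional


def try_export(export: Callable[[], None]) -> Optional[str]:
    """Run an export callable and return an error message, or None on success.

    Hypothetical helper for illustration only.
    """
    try:
        export()
    except ValueError as exc:
        return f"invalid input: {exc}"
    except OSError as exc:
        return f"file system error: {exc}"
    except MemoryError:
        return "out of memory"
    return None


def export_empty() -> None:
    """Stand-in for an export call on an empty dataset."""
    raise ValueError("Cannot export empty table")


def export_to_readonly_location() -> None:
    """Stand-in for an export call that hits a permission problem."""
    raise PermissionError("permission denied")


print(try_export(export_empty))
print(try_export(export_to_readonly_location))
print(try_export(lambda: None))
```

With vaex installed, the callable could be `lambda: vaex.hdf5.export.export_hdf5(df, 'output.hdf5')`.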

Install with Tessl CLI

npx tessl i tessl/pypi-vaex-hdf5
