# vaex-hdf5

HDF5 file support for the vaex DataFrame library, with memory-mapped access and specialized format readers.

Low-level utilities for memory-mapping HDF5 datasets and arrays, with support for masked arrays, different storage layouts, and efficient zero-copy operations.

## mmap_array

Create memory-mapped arrays from file data, with support for different storage backends.
```python
def mmap_array(mmap, file, offset, dtype, shape):
    """
    Create a memory-mapped array from file data.

    Provides zero-copy access to file data through memory mapping,
    or file-based column access for non-mappable storage.

    Parameters:
    - mmap: memory map object (mmap.mmap) or None
    - file: file object for reading data
    - offset: byte offset in the file where the data starts
    - dtype: NumPy data type of the array
    - shape: tuple defining the array dimensions

    Returns:
    - numpy.ndarray: memory-mapped array (if mmap is provided)
    - ColumnFile: file-based column for remote/non-mappable storage

    Raises:
    - RuntimeError: if high-dimensional arrays are requested from non-local files
    """
```
## h5mmap

Memory-map HDF5 datasets, with support for data type conversion and masking.

```python
def h5mmap(mmap, file, data, mask=None):
    """
    Memory-map an HDF5 dataset with optional mask support.

    Handles HDF5-specific data layouts, attribute-based type conversion,
    and masked-array creation for datasets with missing values.

    Parameters:
    - mmap: memory map object, or None for non-mappable storage
    - file: file object for dataset access
    - data: HDF5 dataset to map
    - mask: optional HDF5 dataset containing mask data

    Returns:
    - numpy.ndarray: memory-mapped array for contiguous datasets
    - numpy.ma.MaskedArray: masked array if a mask is provided
    - ColumnNumpyLike: column wrapper for non-contiguous datasets
    - ColumnMaskedNumpy: masked column for non-contiguous masked data

    Notes:
    - Handles special dtypes from HDF5 attributes (e.g. UTF-32 strings)
    - Returns ColumnNumpyLike for chunked or non-contiguous datasets
    - Supports both numpy-style masks and Arrow-style null bitmaps
    """
```
## Examples

Memory-map a region of a raw binary file:

```python
import mmap

import numpy as np

from vaex.hdf5.utils import mmap_array, h5mmap

# Memory map a file region
with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Map 1000 float64 values starting at offset 0
        array = mmap_array(mm, f, 0, np.float64, (1000,))
        print(array.shape)  # (1000,)
        print(array.dtype)  # float64
```
Map an HDF5 dataset, optionally with a mask:

```python
import mmap

import h5py

from vaex.hdf5.utils import h5mmap

# Map HDF5 dataset
with open('data.hdf5', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with h5py.File(f, 'r') as h5f:
            dataset = h5f['table/columns/x/data']
            # Simple mapping
            array = h5mmap(mm, f, dataset)
            # With mask
            mask_dataset = h5f['table/columns/x/mask']
            masked_array = h5mmap(mm, f, dataset, mask_dataset)
```
For non-mappable storage, the functions return column wrappers instead of numpy arrays:

```python
# For non-mappable storage (S3, etc.)
array = mmap_array(None, file_handle, offset, dtype, shape)
# Returns ColumnFile instead of a numpy array

# HDF5 on remote storage
array = h5mmap(None, file_handle, hdf5_dataset)
# Returns ColumnNumpyLike for non-contiguous access
```
UTF-32 strings stored as raw bytes are recognized through dataset attributes:

```python
# UTF-32 strings stored as bytes
with h5py.File('strings.hdf5', 'r') as h5f:
    dataset = h5f['string_column/data']
    # Dataset has attributes: dtype="utf32", dlength=10
    array = h5mmap(mm, f, dataset)
    # Returns array with the correct UTF-32 dtype
```
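The UTF-32 reinterpretation can be sketched with plain numpy (an illustration of the conversion, assuming fixed-width strings stored as raw uint8 bytes; a width of 3 is used here instead of the dlength=10 above):

```python
import numpy as np

dlength = 3
# Fixed-width unicode strings; each '<U3' element occupies 12 bytes (UTF-32)
strings = np.array(["foo", "ba", "x"], dtype=f"<U{dlength}")

raw = strings.view(np.uint8)          # the raw bytes as stored on disk
recovered = raw.view(f"<U{dlength}")  # reinterpret back without copying
print(recovered)  # -> ['foo' 'ba' 'x']
```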
Empty datasets are handled as well:

```python
# Empty datasets (common in sparse data)
empty_dataset = h5f['empty_column/data']  # len(dataset) == 0
array = h5mmap(mm, f, empty_dataset)
# Handles the offset=None case gracefully
```
## Implementation Notes

The utilities use different strategies based on data characteristics:

- `numpy.frombuffer` for zero-copy access to mappable, contiguous data
- `ColumnNumpyLike` wrapper for lazy access
- `ColumnFile` for streaming access

Special handling for various data types:

- Datetime types → stored as int64 with a dtype attribute
- UTF-32 strings → stored as uint8 with special attributes
- Masked arrays → combined with mask datasets
- Arrow nulls → null bitmap integration

## Column Classes

```python
class ColumnFile:
    """
    File-based column for non-memory-mappable storage.

    Provides an array-like interface for data stored in files
    that cannot be memory mapped (remote storage, etc.).
    """
```
"""class ColumnNumpyLike:
"""
Wrapper for HDF5 datasets that behave like NumPy arrays.
Used for chunked or non-contiguous datasets that cannot
be directly memory mapped.
"""class ColumnMaskedNumpy:
"""
Masked column wrapper for non-contiguous masked data.
Combines ColumnNumpyLike data with mask arrays for
efficient missing value handling.
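The lazy-access idea these wrappers implement can be sketched with a toy class (a hypothetical `LazyColumn`, not the vaex implementation: reads are deferred to slicing time so non-contiguous data is never copied up front):

```python
import numpy as np

class LazyColumn:
    """Toy stand-in for ColumnNumpyLike: defers reads until indexed."""

    def __init__(self, dataset):
        # Any object supporting __getitem__/__len__, e.g. an h5py dataset
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        # Materialize only the requested slice
        return np.asarray(self.dataset[item])

col = LazyColumn(np.arange(10))
print(len(col))   # 10
print(col[2:5])   # [2 3 4]
```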
"""# 2D array mapping
shape = (1000, 3) # 1000 rows, 3 columns
array = mmap_array(mm, f, offset, np.float32, shape)
print(array.shape) # (1000, 3)
# High-dimensional arrays require local storage
try:
shape = (100, 10, 5)
array = mmap_array(None, remote_file, offset, dtype, shape)
except RuntimeError:
print("High-d arrays not supported for remote files")The utilities automatically handle:
## Error Handling

The utility functions may raise:

- `RuntimeError`: for unsupported operations (high-dimensional remote arrays)
- `ValueError`: for invalid parameters or inconsistent data
- `OSError`: for file access errors
- `h5py.H5Error`: for HDF5 dataset access errors
- `MemoryError`: if there is insufficient memory for mapping operations

```python
# Always use context managers for proper cleanup
with mmap.mmap(f.fileno(), 0) as mm:
    array = mmap_array(mm, f, offset, dtype, shape)
    # Use array...
# Memory map automatically closed
```
```python
# Check if data is contiguous before mapping
if dataset.id.get_offset() is not None:
    # Contiguous data - use memory mapping
    array = h5mmap(mm, f, dataset)
else:
    # Non-contiguous - expect a column wrapper
    column = h5mmap(None, f, dataset)
```
```python
try:
    array = h5mmap(mm, f, dataset, mask)
except OSError as e:
    # Handle file access errors
    print(f"Cannot access dataset: {e}")
except ValueError as e:
    # Handle data format errors
    print(f"Invalid dataset format: {e}")
```
## Installation

Install with the Tessl CLI:

```shell
npx tessl i tessl/pypi-vaex-hdf5
```