
tessl/pypi-vaex-hdf5

HDF5 file support for vaex DataFrame library with memory-mapped access and specialized format readers



Memory Mapping Utilities

Low-level utilities for memory mapping HDF5 datasets and arrays with support for masked arrays, different storage layouts, and efficient zero-copy operations.

Capabilities

Array Memory Mapping

Create memory-mapped arrays from file data with support for different storage backends.

def mmap_array(mmap, file, offset, dtype, shape):
    """
    Create memory-mapped array from file data.
    
    Provides zero-copy access to file data through memory mapping
    or file-based column access for non-mappable storage.
    
    Parameters:
    - mmap: Memory map object (mmap.mmap) or None
    - file: File object for reading data
    - offset: Byte offset in file where data starts
    - dtype: NumPy data type of the array
    - shape: Tuple defining array dimensions
    
    Returns:
    - numpy.ndarray: Memory-mapped array (if mmap provided)
    - ColumnFile: File-based column for remote/non-mappable storage
    
    Raises:
    RuntimeError: If high-dimensional arrays are requested from non-local files
    """

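To make the zero-copy path concrete, here is a minimal sketch of viewing a byte range of a mapped file as an array with numpy.frombuffer. It is not the library's code: map_region is a hypothetical helper, and the 16-byte header and file contents are invented for the demonstration.

```python
import mmap
import os
import tempfile

import numpy as np


def map_region(mm, offset, dtype, shape):
    """Hypothetical helper: view part of a mapped file as an ndarray without copying."""
    count = int(np.prod(shape))
    # frombuffer wraps the mapped bytes directly; pages are loaded only on access
    flat = np.frombuffer(mm, dtype=np.dtype(dtype), count=count, offset=offset)
    return flat.reshape(shape)


# Build a demo file: a 16-byte header followed by 1000 float64 values
values = np.arange(1000, dtype=np.float64)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 16)
    tmp.write(values.tobytes())
    path = tmp.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    array = map_region(mm, 16, np.float64, (1000,))
    total = float(array.sum())
    is_view = not array.flags.owndata  # the array borrows the mapped buffer
    del array                          # release the view before closing the map
    mm.close()
os.remove(path)
```

The view must be released before the map is closed; Python refuses to close an mmap object while a buffer exported from it is still alive.
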
HDF5 Dataset Memory Mapping

Memory map HDF5 datasets with support for data type conversion and masking.

def h5mmap(mmap, file, data, mask=None):
    """
    Memory map HDF5 dataset with optional mask support.
    
    Handles HDF5-specific data layouts, attribute-based type conversion,
    and masked array creation for datasets with missing values.
    
    Parameters:
    - mmap: Memory map object or None for non-mappable storage
    - file: File object for dataset access
    - data: HDF5 dataset to map
    - mask: Optional HDF5 dataset containing mask data
    
    Returns:
    - numpy.ndarray: Memory-mapped array for contiguous datasets
    - numpy.ma.MaskedArray: Masked array if mask provided
    - ColumnNumpyLike: Column wrapper for non-contiguous datasets
    - ColumnMaskedNumpy: Masked column for non-contiguous masked data
    
    Notes:
    - Handles special dtypes from HDF5 attributes (e.g., UTF-32 strings)
    - Returns ColumnNumpyLike for chunked or non-contiguous datasets
    - Supports both numpy-style masks and Arrow-style null bitmaps
    """

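As an illustration of the two mask flavours mentioned in the notes, the sketch below builds a numpy.ma.MaskedArray first from a byte mask and then from a validity bitmap. It is not the h5mmap implementation: the sample values are made up, and the bitmap decoding assumes Arrow's convention that a set bit marks a valid value, least-significant bit first.

```python
import numpy as np

data = np.array([1.5, 2.5, -999.0, 4.0])

# NumPy-style mask: one byte per element, nonzero marks a missing value
byte_mask = np.array([0, 0, 1, 0], dtype=np.uint8)
masked = np.ma.MaskedArray(data, mask=byte_mask.astype(bool))

# Arrow-style validity bitmap: one bit per element, a set bit marks a valid value
bitmap = np.array([0b00001011], dtype=np.uint8)
valid = np.unpackbits(bitmap, bitorder="little")[: len(data)].astype(bool)
masked_from_bitmap = np.ma.MaskedArray(data, mask=~valid)

mean_valid = float(masked.mean())               # the masked -999.0 is ignored
n_missing = int(masked_from_bitmap.mask.sum())  # number of masked entries
```

Both constructions mask the same element, so reductions such as mean() skip the placeholder value.
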
Usage Examples

Basic Memory Mapping

import mmap
from vaex.hdf5.utils import mmap_array, h5mmap
import numpy as np

# Memory map a file region
with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Map 1000 float64 values starting at offset 0
        array = mmap_array(mm, f, 0, np.float64, (1000,))
        print(array.shape)  # (1000,)
        print(array.dtype)  # float64
        del array  # release the view so the map can close without BufferError

HDF5 Dataset Mapping

import mmap
import h5py
from vaex.hdf5.utils import h5mmap

# Map HDF5 dataset
with open('data.hdf5', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with h5py.File(f, 'r') as h5f:
            dataset = h5f['table/columns/x/data']
            
            # Simple mapping
            array = h5mmap(mm, f, dataset)
            
            # With mask
            mask_dataset = h5f['table/columns/x/mask']
            masked_array = h5mmap(mm, f, dataset, mask_dataset)
            # Drop array references before the map closes to avoid BufferError
            del array, masked_array

Remote Storage Mapping

# For non-mappable storage (S3, etc.)
array = mmap_array(None, file_handle, offset, dtype, shape)
# Returns ColumnFile instead of numpy array

# HDF5 on remote storage
array = h5mmap(None, file_handle, hdf5_dataset)
# Returns ColumnNumpyLike for non-contiguous access

Working with Special Data Types

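import mmap
import h5py
from vaex.hdf5.utils import h5mmap
# UTF-32 strings stored as bytes
with open('strings.hdf5', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with h5py.File(f, 'r') as h5f:
            dataset = h5f['string_column/data']
            # Dataset has attributes: dtype="utf32", dlength=10
            array = h5mmap(mm, f, dataset)
            # Returns array with correct UTF-32 dtype
            del array  # release the view before the map closes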

Handling Empty Arrays

# Empty datasets (common in sparse data)
empty_dataset = h5f['empty_column/data']  # len(dataset) == 0
array = h5mmap(mm, f, empty_dataset)
# Handles offset=None case gracefully

Implementation Details

Memory Mapping Strategy

The utilities use different strategies based on data characteristics:

  1. Contiguous data → Direct memory mapping via numpy.frombuffer
  2. Non-contiguous data → ColumnNumpyLike wrapper for lazy access
  3. Remote storage → ColumnFile for streaming access
  4. Chunked datasets → Column wrappers with decompression
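
The list above can be read as a small decision procedure. The sketch below is one possible reading, not the library's code: choose_strategy is a hypothetical helper, and the order of the checks is an assumption based on the usage examples, where a dataset without a contiguous file offset yields a column wrapper even when no memory map is available.

```python
def choose_strategy(offset, has_mmap):
    """Hypothetical helper: pick an access strategy for a dataset.

    offset   -- byte offset of contiguous data, or None for chunked or
                otherwise non-contiguous datasets
    has_mmap -- True when a memory map object is available (local file)
    """
    if offset is None:
        # No single byte range to map: fall back to a lazy column wrapper
        return "ColumnNumpyLike"
    if not has_mmap:
        # Contiguous data on non-mappable storage: stream it from the file
        return "ColumnFile"
    # Contiguous data in a mapped local file: zero-copy view
    return "numpy.frombuffer"


local = choose_strategy(offset=2048, has_mmap=True)
remote = choose_strategy(offset=2048, has_mmap=False)
chunked = choose_strategy(offset=None, has_mmap=True)
```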

Data Type Handling

Special handling for various data types:

# Datetime types → stored as int64 with dtype attribute
# UTF-32 strings → stored as uint8 with special attributes  
# Masked arrays → combined with mask datasets
# Arrow nulls → null bitmap integration
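
The reinterpretations listed above can be illustrated with plain NumPy views. This is only a sketch of the idea, not the library's conversion code: the nanosecond unit, the little-endian byte order, and the sample values are assumptions made for the example.

```python
import numpy as np

# Datetimes stored as int64 (assumed here: nanoseconds since the Unix epoch)
raw_ns = np.array([0, 1_000_000_000], dtype=np.int64)
times = raw_ns.view("datetime64[ns]")      # same bytes, new interpretation

# Fixed-width UTF-32 strings stored as raw bytes: 4 bytes per code point
dlength = 10
original = np.array(["alpha", "beta"], dtype=f"<U{dlength}")
raw_bytes = original.view(np.uint8)        # flat uint8 array, 40 bytes per string
strings = raw_bytes.view(f"<U{dlength}")   # reinterpret the bytes as strings again

shares = np.shares_memory(strings, original)  # views, so no data was copied
```

Because view() only changes how the existing bytes are interpreted, the same trick works on an array backed by a memory map.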

Performance Characteristics

  • Memory mapped arrays: Zero-copy access, fastest performance
  • Column wrappers: Lazy evaluation, memory efficient
  • File columns: Streaming access, works with any storage backend
  • Masked arrays: Efficient missing value handling

Column Types Returned

ColumnFile

class ColumnFile:
    """
    File-based column for non-memory-mappable storage.
    
    Provides array-like interface for data stored in files
    that cannot be memory mapped (remote storage, etc.).
    """

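To show what a file-based column might look like, here is a minimal sketch of an array-like object that reads only the requested slice with seek and read. FileBackedColumn is a hypothetical class written for this example; it is not vaex's ColumnFile and supports only contiguous slices.

```python
import io

import numpy as np


class FileBackedColumn:
    """Hypothetical column that reads slices from a file object on demand."""

    def __init__(self, file, offset, length, dtype):
        self.file = file
        self.offset = offset
        self.length = length
        self.dtype = np.dtype(dtype)

    def __len__(self):
        return self.length

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        count = max(stop - start, 0)
        # Read just the bytes covering the requested rows
        self.file.seek(self.offset + start * self.dtype.itemsize)
        raw = self.file.read(count * self.dtype.itemsize)
        return np.frombuffer(raw, dtype=self.dtype)


# In-memory stand-in for a remote file: 8-byte header, then 100 int32 values
payload = b"\x00" * 8 + np.arange(100, dtype=np.int32).tobytes()
column = FileBackedColumn(io.BytesIO(payload), offset=8, length=100, dtype=np.int32)
chunk = column[10:15]
```

Any object with seek and read works as the backing file, which is why this style of column suits storage that cannot be memory mapped.
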
ColumnNumpyLike

class ColumnNumpyLike:
    """
    Wrapper for HDF5 datasets that behave like NumPy arrays.
    
    Used for chunked or non-contiguous datasets that cannot
    be directly memory mapped.
    """

ColumnMaskedNumpy

class ColumnMaskedNumpy:
    """
    Masked column wrapper for non-contiguous masked data.
    
    Combines ColumnNumpyLike data with mask arrays for
    efficient missing value handling.
    """

Array Shape and Layout

Multi-dimensional Arrays

# 2D array mapping
shape = (1000, 3)  # 1000 rows, 3 columns
array = mmap_array(mm, f, offset, np.float32, shape)
print(array.shape)  # (1000, 3)

# High-dimensional arrays require local storage
try:
    shape = (100, 10, 5)
    array = mmap_array(None, remote_file, offset, dtype, shape)
except RuntimeError:
    print("High-d arrays not supported for remote files")

Stride Handling

The utilities automatically handle:

  • Row-major (C-order) layouts
  • Column-major (Fortran-order) layouts
  • Custom stride patterns from HDF5
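
For readers unfamiliar with the layouts listed above, the following sketch interprets one buffer in row-major order, column-major order, and with a custom stride. It uses plain NumPy on an in-memory buffer and makes no claim about how the utilities implement stride handling internally.

```python
import numpy as np

# Six int32 values laid out back to back: 0, 1, 2, 3, 4, 5
buf = np.arange(6, dtype=np.int32).tobytes()

# Row-major (C order): consecutive values fill a row first
c_view = np.ndarray(shape=(2, 3), dtype=np.int32, buffer=buf, order="C")

# Column-major (Fortran order): consecutive values fill a column first
f_view = np.ndarray(shape=(2, 3), dtype=np.int32, buffer=buf, order="F")

# Custom stride: step 8 bytes between elements, i.e. every other value
every_other = np.ndarray(shape=(3,), dtype=np.int32, buffer=buf, strides=(8,))

c_strides = c_view.strides  # bytes to step per axis in C order
f_strides = f_view.strides  # bytes to step per axis in Fortran order
```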

Error Handling

The utility functions may raise:

  • RuntimeError: For unsupported operations (high-d remote arrays)
  • ValueError: For invalid parameters or inconsistent data
  • OSError: For file access errors
  • h5py.H5Error: For HDF5 dataset access errors
  • MemoryError: If insufficient memory for mapping operations

Best Practices

Memory Management

# Always use context managers for proper cleanup
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    array = mmap_array(mm, f, offset, dtype, shape)
    # Use array...
    del array  # drop views first; closing a map with live views raises BufferError
# Memory map closed on exit

Performance Optimization

# Check if data is contiguous before mapping
if dataset.id.get_offset() is not None:
    # Contiguous data - use memory mapping
    array = h5mmap(mm, f, dataset)
else:
    # Non-contiguous - expect column wrapper
    column = h5mmap(None, f, dataset)

Error Handling

try:
    array = h5mmap(mm, f, dataset, mask)
except OSError as e:
    # Handle file access errors
    print(f"Cannot access dataset: {e}")
except ValueError as e:
    # Handle data format errors
    print(f"Invalid dataset format: {e}")
