# Memory Mapping Utilities

Low-level utilities for memory mapping HDF5 datasets and arrays, with support for masked arrays, different storage layouts, and efficient zero-copy operations.

## Capabilities

### Array Memory Mapping

Create memory-mapped arrays from file data, falling back to file-based column access when the storage cannot be memory mapped.

```python { .api }
def mmap_array(mmap, file, offset, dtype, shape):
    """
    Create memory-mapped array from file data.

    Provides zero-copy access to file data through memory mapping
    or file-based column access for non-mappable storage.

    Parameters:
    - mmap: Memory map object (mmap.mmap) or None
    - file: File object for reading data
    - offset: Byte offset in file where data starts
    - dtype: NumPy data type of the array
    - shape: Tuple defining array dimensions

    Returns:
    - numpy.ndarray: Memory-mapped array (if mmap provided)
    - ColumnFile: File-based column for remote/non-mappable storage

    Raises:
    - RuntimeError: If a high-dimensional array is requested from a non-local file
    """
```
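
To make the zero-copy path concrete, here is a minimal sketch of viewing a file region as an array with `numpy.frombuffer`. The helper `map_region` is a hypothetical name for illustration and is not the library's implementation; it only assumes that the mapped branch amounts to a `frombuffer` call followed by a reshape.

```python
import mmap
import os
import tempfile

import numpy as np


def map_region(mm, offset, dtype, shape):
    """Hypothetical helper: view a region of `mm` as an array without copying."""
    count = int(np.prod(shape))
    flat = np.frombuffer(mm, dtype=dtype, count=count, offset=offset)
    return flat.reshape(shape)


# Write a 16-byte header followed by six float64 values.
values = np.arange(6, dtype=np.float64)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 16)
    tmp.write(values.tobytes())
    path = tmp.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped = map_region(mm, 16, np.float64, (2, 3))
    mapped_shape = mapped.shape
    is_view = not mapped.flags.owndata  # the array does not own its memory
    mapped_copy = mapped.copy()         # copied only so it can be inspected after closing
    del mapped                          # drop the view before closing the map
    mm.close()
os.remove(path)
```

The array is only valid while the map is open, which is why the sketch copies the values it wants to keep and deletes the view before calling `mm.close()`.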

### HDF5 Dataset Memory Mapping

Memory map HDF5 datasets with support for data type conversion and masking.

```python { .api }
def h5mmap(mmap, file, data, mask=None):
    """
    Memory map HDF5 dataset with optional mask support.

    Handles HDF5-specific data layouts, attribute-based type conversion,
    and masked array creation for datasets with missing values.

    Parameters:
    - mmap: Memory map object or None for non-mappable storage
    - file: File object for dataset access
    - data: HDF5 dataset to map
    - mask: Optional HDF5 dataset containing mask data

    Returns:
    - numpy.ndarray: Memory-mapped array for contiguous datasets
    - numpy.ma.MaskedArray: Masked array if mask provided
    - ColumnNumpyLike: Column wrapper for non-contiguous datasets
    - ColumnMaskedNumpy: Masked column for non-contiguous masked data

    Notes:
    - Handles special dtypes from HDF5 attributes (e.g., UTF-32 strings)
    - Returns ColumnNumpyLike for chunked or non-contiguous datasets
    - Supports both numpy-style masks and Arrow-style null bitmaps
    """
```
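
The two mask conventions mentioned in the notes differ in both layout and meaning. The following sketch, using plain NumPy arrays as stand-ins for mapped datasets, shows how each could be turned into a `numpy.ma.MaskedArray`. It illustrates the conventions only and makes no claim about how `h5mmap` builds its result.

```python
import numpy as np

data = np.array([1.5, 2.5, 3.5, 4.5])

# NumPy-style mask: one byte per value, non-zero marks a missing value.
byte_mask = np.array([0, 1, 0, 0], dtype=np.uint8)
masked = np.ma.array(data, mask=byte_mask.astype(bool), shrink=False)

# Arrow-style validity bitmap: one bit per value, least-significant bit first,
# and a set bit marks a *valid* value (the opposite sense of a NumPy mask).
bitmap = np.array([0b00001101], dtype=np.uint8)
valid = np.unpackbits(bitmap, bitorder="little")[: len(data)].astype(bool)
masked_from_bitmap = np.ma.array(data, mask=~valid, shrink=False)

n_valid = masked.count()  # number of unmasked values
total = masked.sum()      # sum over unmasked values only
```

Both constructions mask the second element, so they describe the same missing value in two encodings.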

## Usage Examples

### Basic Memory Mapping

```python
import mmap
from vaex.hdf5.utils import mmap_array
import numpy as np

# Memory map a file region (data.bin must hold at least 1000 float64 values)
with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Map 1000 float64 values starting at offset 0
        array = mmap_array(mm, f, 0, np.float64, (1000,))
        print(array.shape)  # (1000,)
        print(array.dtype)  # float64
        del array  # release the view before the map is closed
```

### HDF5 Dataset Mapping

```python
import mmap

import h5py
from vaex.hdf5.utils import h5mmap

# Map HDF5 dataset
with open('data.hdf5', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with h5py.File(f, 'r') as h5f:
            dataset = h5f['table/columns/x/data']

            # Simple mapping
            array = h5mmap(mm, f, dataset)

            # With mask
            mask_dataset = h5f['table/columns/x/mask']
            masked_array = h5mmap(mm, f, dataset, mask_dataset)

            # Release the views before the map is closed
            del array, masked_array
```

### Remote Storage Mapping

```python
# file_handle, offset, dtype, shape, and hdf5_dataset are placeholders
# for an already opened remote file and the layout of its data.

# For non-mappable storage (S3, etc.)
array = mmap_array(None, file_handle, offset, dtype, shape)
# Returns ColumnFile instead of numpy array

# HDF5 on remote storage
array = h5mmap(None, file_handle, hdf5_dataset)
# Returns ColumnNumpyLike when the dataset is non-contiguous
```

### Working with Special Data Types

```python
# UTF-32 strings stored as bytes
# (mm and f are the memory map and file object for strings.hdf5)
with h5py.File('strings.hdf5', 'r') as h5f:
    dataset = h5f['string_column/data']
    # Dataset has attributes: dtype="utf32", dlength=10
    array = h5mmap(mm, f, dataset)
    # Returns array with correct UTF-32 dtype
```

### Handling Empty Arrays

```python
# Empty datasets (common in sparse data)
empty_dataset = h5f['empty_column/data']  # len(empty_dataset) == 0
array = h5mmap(mm, f, empty_dataset)
# Handles offset=None case gracefully
```

## Implementation Details

### Memory Mapping Strategy

The utilities use different strategies based on data characteristics:

1. **Contiguous data** → Direct memory mapping via `numpy.frombuffer`
2. **Non-contiguous data** → `ColumnNumpyLike` wrapper for lazy access
3. **Remote storage** → `ColumnFile` for streaming access
4. **Chunked datasets** → Column wrappers with decompression
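
The cases above can be read as a dispatch on two questions: does the dataset have a single byte offset in the file, and is a memory map available? The sketch below encodes one plausible ordering of those checks. `choose_strategy` and its string labels are hypothetical names for illustration, and the library may order or combine the checks differently.

```python
from typing import Optional


def choose_strategy(has_mmap: bool, offset: Optional[int]) -> str:
    """Hypothetical helper: name the access strategy for a dataset.

    `offset` is None for chunked or otherwise non-contiguous datasets,
    which have no single byte offset in the file.
    """
    if offset is None:
        # Chunked / non-contiguous: wrap the dataset for lazy access.
        return "column wrapper"
    if not has_mmap:
        # Contiguous but not mappable (e.g. remote storage): stream from the file.
        return "file column"
    # Contiguous and mappable: zero-copy view via numpy.frombuffer.
    return "memory map"


local_contiguous = choose_strategy(True, 4096)
remote_contiguous = choose_strategy(False, 4096)
chunked = choose_strategy(True, None)
```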

### Data Type Handling

Special handling for various data types:

```python
# Datetime types → stored as int64 with dtype attribute
# UTF-32 strings → stored as uint8 with special attributes
# Masked arrays → combined with mask datasets
# Arrow nulls → null bitmap integration
```
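
The first two conversions can be illustrated with NumPy's `view`, which reinterprets existing bytes without copying them. This is a sketch under stated assumptions: the nanosecond unit, the native byte order of the stored strings, and the variable `dlength` (mirroring the attribute of the same name) are chosen for illustration, not taken from the library.

```python
import numpy as np

# Datetimes stored as int64: reinterpret the same bytes as datetime64.
raw_times = np.array([0, 1_000_000_000], dtype=np.int64)
times = raw_times.view("datetime64[ns]")

# Fixed-width UTF-32 strings stored as uint8: each character takes 4 bytes,
# so a string of `dlength` characters occupies 4 * dlength bytes.
dlength = 3  # assumed per-string length, analogous to a "dlength" attribute
raw_text = np.array(["abc", "de"], dtype=f"U{dlength}").view(np.uint8)
text = raw_text.view(f"U{dlength}")
```

Because `view` shares memory with the source array, the same approach works on a memory-mapped buffer without loading it into RAM.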

### Performance Characteristics

- **Memory-mapped arrays**: Zero-copy access; typically the fastest path for local files
- **Column wrappers**: Lazy evaluation; data is read only when accessed, which keeps memory use low
- **File columns**: Streaming access that works with any storage backend
- **Masked arrays**: Missing values are tracked through a separate mask

## Column Types Returned

### ColumnFile

```python { .api }
class ColumnFile:
    """
    File-based column for non-memory-mappable storage.

    Provides array-like interface for data stored in files
    that cannot be memory mapped (remote storage, etc.).
    """
```
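
To show the idea behind a file-based column, here is a minimal sketch that serves slices by seeking and reading instead of mapping. `SeekReadColumn` is a hypothetical class written for this illustration; it is not `ColumnFile` and supports only contiguous slices. An in-memory `io.BytesIO` stands in for a remote file object.

```python
import io

import numpy as np


class SeekReadColumn:
    """Hypothetical stand-in for a file-backed column (not vaex's ColumnFile)."""

    def __init__(self, file, byte_offset, length, dtype):
        self.file = file
        self.byte_offset = byte_offset
        self.length = length
        self.dtype = np.dtype(dtype)

    def __len__(self):
        return self.length

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("only contiguous slices are supported in this sketch")
        count = max(stop - start, 0)
        if count == 0:
            return np.empty(0, dtype=self.dtype)
        # Read just the requested bytes; nothing else is loaded.
        self.file.seek(self.byte_offset + start * self.dtype.itemsize)
        raw = self.file.read(count * self.dtype.itemsize)
        return np.frombuffer(raw, dtype=self.dtype, count=count)


payload = np.arange(10, dtype=np.int32)
fake_remote = io.BytesIO(b"HEAD" + payload.tobytes())  # 4-byte header, then data
column = SeekReadColumn(fake_remote, byte_offset=4, length=10, dtype=np.int32)
chunk = column[2:5]
```

Unlike a memory map, each slice here costs a read and a copy, which is the trade-off for working with storage that only exposes a file-like interface.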

### ColumnNumpyLike

```python { .api }
class ColumnNumpyLike:
    """
    Wrapper for HDF5 datasets that behave like NumPy arrays.

    Used for chunked or non-contiguous datasets that cannot
    be directly memory mapped.
    """
```

### ColumnMaskedNumpy

```python { .api }
class ColumnMaskedNumpy:
    """
    Masked column wrapper for non-contiguous masked data.

    Combines ColumnNumpyLike data with mask arrays for
    efficient missing value handling.
    """
```

## Array Shape and Layout

### Multi-dimensional Arrays

```python
# 2D array mapping
shape = (1000, 3)  # 1000 rows, 3 columns
array = mmap_array(mm, f, offset, np.float32, shape)
print(array.shape)  # (1000, 3)

# High-dimensional arrays require local storage
try:
    shape = (100, 10, 5)
    array = mmap_array(None, remote_file, offset, dtype, shape)
except RuntimeError:
    print("High-d arrays not supported for remote files")
```

### Stride Handling

The utilities automatically handle:
- Row-major (C-order) layouts
- Column-major (Fortran-order) layouts
- Custom stride patterns from HDF5
227
## Error Handling
228
229
The utility functions may raise:
230
231
- `RuntimeError`: For unsupported operations (high-d remote arrays)
232
- `ValueError`: For invalid parameters or inconsistent data
233
- `OSError`: For file access errors
234
- `h5py.H5Error`: For HDF5 dataset access errors
235
- `MemoryError`: If insufficient memory for mapping operations

## Best Practices

### Memory Management

```python
# Always use context managers for proper cleanup
with mmap.mmap(f.fileno(), 0) as mm:
    array = mmap_array(mm, f, offset, dtype, shape)
    # Use array...
    del array  # drop views into the map before it is closed
# Memory map automatically closed
```
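
One caveat with this pattern: CPython refuses to close an `mmap` while another object still holds a buffer exported from it, and raises `BufferError` instead. Whether an array returned by `mmap_array` holds such an export depends on the NumPy version in use, so treat that part as an assumption; the sketch below uses a plain `memoryview` on an anonymous map as a stand-in for the mapped array.

```python
import mmap

mm = mmap.mmap(-1, 64)   # anonymous 64-byte map, no file needed
view = memoryview(mm)    # stand-in for an array backed by the map

try:
    mm.close()           # fails: `view` still exports the map's buffer
    closed_early = True
except BufferError:
    closed_early = False

view.release()           # drop the export first...
mm.close()               # ...then closing succeeds
```

Releasing or deleting every view before the `with` block ends avoids the error and ensures no array is left pointing at unmapped memory.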

### Performance Optimization

```python
# Check if data is contiguous before mapping
if dataset.id.get_offset() is not None:
    # Contiguous data - use memory mapping
    array = h5mmap(mm, f, dataset)
else:
    # Non-contiguous - expect column wrapper
    column = h5mmap(None, f, dataset)
```

### Error Handling

```python
try:
    array = h5mmap(mm, f, dataset, mask)
except OSError as e:
    # Handle file access errors
    print(f"Cannot access dataset: {e}")
except ValueError as e:
    # Handle data format errors
    print(f"Invalid dataset format: {e}")
```