# Memory Mapping Utilities

Low-level utilities for memory mapping HDF5 datasets and arrays with support for masked arrays, different storage layouts, and efficient zero-copy operations.

## Capabilities

### Array Memory Mapping

Create memory-mapped arrays from file data with support for different storage backends.

```python { .api }
def mmap_array(mmap, file, offset, dtype, shape):
    """
    Create memory-mapped array from file data.

    Provides zero-copy access to file data through memory mapping
    or file-based column access for non-mappable storage.

    Parameters:
    - mmap: Memory map object (mmap.mmap) or None
    - file: File object for reading data
    - offset: Byte offset in file where data starts
    - dtype: NumPy data type of the array
    - shape: Tuple defining array dimensions

    Returns:
    - numpy.ndarray: Memory-mapped array (if mmap provided)
    - ColumnFile: File-based column for remote/non-mappable storage

    Raises:
    RuntimeError: If high-dimensional arrays requested from non-local files
    """
```
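The two branches described in the docstring can be pictured with a minimal sketch. It assumes the mapped branch is built on `numpy.frombuffer` (as the Memory Mapping Strategy section states) and uses an eager `file.read` as a simplified stand-in for `ColumnFile`. The name `sketch_mmap_array` is hypothetical and not part of the library.

```python
import mmap
import tempfile

import numpy as np


def sketch_mmap_array(mm, file, offset, dtype, shape):
    """Hypothetical stand-in illustrating the two branches of mmap_array."""
    dtype = np.dtype(dtype)
    count = int(np.prod(shape))
    if mm is None:
        if len(shape) > 1:
            raise RuntimeError("high-dimensional arrays need a local, mappable file")
        # Simplified stand-in for ColumnFile: read the bytes eagerly (this copies).
        file.seek(offset)
        return np.frombuffer(file.read(count * dtype.itemsize), dtype=dtype)
    # Zero-copy view onto the mapped bytes.
    return np.frombuffer(mm, dtype=dtype, count=count, offset=offset).reshape(shape)


# Write six float64 values, then read them back through both branches.
with tempfile.TemporaryFile() as f:
    f.write(np.arange(6, dtype=np.float64).tobytes())
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mapped = sketch_mmap_array(mm, f, 0, np.float64, (2, 3))
        mapped_copy = mapped.copy()  # keep the values after the map is closed
        del mapped                   # release the view before the map closes
    streamed = sketch_mmap_array(None, f, 0, np.float64, (6,))
```

The mapped branch returns a view that is only valid while the map is open, which is why the sketch copies the values it wants to keep.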

### HDF5 Dataset Memory Mapping

Memory map HDF5 datasets with support for data type conversion and masking.

```python { .api }
def h5mmap(mmap, file, data, mask=None):
    """
    Memory map HDF5 dataset with optional mask support.

    Handles HDF5-specific data layouts, attribute-based type conversion,
    and masked array creation for datasets with missing values.

    Parameters:
    - mmap: Memory map object or None for non-mappable storage
    - file: File object for dataset access
    - data: HDF5 dataset to map
    - mask: Optional HDF5 dataset containing mask data

    Returns:
    - numpy.ndarray: Memory-mapped array for contiguous datasets
    - numpy.ma.MaskedArray: Masked array if mask provided
    - ColumnNumpyLike: Column wrapper for non-contiguous datasets
    - ColumnMaskedNumpy: Masked column for non-contiguous masked data

    Notes:
    - Handles special dtypes from HDF5 attributes (e.g., UTF-32 strings)
    - Returns ColumnNumpyLike for chunked or non-contiguous datasets
    - Supports both numpy-style masks and Arrow-style null bitmaps
    """
```
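For the masked return type, the following sketch shows how a data buffer and a boolean mask buffer can be combined into a `numpy.ma.MaskedArray` using plain NumPy. The byte strings stand in for two mapped datasets; how `h5mmap` does this internally is not shown in this document, so treat it as an illustration of the result rather than of the implementation.

```python
import numpy as np

# Two flat buffers, as they might sit side by side in a file (illustrative data).
data_bytes = np.array([1.0, 2.0, 3.0, 4.0]).tobytes()
mask_bytes = np.array([False, True, False, False]).tobytes()

data = np.frombuffer(data_bytes, dtype=np.float64)
mask = np.frombuffer(mask_bytes, dtype=np.bool_)

# True in the mask marks a missing value.
masked = np.ma.array(data, mask=mask)
total = masked.sum()  # masked entries are skipped
```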

## Usage Examples

### Basic Memory Mapping

```python
import mmap
from vaex.hdf5.utils import mmap_array, h5mmap
import numpy as np

# Memory map a file region
with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Map 1000 float64 values starting at offset 0
        array = mmap_array(mm, f, 0, np.float64, (1000,))
        print(array.shape)  # (1000,)
        print(array.dtype)  # float64
        # Release the view before the map closes; closing a map that
        # still has a live NumPy view raises BufferError
        del array
```

### HDF5 Dataset Mapping

```python
import mmap

import h5py
from vaex.hdf5.utils import h5mmap

# Map HDF5 dataset
with open('data.hdf5', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with h5py.File(f, 'r') as h5f:
            dataset = h5f['table/columns/x/data']

            # Simple mapping
            array = h5mmap(mm, f, dataset)

            # With mask
            mask_dataset = h5f['table/columns/x/mask']
            masked_array = h5mmap(mm, f, dataset, mask_dataset)

            # Release the views before the map closes
            del array, masked_array
```

### Remote Storage Mapping

```python
# For non-mappable storage (S3, etc.)
array = mmap_array(None, file_handle, offset, dtype, shape)
# Returns ColumnFile instead of numpy array

# HDF5 on remote storage
array = h5mmap(None, file_handle, hdf5_dataset)
# Returns ColumnNumpyLike for non-contiguous access
```
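To make the idea of a file-based column concrete, here is a minimal sketch of a class that reads only the requested slice from a seekable file object. `FileColumnSketch` and its constructor arguments are hypothetical and much simpler than the library's `ColumnFile`; an in-memory `io.BytesIO` stands in for a remote file handle.

```python
import io

import numpy as np


class FileColumnSketch:
    """Hypothetical stand-in for a file-backed column (not the library's ColumnFile)."""

    def __init__(self, file, offset, length, dtype):
        self.file = file
        self.offset = offset
        self.length = length
        self.dtype = np.dtype(dtype)

    def __len__(self):
        return self.length

    def __getitem__(self, item):
        # Only step-1 slices are supported in this sketch.
        start, stop, step = item.indices(self.length)
        if step != 1:
            raise IndexError("only step-1 slices are supported")
        count = max(stop - start, 0)
        if count == 0:
            return np.empty(0, dtype=self.dtype)
        # Seek to the first requested element and read just those bytes.
        self.file.seek(self.offset + start * self.dtype.itemsize)
        raw = self.file.read(count * self.dtype.itemsize)
        return np.frombuffer(raw, dtype=self.dtype)


# An in-memory file stands in for a remote object: 8-byte header, then 10 int32 values.
remote = io.BytesIO(b"HEADER__" + np.arange(10, dtype=np.int32).tobytes())
column = FileColumnSketch(remote, offset=8, length=10, dtype=np.int32)
window = column[3:6]
```

Only the 12 bytes backing `column[3:6]` are read, which is the property that makes this style of access workable for storage that cannot be memory mapped.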

### Working with Special Data Types

```python
# UTF-32 strings stored as bytes
# (mm and f are the memory map and file object for strings.hdf5,
# opened as in the HDF5 Dataset Mapping example)
with h5py.File('strings.hdf5', 'r') as h5f:
    dataset = h5f['string_column/data']
    # Dataset has attributes: dtype="utf32", dlength=10
    array = h5mmap(mm, f, dataset)
    # Returns array with correct UTF-32 dtype
```
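The byte-level idea can be sketched with NumPy alone, assuming the `utf32` attribute is translated into NumPy's fixed-width unicode dtype (`U<dlength>`), which stores four bytes per character. The sample strings are illustrative.

```python
import numpy as np

dlength = 10  # characters per string, mirroring the dlength attribute above
names = np.array(["alpha", "beta", "gamma"], dtype=f"U{dlength}")

# On disk the column is just bytes: 4 bytes per character, dlength characters per row.
raw = names.view(np.uint8)

# Reinterpreting the bytes with the string dtype restores the values without copying.
restored = raw.view(np.dtype(f"U{dlength}"))
```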

### Handling Empty Arrays

```python
# Empty datasets (common in sparse data)
empty_dataset = h5f['empty_column/data']  # len(empty_dataset) == 0
array = h5mmap(mm, f, empty_dataset)
# Handles offset=None case gracefully
```
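One way to picture the `offset=None` case: an empty dataset has no bytes in the file, so there is nothing to map and the offset can be ignored. The helper below is a hypothetical sketch of that guard, not the library's code.

```python
import numpy as np


def map_or_empty(buffer, offset, dtype, shape):
    """Hypothetical helper: map a region, or build an empty array when there is nothing to map."""
    count = int(np.prod(shape))
    if count == 0:
        # No bytes to map, so the offset (possibly None) is irrelevant.
        return np.empty(shape, dtype=dtype)
    if offset is None:
        raise ValueError("non-empty dataset without a file offset")
    return np.frombuffer(buffer, dtype=dtype, count=count, offset=offset).reshape(shape)


empty = map_or_empty(b"", None, np.float64, (0,))
filled = map_or_empty(np.arange(4, dtype=np.float64).tobytes(), 0, np.float64, (4,))
```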

## Implementation Details

### Memory Mapping Strategy

The utilities use different strategies based on data characteristics:

1. **Contiguous data** → Direct memory mapping via `numpy.frombuffer`
2. **Non-contiguous data** → `ColumnNumpyLike` wrapper for lazy access
3. **Remote storage** → `ColumnFile` for streaming access
4. **Chunked datasets** → Column wrappers with decompression
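The list above can be read as a small decision procedure. The sketch below encodes one plausible ordering of the checks and returns only a label; the function name, its parameters, and the order of the tests are assumptions made for illustration.

```python
def choose_strategy(has_mmap, offset, chunked):
    """Hypothetical decision helper mirroring the list above; returns a label only."""
    if chunked or offset is None:
        # No single contiguous byte range exists in the file.
        return "ColumnNumpyLike"
    if not has_mmap:
        # Contiguous bytes, but the storage cannot be memory mapped.
        return "ColumnFile"
    return "numpy.frombuffer"
```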

### Data Type Handling

Special handling for various data types:

```python
# Datetime types → stored as int64 with dtype attribute
# UTF-32 strings → stored as uint8 with special attributes
# Masked arrays → combined with mask datasets
# Arrow nulls → null bitmap integration
```
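Two of these conversions are easy to illustrate with NumPy alone. The sketch assumes nanosecond resolution for the stored datetimes and the standard Arrow validity-bitmap convention (least-significant bit first, 1 means valid); the sample values are made up.

```python
import numpy as np

# Datetimes: int64 values on disk, reinterpreted in place (ns resolution assumed).
stored = np.array([0, 86_400_000_000_000], dtype=np.int64)
timestamps = stored.view("datetime64[ns]")

# Arrow-style validity bitmap: one bit per row, least-significant bit first, 1 = valid.
bitmap = np.array([0b00000101], dtype=np.uint8)  # rows 0 and 2 are valid
n_rows = 3
valid = np.unpackbits(bitmap, bitorder="little")[:n_rows].astype(bool)

# NumPy masks use the opposite sense (True = missing), so invert.
values = np.ma.array([10, 20, 30], mask=~valid)
```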

### Performance Characteristics

- **Memory mapped arrays**: Zero-copy access; usually the fastest option for local files
- **Column wrappers**: Lazy evaluation; data is read only when indexed, which keeps memory use low
- **File columns**: Streaming access; works with any storage backend
- **Masked arrays**: Efficient missing value handling
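What "zero-copy" means in practice can be checked directly. In this sketch an anonymous `mmap` stands in for a mapped file; a view created with `numpy.frombuffer` shares the mapped bytes, while `np.array(view)` makes an independent copy.

```python
import mmap
import struct

import numpy as np

# Anonymous map used as a stand-in for a mapped file region.
mm = mmap.mmap(-1, 8 * 4)
mm.write(np.arange(4, dtype=np.float64).tobytes())

view = np.frombuffer(mm, dtype=np.float64)   # shares the mapped bytes
heap_copy = np.array(view)                   # independent copy in process memory

owns_view, owns_copy = view.flags.owndata, heap_copy.flags.owndata

# A write through the map is visible in the view but not in the copy.
mm[0:8] = struct.pack("d", 99.0)
seen_by_view, seen_by_copy = float(view[0]), float(heap_copy[0])

del view    # drop the view so the map can be closed
mm.close()
```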

## Column Types Returned

### ColumnFile

```python { .api }
class ColumnFile:
    """
    File-based column for non-memory-mappable storage.

    Provides array-like interface for data stored in files
    that cannot be memory mapped (remote storage, etc.).
    """
```

### ColumnNumpyLike

```python { .api }
class ColumnNumpyLike:
    """
    Wrapper for HDF5 datasets that behave like NumPy arrays.

    Used for chunked or non-contiguous datasets that cannot
    be directly memory mapped.
    """
```
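A minimal sketch of the lazy-wrapper idea follows. `LazyColumnSketch` is a hypothetical class that forwards `shape` and `dtype` and only touches the backing store when indexed; `CountingStore` is a toy store that records how many rows were read, standing in for an HDF5 dataset.

```python
import numpy as np


class LazyColumnSketch:
    """Hypothetical wrapper: forwards metadata, materializes data only on indexing."""

    def __init__(self, dataset):
        self.dataset = dataset  # anything with shape, dtype and __getitem__
        self.shape = dataset.shape
        self.dtype = dataset.dtype

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, item):
        # The backing store decides how to fetch (and decompress) the rows.
        return np.asarray(self.dataset[item])


class CountingStore:
    """Toy backing store that records how many rows were actually read."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.shape, self.dtype = self._values.shape, self._values.dtype
        self.rows_read = 0

    def __getitem__(self, item):
        rows = self._values[item]
        self.rows_read += np.size(rows)
        return rows


store = CountingStore(np.arange(1_000))
column = LazyColumnSketch(store)
first_rows = column[:5]
```

After the slice, `store.rows_read` shows that only the requested rows were fetched, not the whole column.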

### ColumnMaskedNumpy

```python { .api }
class ColumnMaskedNumpy:
    """
    Masked column wrapper for non-contiguous masked data.

    Combines ColumnNumpyLike data with mask arrays for
    efficient missing value handling.
    """
```
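The pairing of a data column with a mask column can be sketched as below. `MaskedColumnSketch` is a hypothetical class, and plain NumPy arrays stand in for the two columns; the point is that both are sliced identically and combined only for the rows requested.

```python
import numpy as np


class MaskedColumnSketch:
    """Hypothetical pairing of a data column and a mask column of equal length."""

    def __init__(self, data, mask):
        if len(data) != len(mask):
            raise ValueError("data and mask must have the same length")
        self.data = data
        self.mask = mask

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        # Slice both columns the same way, then combine only the requested rows.
        return np.ma.array(self.data[item], mask=self.mask[item])


column = MaskedColumnSketch(
    np.array([1.5, 2.5, 3.5, 4.5]),
    np.array([False, False, True, False]),
)
chunk = column[1:4]
```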

201

202

## Array Shape and Layout

### Multi-dimensional Arrays

```python
# 2D array mapping
shape = (1000, 3)  # 1000 rows, 3 columns
array = mmap_array(mm, f, offset, np.float32, shape)
print(array.shape)  # (1000, 3)

# High-dimensional arrays require local storage
try:
    shape = (100, 10, 5)
    array = mmap_array(None, remote_file, offset, dtype, shape)
except RuntimeError:
    print("High-d arrays not supported for remote files")
```

### Stride Handling

The utilities automatically handle:

- Row-major (C-order) layouts
- Column-major (Fortran-order) layouts
- Custom stride patterns from HDF5
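The difference between the first two layouts can be seen with NumPy alone: the same flat buffer can be viewed in either order without copying, and only the strides change. This illustrates the layouts themselves; how the utilities select a layout is not shown in this document.

```python
import numpy as np

raw = np.arange(6, dtype=np.float32).tobytes()  # six values laid out back to back
flat = np.frombuffer(raw, dtype=np.float32)

c_order = flat.reshape((2, 3))             # rows are contiguous
f_order = flat.reshape((2, 3), order="F")  # columns are contiguous

c_strides, f_strides = c_order.strides, f_order.strides
```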

## Error Handling

The utility functions may raise:

- `RuntimeError`: For unsupported operations (high-dimensional remote arrays)
- `ValueError`: For invalid parameters or inconsistent data
- `OSError`: For file access errors
- `h5py.H5Error`: For HDF5 dataset access errors (a name from older h5py releases; h5py 2.x and later raise standard exceptions such as `OSError` and `KeyError` instead)
- `MemoryError`: If insufficient memory for mapping operations
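Some of these failures can be caught before mapping. The helper below is a hypothetical pre-flight check, not a library function, that verifies a requested region fits inside a buffer of known size and raises `ValueError` otherwise.

```python
import numpy as np


def check_region(buffer_size, offset, dtype, shape):
    """Hypothetical pre-flight check; returns the number of bytes the region spans."""
    if offset < 0 or any(dim < 0 for dim in shape):
        raise ValueError("offset and dimensions must be non-negative")
    nbytes = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    if offset + nbytes > buffer_size:
        raise ValueError(
            f"region [{offset}, {offset + nbytes}) exceeds buffer of {buffer_size} bytes"
        )
    return nbytes
```

A call such as `check_region(len(mm), offset, dtype, shape)` before mapping turns an out-of-range request into an explicit error message.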

## Best Practices

### Memory Management

```python
# Always use context managers for proper cleanup
with mmap.mmap(f.fileno(), 0) as mm:
    array = mmap_array(mm, f, offset, dtype, shape)
    # Use array...
    del array  # drop views first; a live view blocks closing the map
# Memory map is closed on leaving the with-block
```
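One detail worth knowing when relying on automatic cleanup: in CPython, a map cannot be closed while a NumPy array created with `numpy.frombuffer` still refers to it. The sketch uses an anonymous map as a stand-in for a file-backed one.

```python
import mmap

import numpy as np

mm = mmap.mmap(-1, 64)                       # anonymous map as a stand-in for a file
array = np.frombuffer(mm, dtype=np.float64)  # exports the map's buffer

try:
    mm.close()
    close_failed = False
except BufferError:
    # The live NumPy view still points into the map.
    close_failed = True

del array   # drop the last reference to the view
mm.close()  # now succeeds
```

Copying the data you need (for example with `array.copy()`) or deleting the views before the `with` block ends avoids the error.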

### Performance Optimization

```python
# Check if data is contiguous before mapping
if dataset.id.get_offset() is not None:
    # Contiguous data - use memory mapping
    array = h5mmap(mm, f, dataset)
else:
    # Non-contiguous - expect column wrapper
    column = h5mmap(None, f, dataset)
```

### Error Handling

```python
try:
    array = h5mmap(mm, f, dataset, mask)
except OSError as e:
    # Handle file access errors
    print(f"Cannot access dataset: {e}")
except ValueError as e:
    # Handle data format errors
    print(f"Invalid dataset format: {e}")
```