# Memory Management

Memory information including total memory, usage statistics, VRAM details, and memory error management for AMD GPU devices.

## Capabilities

### Memory Total and Usage

Get total memory and current usage for different memory types on the GPU.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_memory_total(amdsmi_processor_handle processor_handle, amdsmi_memory_type_t mem_type, uint64_t *total);
amdsmi_status_t amdsmi_get_gpu_memory_usage(amdsmi_processor_handle processor_handle, amdsmi_memory_type_t mem_type, uint64_t *used);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `mem_type`: Type of memory (VRAM, VIS_VRAM, GTT)
- `total`: Pointer to receive total memory amount in bytes
- `used`: Pointer to receive used memory amount in bytes

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Assumes `processor` is a valid amdsmi_processor_handle
uint64_t total_vram = 0, used_vram = 0;
amdsmi_status_t ret;

// Get VRAM information
ret = amdsmi_get_gpu_memory_total(processor, AMDSMI_MEM_TYPE_VRAM, &total_vram);
if (ret == AMDSMI_STATUS_SUCCESS) {
    ret = amdsmi_get_gpu_memory_usage(processor, AMDSMI_MEM_TYPE_VRAM, &used_vram);
    // Skip the report if the total is zero, so the percentage never divides by zero
    if (ret == AMDSMI_STATUS_SUCCESS && total_vram > 0) {
        // Cast to unsigned long long so the arguments match %llu on every platform
        printf("VRAM: %llu MB used / %llu MB total (%.1f%% usage)\n",
               (unsigned long long)(used_vram / (1024 * 1024)),
               (unsigned long long)(total_vram / (1024 * 1024)),
               (double)used_vram / (double)total_vram * 100.0);
    }
}

// Get GTT memory information
uint64_t total_gtt = 0, used_gtt = 0;
ret = amdsmi_get_gpu_memory_total(processor, AMDSMI_MEM_TYPE_GTT, &total_gtt);
if (ret == AMDSMI_STATUS_SUCCESS) {
    // Only query usage once the total has been read successfully
    ret = amdsmi_get_gpu_memory_usage(processor, AMDSMI_MEM_TYPE_GTT, &used_gtt);
}
```

### VRAM Usage Information

Get comprehensive VRAM usage information in a structured format.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_handle, amdsmi_vram_info_t *info);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `info`: Pointer to receive VRAM usage information

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_vram_info_t vram_info;
amdsmi_status_t ret = amdsmi_get_gpu_vram_usage(processor, &vram_info);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("VRAM Total: %u MB\n", vram_info.vram_total);
    printf("VRAM Used: %u MB\n", vram_info.vram_used);
    printf("VRAM Free: %u MB\n", vram_info.vram_total - vram_info.vram_used);
}
```

### Bad/Retired Page Information

Get information about bad or retired memory pages that are no longer usable.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_bad_page_info(amdsmi_processor_handle processor_handle, uint32_t *num_pages, amdsmi_retired_page_record_t *info);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `num_pages`: As input, maximum number of page records. As output, actual number available or written.
- `info`: Pointer to array of retired page records, or NULL to query count only

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// First get the count of bad pages
uint32_t num_bad_pages = 0;
amdsmi_status_t ret = amdsmi_get_gpu_bad_page_info(processor, &num_bad_pages, NULL);
if (ret == AMDSMI_STATUS_SUCCESS && num_bad_pages > 0) {
    // Allocate memory and get the page records
    amdsmi_retired_page_record_t *bad_pages =
        malloc(num_bad_pages * sizeof *bad_pages);
    if (bad_pages != NULL) {
        ret = amdsmi_get_gpu_bad_page_info(processor, &num_bad_pages, bad_pages);
        if (ret == AMDSMI_STATUS_SUCCESS) {
            printf("Found %u bad memory pages:\n", num_bad_pages);
            for (uint32_t i = 0; i < num_bad_pages; i++) {
                // Casts keep the arguments in step with the %llx, %llu and %d specifiers
                printf("  Page %u: Address 0x%llx, Size %llu bytes, Status: %d\n",
                       i, (unsigned long long)bad_pages[i].page_address,
                       (unsigned long long)bad_pages[i].page_size,
                       (int)bad_pages[i].status);
            }
        }
        free(bad_pages);
    }
}
```

### Reserved Memory Pages

Get information about reserved (retired) memory pages across the system.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_memory_reserved_pages(amdsmi_processor_handle processor_handle, uint32_t *num_pages, amdsmi_retired_page_record_t *records);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `num_pages`: As input, maximum number of page records. As output, actual number available or written.
- `records`: Pointer to array of retired page records, or NULL to query count only

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

### RAS Block Features

Check if RAS (Reliability, Availability, Serviceability) features are enabled for specific GPU blocks.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ras_block_features_enabled(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_ras_err_state_t *state);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `block`: GPU block to query (UMC, SDMA, GFX, etc.)
- `state`: Pointer to receive RAS error state

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

## Python API

### Memory Information

```python { .api }
def amdsmi_get_gpu_memory_total(processor_handle, mem_type):
    """
    Get total memory for a specific memory type.

    Args:
        processor_handle: GPU processor handle
        mem_type (AmdSmiMemoryType): Memory type to query

    Returns:
        int: Total memory in bytes

    Raises:
        AmdSmiException: If memory query fails
    """

def amdsmi_get_gpu_memory_usage(processor_handle, mem_type):
    """
    Get used memory for a specific memory type.

    Args:
        processor_handle: GPU processor handle
        mem_type (AmdSmiMemoryType): Memory type to query

    Returns:
        int: Used memory in bytes

    Raises:
        AmdSmiException: If memory query fails
    """
```

### VRAM Usage

```python { .api }
def amdsmi_get_gpu_vram_usage(processor_handle):
    """
    Get VRAM usage information.

    Args:
        processor_handle: GPU processor handle

    Returns:
        dict: VRAM info with keys 'vram_total', 'vram_used' (in MB)

    Raises:
        AmdSmiException: If VRAM query fails
    """
```

### Bad Page Information

```python { .api }
def amdsmi_get_gpu_bad_page_info(processor_handle):
    """
    Get information about bad/retired memory pages.

    Args:
        processor_handle: GPU processor handle

    Returns:
        list: List of bad page records, each with keys 'page_address',
              'page_size', 'status'

    Raises:
        AmdSmiException: If bad page query fails
    """
```

**Python Usage Example:**

```python
import amdsmi
from amdsmi import AmdSmiMemoryType

MB = 1024 * 1024  # bytes per MB, as used throughout this page

# Initialize and get GPU handle
amdsmi.amdsmi_init()

try:
    sockets = amdsmi.amdsmi_get_socket_handles()
    processors = amdsmi.amdsmi_get_processor_handles(sockets[0])
    gpu = processors[0]

    # Get VRAM information using structured interface
    vram_info = amdsmi.amdsmi_get_gpu_vram_usage(gpu)
    print(f"VRAM: {vram_info['vram_used']} MB / {vram_info['vram_total']} MB")
    if vram_info['vram_total'] > 0:  # avoid dividing by zero
        usage_percent = (vram_info['vram_used'] / vram_info['vram_total']) * 100
        print(f"VRAM Usage: {usage_percent:.1f}%")

    # Get memory information by type (values are in bytes)
    vram_total = amdsmi.amdsmi_get_gpu_memory_total(gpu, AmdSmiMemoryType.AMDSMI_MEM_TYPE_VRAM)
    vram_used = amdsmi.amdsmi_get_gpu_memory_usage(gpu, AmdSmiMemoryType.AMDSMI_MEM_TYPE_VRAM)
    print(f"VRAM (detailed): {vram_used // MB} MB / {vram_total // MB} MB")

    # Get GTT memory information
    gtt_total = amdsmi.amdsmi_get_gpu_memory_total(gpu, AmdSmiMemoryType.AMDSMI_MEM_TYPE_GTT)
    gtt_used = amdsmi.amdsmi_get_gpu_memory_usage(gpu, AmdSmiMemoryType.AMDSMI_MEM_TYPE_GTT)
    print(f"GTT Memory: {gtt_used // MB} MB / {gtt_total // MB} MB")

    # Check for bad pages
    bad_pages = amdsmi.amdsmi_get_gpu_bad_page_info(gpu)
    if bad_pages:
        print(f"Found {len(bad_pages)} bad memory pages:")
        for i, page in enumerate(bad_pages):
            print(f"  Page {i}: Address 0x{page['page_address']:x}, "
                  f"Size {page['page_size']} bytes")
    else:
        print("No bad memory pages found")

finally:
    # Always release the library, even if a query raised AmdSmiException
    amdsmi.amdsmi_shut_down()
```

## Types

### Memory Types

```c { .api }
typedef enum {
    AMDSMI_MEM_TYPE_VRAM,      // VRAM memory (device local)
    AMDSMI_MEM_TYPE_VIS_VRAM,  // Visible VRAM memory (CPU accessible)
    AMDSMI_MEM_TYPE_GTT        // GTT (Graphics Translation Table) memory
} amdsmi_memory_type_t;
```

### VRAM Information Structure

```c { .api }
typedef struct {
    uint32_t vram_total;   // Total VRAM in MB
    uint32_t vram_used;    // Used VRAM in MB
    uint32_t reserved[2];  // Reserved for future use
} amdsmi_vram_info_t;
```

### Retired Page Record

```c { .api }
typedef struct {
    uint64_t page_address;               // Start address of the page
    uint64_t page_size;                  // Size of the page in bytes
    amdsmi_memory_page_status_t status;  // Page status (reserved, pending, etc.)
} amdsmi_retired_page_record_t;
```

### Memory Page Status

```c { .api }
typedef enum {
    AMDSMI_MEM_PAGE_STATUS_RESERVED,     // Page is reserved and not available
    AMDSMI_MEM_PAGE_STATUS_PENDING,      // Page is marked bad, will be reserved
    AMDSMI_MEM_PAGE_STATUS_UNRESERVABLE  // Unable to reserve this page
} amdsmi_memory_page_status_t;
```
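
The page records and status values above lend themselves to a quick per-status summary. The following is a minimal sketch, not part of the library: `PageStatus` and `summarize_pages` are hypothetical helpers, the integer values 0 to 2 follow from the declaration order of the C enum, and the sketch assumes each record is a dict with the `page_address`, `page_size`, and `status` keys described in the Python bad-page API, with `status` carried as an integer.

```python
from collections import Counter
from enum import IntEnum


class PageStatus(IntEnum):
    """Local stand-in for amdsmi_memory_page_status_t (declaration order)."""
    RESERVED = 0
    PENDING = 1
    UNRESERVABLE = 2


def summarize_pages(records):
    """Count records per status and total the bytes that are already reserved."""
    counts = Counter(PageStatus(r["status"]).name for r in records)
    reserved_bytes = sum(
        r["page_size"] for r in records if r["status"] == PageStatus.RESERVED
    )
    return dict(counts), reserved_bytes


# Hypothetical sample data shaped like the bad-page records
sample = [
    {"page_address": 0x1000, "page_size": 4096, "status": 0},
    {"page_address": 0x5000, "page_size": 4096, "status": 1},
    {"page_address": 0x9000, "page_size": 4096, "status": 0},
]
counts, reserved_bytes = summarize_pages(sample)
print(counts, reserved_bytes)
```

If the installed bindings report `status` as something other than an integer, the conversion in `summarize_pages` would need to be adapted.
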

### GPU Blocks (for RAS)

```c { .api }
typedef enum {
    AMDSMI_GPU_BLOCK_UMC       = 0x0000000000000001,  // UMC (Unified Memory Controller)
    AMDSMI_GPU_BLOCK_SDMA      = 0x0000000000000002,  // SDMA (System DMA)
    AMDSMI_GPU_BLOCK_GFX       = 0x0000000000000004,  // GFX (Graphics)
    AMDSMI_GPU_BLOCK_MMHUB     = 0x0000000000000008,  // MMHUB (Multimedia Hub)
    AMDSMI_GPU_BLOCK_ATHUB     = 0x0000000000000010,  // ATHUB (Address Translation Hub)
    AMDSMI_GPU_BLOCK_PCIE_BIF  = 0x0000000000000020,  // PCIe BIF
    AMDSMI_GPU_BLOCK_HDP       = 0x0000000000000040,  // HDP (Host Data Path)
    AMDSMI_GPU_BLOCK_XGMI_WAFL = 0x0000000000000080,  // XGMI
    AMDSMI_GPU_BLOCK_DF        = 0x0000000000000100,  // Data Fabric
    AMDSMI_GPU_BLOCK_SMN       = 0x0000000000000200,  // System Memory Network
    AMDSMI_GPU_BLOCK_SEM       = 0x0000000000000400,  // SEM
    AMDSMI_GPU_BLOCK_MP0       = 0x0000000000000800,  // MP0 (Microprocessor 0)
    AMDSMI_GPU_BLOCK_MP1       = 0x0000000000001000,  // MP1 (Microprocessor 1)
    AMDSMI_GPU_BLOCK_FUSE      = 0x0000000000002000   // Fuse
} amdsmi_gpu_block_t;
```
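
Each block value above is a distinct power of two, so a set of blocks can be held in a single integer mask. The sketch below illustrates that idea only: `GPU_BLOCKS` and `block_names` are hypothetical names, the values are copied from the listing, and nothing here implies that the library itself accepts a combined mask (the RAS query takes one block at a time).

```python
# Bit values copied from the amdsmi_gpu_block_t listing
GPU_BLOCKS = {
    "UMC": 0x1,
    "SDMA": 0x2,
    "GFX": 0x4,
    "MMHUB": 0x8,
    "ATHUB": 0x10,
    "PCIE_BIF": 0x20,
    "HDP": 0x40,
    "XGMI_WAFL": 0x80,
    "DF": 0x100,
    "SMN": 0x200,
    "SEM": 0x400,
    "MP0": 0x800,
    "MP1": 0x1000,
    "FUSE": 0x2000,
}


def block_names(mask):
    """Return the names of all blocks whose bit is set in mask."""
    return [name for name, bit in GPU_BLOCKS.items() if mask & bit]


# Track three blocks of interest in one integer
watched = GPU_BLOCKS["UMC"] | GPU_BLOCKS["GFX"] | GPU_BLOCKS["DF"]
print(hex(watched), block_names(watched))
```

A caller that wants RAS status for several blocks could iterate over `block_names(watched)` and issue one query per block.
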

### RAS Error States

```c { .api }
typedef enum {
    AMDSMI_RAS_ERR_STATE_NONE = 0,  // No current errors
    AMDSMI_RAS_ERR_STATE_DISABLED,  // ECC/RAS is disabled
    AMDSMI_RAS_ERR_STATE_PARITY,    // ECC errors present, type unknown
    AMDSMI_RAS_ERR_STATE_SING_C,    // Single correctable error
    AMDSMI_RAS_ERR_STATE_MULT_UC,   // Multiple uncorrectable errors
    AMDSMI_RAS_ERR_STATE_POISON,    // Firmware detected error, page isolated
    AMDSMI_RAS_ERR_STATE_ENABLED    // ECC/RAS is enabled
} amdsmi_ras_err_state_t;
```
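
The states above mix plain on/off status (`DISABLED`, `ENABLED`, `NONE`) with error conditions, so monitoring code usually wants to bucket them. This is one possible bucketing, offered as a sketch: `RasErrState` is a local stand-in whose values follow the C declaration order, and `classify` with its three labels is a hypothetical helper, not a library function.

```python
from enum import IntEnum


class RasErrState(IntEnum):
    """Local stand-in for amdsmi_ras_err_state_t (declaration order, NONE = 0)."""
    NONE = 0
    DISABLED = 1
    PARITY = 2
    SING_C = 3
    MULT_UC = 4
    POISON = 5
    ENABLED = 6


# States that report an error condition rather than an on/off status
ERROR_STATES = {
    RasErrState.PARITY,
    RasErrState.SING_C,
    RasErrState.MULT_UC,
    RasErrState.POISON,
}


def classify(state):
    """Map a raw state value to 'error', 'disabled', or 'ok'."""
    state = RasErrState(state)  # raises ValueError for unknown values
    if state in ERROR_STATES:
        return "error"
    if state is RasErrState.DISABLED:
        return "disabled"
    return "ok"  # NONE or ENABLED


for raw in (0, 1, 4, 6):
    print(RasErrState(raw).name, "->", classify(raw))
```

Whether a single correctable error should count as "error" or only as a warning is a policy decision; the grouping here is an assumption.
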

## Memory Management Workflow

A typical memory monitoring workflow includes:

1. **Query Total Memory**: Use `amdsmi_get_gpu_memory_total()` to get total memory for each type
2. **Monitor Usage**: Use `amdsmi_get_gpu_memory_usage()` to track current memory consumption
3. **Check VRAM Status**: Use `amdsmi_get_gpu_vram_usage()` for structured VRAM information
4. **Monitor Health**: Check for bad pages with `amdsmi_get_gpu_bad_page_info()`
5. **Verify RAS**: Check RAS feature status for critical blocks
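
Once the values from steps 1, 2, and 4 have been collected, they can be folded into a small health report. The sketch below shows one way to do that on plain values, so it runs without a GPU: `memory_report` and the `warn_percent` threshold are hypothetical, and the inputs are assumed to be the byte counts and the bad-page list returned by the Python functions documented above.

```python
def memory_report(total_bytes, used_bytes, bad_pages, warn_percent=90.0):
    """Build a health report from total/used byte counts and bad-page records."""
    # Guard against a zero total so the percentage is always defined
    percent = (used_bytes / total_bytes * 100.0) if total_bytes else 0.0
    warnings = []
    if percent >= warn_percent:
        warnings.append(f"memory usage at {percent:.1f}%")
    if bad_pages:
        warnings.append(f"{len(bad_pages)} bad page(s) reported")
    return {"used_percent": percent, "warnings": warnings}


# Hypothetical readings: 12 GiB used of 16 GiB, one bad page
GIB = 1024 ** 3
report = memory_report(
    16 * GIB,
    12 * GIB,
    bad_pages=[{"page_address": 0x1000, "page_size": 4096, "status": 0}],
    warn_percent=70.0,
)
print(report)
```

In a real monitor the three arguments would come from `amdsmi_get_gpu_memory_total()`, `amdsmi_get_gpu_memory_usage()`, and `amdsmi_get_gpu_bad_page_info()`, polled at whatever interval suits the deployment.
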

## Important Notes

1. **Memory Units**:
   - `amdsmi_get_gpu_memory_*()` functions return values in bytes
   - `amdsmi_get_gpu_vram_usage()` returns values in megabytes

2. **Memory Types**:
   - **VRAM**: GPU's local high-speed memory
   - **VIS_VRAM**: CPU-accessible portion of VRAM (typically smaller)
   - **GTT**: System memory mapped for GPU access

3. **Bad Pages**: Indicate hardware problems and should be monitored in production systems

4. **RAS Features**: Reliability features that may not be available on all GPU models

5. **Virtual Machine Limitations**: Some memory management functions may have limited functionality in virtualized environments

6. **Memory Accounting**: Each memory type is reported separately and has different performance characteristics. Because VIS_VRAM is a portion of VRAM, do not add the two together when computing totals
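
Because the two query styles use different units (note 1), it helps to normalize before comparing them. A minimal sketch, assuming 1 MB means 1024 × 1024 bytes as in the examples on this page; `to_vram_usage_shape` is a hypothetical helper that converts byte counts into the same dict shape that `amdsmi_get_gpu_vram_usage()` is documented to return.

```python
MB = 1024 * 1024  # assumption: the MB figures on this page are 2**20 bytes


def to_vram_usage_shape(total_bytes, used_bytes):
    """Convert byte counts into a dict shaped like the VRAM usage result (MB)."""
    return {"vram_total": total_bytes // MB, "vram_used": used_bytes // MB}


# Hypothetical readings from the byte-based queries: 8 GiB total, 512 MiB used
normalized = to_vram_usage_shape(8 * 1024 ** 3, 512 * MB)
print(normalized)
```

Integer division truncates partial megabytes, so small differences between the two sources are expected and should not be treated as errors.
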