
# Error Handling and RAS

Error detection, RAS (Reliability, Availability, Serviceability) features, ECC error monitoring, and comprehensive error reporting for GPU reliability management.

## Capabilities

### ECC Error Counting

Get ECC (Error Correcting Code) error counts for specific GPU blocks.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec);
amdsmi_status_t amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `block`: GPU block to query (UMC, SDMA, GFX, etc.)
- `ec`: Pointer to receive error count information

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Get ECC errors for UMC (memory controller) block
amdsmi_error_count_t ecc_count;
amdsmi_status_t ret = amdsmi_get_gpu_ecc_count(processor, AMDSMI_GPU_BLOCK_UMC, &ecc_count);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("UMC ECC Errors:\n");
    // Cast for %llu: uint64_t is not guaranteed to be unsigned long long
    printf("  Correctable: %llu\n", (unsigned long long)ecc_count.correctable_err);
    printf("  Uncorrectable: %llu\n", (unsigned long long)ecc_count.uncorrectable_err);
}

// Get total ECC errors across all blocks
amdsmi_error_count_t total_ecc;
ret = amdsmi_get_gpu_total_ecc_count(processor, &total_ecc);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Total ECC Errors:\n");
    printf("  Correctable: %llu\n", (unsigned long long)total_ecc.correctable_err);
    printf("  Uncorrectable: %llu\n", (unsigned long long)total_ecc.uncorrectable_err);
}
```

### ECC Configuration Status

Check which GPU blocks have ECC enabled and get ECC status.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ecc_enabled(amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks);
amdsmi_status_t amdsmi_get_gpu_ecc_status(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_ras_err_state_t *state);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `enabled_blocks`: Pointer to receive bitmask of ECC-enabled blocks
- `block`: GPU block to query for ECC status
- `state`: Pointer to receive ECC/RAS error state

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Check which blocks have ECC enabled
uint64_t enabled_blocks;
amdsmi_status_t ret = amdsmi_get_gpu_ecc_enabled(processor, &enabled_blocks);
if (ret == AMDSMI_STATUS_SUCCESS) {
    // Cast for %llx: uint64_t is not guaranteed to be unsigned long long
    printf("ECC Enabled Blocks: 0x%llx\n", (unsigned long long)enabled_blocks);

    if (enabled_blocks & AMDSMI_GPU_BLOCK_UMC) {
        printf("  UMC (Memory Controller) ECC enabled\n");
    }
    if (enabled_blocks & AMDSMI_GPU_BLOCK_SDMA) {
        printf("  SDMA (System DMA) ECC enabled\n");
    }
    if (enabled_blocks & AMDSMI_GPU_BLOCK_GFX) {
        printf("  GFX (Graphics) ECC enabled\n");
    }
}

// Get ECC status for specific block
amdsmi_ras_err_state_t ecc_state;
ret = amdsmi_get_gpu_ecc_status(processor, AMDSMI_GPU_BLOCK_UMC, &ecc_state);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("UMC ECC Status: %d\n", ecc_state);
}
```
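
The `enabled_blocks` bitmask can also be decoded generically rather than one `if` per block. The following is a minimal sketch in plain Python that does not call the library: the bit values are copied from the `amdsmi_gpu_block_t` listing under Types, and `decode_ecc_enabled` and `unknown_bits` are hypothetical helper names, not part of AMD SMI.

```python
from __future__ import annotations

# Bit values copied from the amdsmi_gpu_block_t listing in this document.
GPU_BLOCK_BITS = {
    "UMC": 0x0001,
    "SDMA": 0x0002,
    "GFX": 0x0004,
    "MMHUB": 0x0008,
    "ATHUB": 0x0010,
    "PCIE_BIF": 0x0020,
    "HDP": 0x0040,
    "XGMI_WAFL": 0x0080,
    "DF": 0x0100,
    "SMN": 0x0200,
    "SEM": 0x0400,
    "MP0": 0x0800,
    "MP1": 0x1000,
    "FUSE": 0x2000,
}


def decode_ecc_enabled(mask: int) -> list[str]:
    """Return the names of all blocks whose bit is set in `mask` (hypothetical helper)."""
    return [name for name, bit in GPU_BLOCK_BITS.items() if mask & bit]


def unknown_bits(mask: int) -> int:
    """Return the bits of `mask` that do not appear in the table above."""
    known = 0
    for bit in GPU_BLOCK_BITS.values():
        known |= bit
    return mask & ~known


if __name__ == "__main__":
    # 0x5 has the UMC and GFX bits set.
    print(decode_ecc_enabled(0x5))
```

Reporting unknown bits separately keeps the decoder from silently dropping blocks that a newer library version might add.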

### RAS Block Features

Check if RAS features are enabled for specific GPU blocks.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ras_block_features_enabled(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_ras_err_state_t *state);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `block`: GPU block to query
- `state`: Pointer to receive RAS error state

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure
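
One way to use the per-block state is to survey several blocks and sort them into "ok", "disabled", and "error". The sketch below is illustrative only and does not call the library: `RasErrState` is a local mirror of the `amdsmi_ras_err_state_t` values listed under Types (assuming the sequential numbering that listing implies), and `survey_ras` takes a hypothetical `query` callable that stands in for the real per-block lookup.

```python
from enum import IntEnum
from typing import Callable, Dict, Iterable


class RasErrState(IntEnum):
    """Local mirror of amdsmi_ras_err_state_t as listed under Types."""
    NONE = 0
    DISABLED = 1
    PARITY = 2
    SING_C = 3
    MULT_UC = 4
    POISON = 5
    ENABLED = 6


# States that report an error condition rather than a configuration.
ERROR_STATES = {
    RasErrState.PARITY,
    RasErrState.SING_C,
    RasErrState.MULT_UC,
    RasErrState.POISON,
}


def survey_ras(query: Callable[[str], int], blocks: Iterable[str]) -> Dict[str, str]:
    """Classify each block as 'error', 'disabled', or 'ok'.

    `query` is a hypothetical callable that returns the raw integer
    state for a block name; in real code it would wrap the library call.
    """
    report = {}
    for block in blocks:
        state = RasErrState(query(block))
        if state in ERROR_STATES:
            report[block] = "error"
        elif state is RasErrState.DISABLED:
            report[block] = "disabled"
        else:
            report[block] = "ok"
    return report


if __name__ == "__main__":
    # Fake states for illustration: UMC enabled, SDMA disabled, GFX with errors.
    fake_states = {"UMC": 6, "SDMA": 1, "GFX": 4}
    print(survey_ras(fake_states.__getitem__, fake_states))
```

Passing the lookup in as a callable keeps the classification logic testable on machines without a GPU.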

### Status Code Translation

Convert AMD SMI status codes to human-readable strings.

```c { .api }
amdsmi_status_t amdsmi_status_code_to_string(amdsmi_status_t status, const char **status_string);
```

**Parameters:**
- `status`: Status code to convert
- `status_string`: Pointer to receive status string (library-owned memory)

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Any failing call can be translated; GPU activity is used here as an example
amdsmi_engine_usage_t activity;
amdsmi_status_t result = amdsmi_get_gpu_activity(processor, &activity);
if (result != AMDSMI_STATUS_SUCCESS) {
    const char *error_string;  // Library-owned; do not free
    amdsmi_status_t str_ret = amdsmi_status_code_to_string(result, &error_string);
    if (str_ret == AMDSMI_STATUS_SUCCESS) {
        printf("Error getting GPU activity: %s\n", error_string);
    } else {
        // Fall back to the numeric code if translation itself fails
        printf("Error getting GPU activity: Status code %d\n", result);
    }
}
```

### XGMI Error Management

Get and reset XGMI (Infinity Fabric) error status.

```c { .api }
amdsmi_status_t amdsmi_gpu_xgmi_error_status(amdsmi_processor_handle processor_handle, amdsmi_xgmi_status_t *status);
amdsmi_status_t amdsmi_reset_gpu_xgmi_error(amdsmi_processor_handle processor_handle);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `status`: Pointer to receive XGMI error status

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Check XGMI error status
amdsmi_xgmi_status_t xgmi_status;
amdsmi_status_t ret = amdsmi_gpu_xgmi_error_status(processor, &xgmi_status);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("XGMI Status: %d\n", xgmi_status);

    // If errors are present, reset them
    if (xgmi_status != AMDSMI_XGMI_STATUS_NO_ERRORS) {
        ret = amdsmi_reset_gpu_xgmi_error(processor);
        if (ret == AMDSMI_STATUS_SUCCESS) {
            printf("XGMI errors reset\n");
        }
    }
}
```

## Python API

### ECC Error Information

```python { .api }
def amdsmi_get_gpu_ecc_count(processor_handle, block):
    """
    Get ECC error count for a specific GPU block.

    Args:
        processor_handle: GPU processor handle
        block (AmdSmiGpuBlock): GPU block to query

    Returns:
        dict: Error counts with keys 'correctable_err', 'uncorrectable_err'

    Raises:
        AmdSmiException: If ECC count query fails
    """

def amdsmi_get_gpu_total_ecc_count(processor_handle):
    """
    Get total ECC error count across all GPU blocks.

    Args:
        processor_handle: GPU processor handle

    Returns:
        dict: Total error counts with keys 'correctable_err', 'uncorrectable_err'

    Raises:
        AmdSmiException: If total ECC count query fails
    """
```

### ECC Status

```python { .api }
def amdsmi_get_gpu_ecc_enabled(processor_handle):
    """
    Get bitmask of ECC-enabled GPU blocks.

    Args:
        processor_handle: GPU processor handle

    Returns:
        int: Bitmask of enabled blocks

    Raises:
        AmdSmiException: If ECC enabled query fails
    """

def amdsmi_get_gpu_ecc_status(processor_handle, block):
    """
    Get ECC status for a specific GPU block.

    Args:
        processor_handle: GPU processor handle
        block (AmdSmiGpuBlock): GPU block to query

    Returns:
        AmdSmiRasErrState: ECC/RAS error state

    Raises:
        AmdSmiException: If ECC status query fails
    """
```

### Error Handling

```python { .api }
def amdsmi_status_code_to_string(status):
    """
    Convert status code to human-readable string.

    Args:
        status (AmdSmiRetCode): Status code to convert

    Returns:
        str: Human-readable error description

    Raises:
        AmdSmiException: If status code conversion fails
    """
```

**Python Usage Example:**

```python
import amdsmi
from amdsmi import AmdSmiGpuBlock, AmdSmiRasErrState

# Initialize and get GPU handle
amdsmi.amdsmi_init()

try:
    sockets = amdsmi.amdsmi_get_socket_handles()
    processors = amdsmi.amdsmi_get_processor_handles(sockets[0])
    gpu = processors[0]

    # Check which blocks have ECC enabled
    enabled_blocks = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)
    print(f"ECC Enabled Blocks: 0x{enabled_blocks:x}")

    # Check specific blocks
    if enabled_blocks & AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC:
        print("UMC (Memory Controller) has ECC enabled")

        # Get UMC ECC error count
        umc_errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC)
        print(f"UMC ECC Errors: {umc_errors['correctable_err']} correctable, "
              f"{umc_errors['uncorrectable_err']} uncorrectable")

        # Get UMC ECC status
        umc_status = amdsmi.amdsmi_get_gpu_ecc_status(gpu, AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC)
        print(f"UMC ECC Status: {umc_status}")

    # Get total ECC errors
    total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
    print(f"Total ECC Errors: {total_errors['correctable_err']} correctable, "
          f"{total_errors['uncorrectable_err']} uncorrectable")

    # Check XGMI errors (if applicable)
    try:
        xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
        print(f"XGMI Status: {xgmi_status}")
    except amdsmi.AmdSmiException as e:
        print(f"XGMI status not available: {e}")

except amdsmi.AmdSmiException as e:
    error_string = amdsmi.amdsmi_status_code_to_string(e.get_status_code())
    print(f"Error: {error_string}")

finally:
    amdsmi.amdsmi_shut_down()
```

## Types

### Error Count Structure

```c { .api }
typedef struct {
    uint64_t correctable_err;    // Number of correctable errors
    uint64_t uncorrectable_err;  // Number of uncorrectable errors
    uint64_t reserved[2];        // Reserved for future use
} amdsmi_error_count_t;
```

### RAS Error States

```c { .api }
typedef enum {
    AMDSMI_RAS_ERR_STATE_NONE = 0,   // No current errors
    AMDSMI_RAS_ERR_STATE_DISABLED,   // ECC/RAS is disabled
    AMDSMI_RAS_ERR_STATE_PARITY,     // ECC errors present, type unknown
    AMDSMI_RAS_ERR_STATE_SING_C,     // Single correctable error
    AMDSMI_RAS_ERR_STATE_MULT_UC,    // Multiple uncorrectable errors
    AMDSMI_RAS_ERR_STATE_POISON,     // Firmware detected error, page isolated
    AMDSMI_RAS_ERR_STATE_ENABLED     // ECC/RAS is enabled
} amdsmi_ras_err_state_t;
```

### GPU Blocks

```c { .api }
typedef enum {
    AMDSMI_GPU_BLOCK_UMC       = 0x0000000000000001, // UMC (Unified Memory Controller)
    AMDSMI_GPU_BLOCK_SDMA      = 0x0000000000000002, // SDMA (System DMA)
    AMDSMI_GPU_BLOCK_GFX       = 0x0000000000000004, // GFX (Graphics)
    AMDSMI_GPU_BLOCK_MMHUB     = 0x0000000000000008, // MMHUB (Multimedia Hub)
    AMDSMI_GPU_BLOCK_ATHUB     = 0x0000000000000010, // ATHUB (Address Translation Hub)
    AMDSMI_GPU_BLOCK_PCIE_BIF  = 0x0000000000000020, // PCIe BIF
    AMDSMI_GPU_BLOCK_HDP       = 0x0000000000000040, // HDP (Host Data Path)
    AMDSMI_GPU_BLOCK_XGMI_WAFL = 0x0000000000000080, // XGMI
    AMDSMI_GPU_BLOCK_DF        = 0x0000000000000100, // Data Fabric
    AMDSMI_GPU_BLOCK_SMN       = 0x0000000000000200, // System Management Network
    AMDSMI_GPU_BLOCK_SEM       = 0x0000000000000400, // SEM
    AMDSMI_GPU_BLOCK_MP0       = 0x0000000000000800, // MP0 (Microprocessor 0)
    AMDSMI_GPU_BLOCK_MP1       = 0x0000000000001000, // MP1 (Microprocessor 1)
    AMDSMI_GPU_BLOCK_FUSE      = 0x0000000000002000  // Fuse
} amdsmi_gpu_block_t;
```

### XGMI Status

```c { .api }
typedef enum {
    AMDSMI_XGMI_STATUS_NO_ERRORS = 0,    // No XGMI errors
    AMDSMI_XGMI_STATUS_ERROR,            // XGMI errors present
    AMDSMI_XGMI_STATUS_MULTIPLE_ERRORS   // Multiple XGMI errors
} amdsmi_xgmi_status_t;
```

### Status Codes (Key Error Codes)

```c { .api }
typedef enum {
    AMDSMI_STATUS_SUCCESS = 0,               // Call succeeded
    AMDSMI_STATUS_INVAL = 1,                 // Invalid parameters
    AMDSMI_STATUS_NOT_SUPPORTED = 2,         // Command not supported
    AMDSMI_STATUS_NOT_YET_IMPLEMENTED = 3,   // Not implemented yet
    AMDSMI_STATUS_FAIL_LOAD_MODULE = 4,      // Failed to load library
    AMDSMI_STATUS_FAIL_LOAD_SYMBOL = 5,      // Failed to load symbol
    AMDSMI_STATUS_DRM_ERROR = 6,             // Error when calling libdrm
    AMDSMI_STATUS_API_FAILED = 7,            // API call failed
    AMDSMI_STATUS_TIMEOUT = 8,               // Timeout in API call
    AMDSMI_STATUS_RETRY = 9,                 // Retry operation
    AMDSMI_STATUS_NO_PERM = 10,              // Permission denied
    AMDSMI_STATUS_INTERRUPT = 11,            // Interrupt occurred during execution
    AMDSMI_STATUS_IO = 12,                   // I/O Error
    AMDSMI_STATUS_ADDRESS_FAULT = 13,        // Bad address
    AMDSMI_STATUS_FILE_ERROR = 14,           // Problem accessing a file
    AMDSMI_STATUS_OUT_OF_RESOURCES = 15,     // Not enough memory
    AMDSMI_STATUS_INTERNAL_EXCEPTION = 16,   // Internal exception caught
    AMDSMI_STATUS_INPUT_OUT_OF_BOUNDS = 17,  // Input out of allowable range
    AMDSMI_STATUS_INIT_ERROR = 18,           // Error during initialization
    AMDSMI_STATUS_REFCOUNT_OVERFLOW = 19,    // Reference counter overflow
    AMDSMI_STATUS_BUSY = 30,                 // Device busy
    AMDSMI_STATUS_NOT_FOUND = 31,            // Device not found
    AMDSMI_STATUS_NOT_INIT = 32,             // Device not initialized
    AMDSMI_STATUS_NO_SLOT = 33,              // No more free slot
    AMDSMI_STATUS_NO_DATA = 40,              // No data found
    AMDSMI_STATUS_INSUFFICIENT_SIZE = 41,    // Not enough resources
    AMDSMI_STATUS_UNEXPECTED_SIZE = 42,      // Unexpected amount of data
    AMDSMI_STATUS_UNEXPECTED_DATA = 43,      // Unexpected data
    AMDSMI_STATUS_MAP_ERROR = 0xFFFFFFFE,    // Internal library error mapping failed
    AMDSMI_STATUS_UNKNOWN_ERROR = 0xFFFFFFFF // Unknown error occurred
} amdsmi_status_t;
```
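
Not every non-zero status calls for the same reaction. As a rough sketch, the codes above can be grouped into "retry", "skip", and "fail" buckets. The numeric values below are copied from the listing; the grouping itself and the helper names `classify_status` and `call_with_retry` are assumptions made for illustration, not library behavior.

```python
import time
from typing import Callable

# Numeric values copied from the status code listing above.
AMDSMI_STATUS_SUCCESS = 0
AMDSMI_STATUS_NOT_SUPPORTED = 2
AMDSMI_STATUS_NOT_YET_IMPLEMENTED = 3
AMDSMI_STATUS_TIMEOUT = 8
AMDSMI_STATUS_RETRY = 9
AMDSMI_STATUS_BUSY = 30

# Assumed policy: these look transient, so trying again may help.
TRANSIENT = {AMDSMI_STATUS_TIMEOUT, AMDSMI_STATUS_RETRY, AMDSMI_STATUS_BUSY}
# Assumed policy: the feature is absent on this device, so skip it quietly.
UNSUPPORTED = {AMDSMI_STATUS_NOT_SUPPORTED, AMDSMI_STATUS_NOT_YET_IMPLEMENTED}


def classify_status(code: int) -> str:
    """Map a raw status code to 'ok', 'retry', 'skip', or 'fail'."""
    if code == AMDSMI_STATUS_SUCCESS:
        return "ok"
    if code in TRANSIENT:
        return "retry"
    if code in UNSUPPORTED:
        return "skip"
    return "fail"


def call_with_retry(op: Callable[[], int], attempts: int = 3, delay_s: float = 0.0) -> int:
    """Call `op` up to `attempts` times while it reports a transient status.

    `op` is a hypothetical zero-argument callable returning a status code.
    """
    code = op()
    for _ in range(attempts - 1):
        if classify_status(code) != "retry":
            break
        time.sleep(delay_s)
        code = op()
    return code


if __name__ == "__main__":
    # Simulated call that is busy twice and then succeeds.
    codes = iter([AMDSMI_STATUS_BUSY, AMDSMI_STATUS_RETRY, AMDSMI_STATUS_SUCCESS])
    print(classify_status(call_with_retry(lambda: next(codes))))
```

Treating `NOT_SUPPORTED` as "skip" rather than "fail" fits the notes below that ECC and RAS features vary by GPU model and may be unavailable in virtualized environments.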

## Error Handling Best Practices

1. **Always Check Return Values**: Every AMD SMI C function returns an `amdsmi_status_t`; check it before using any output parameter. The Python functions raise `AmdSmiException` instead.
2. **Use Status Code Translation**: Convert error codes to strings with `amdsmi_status_code_to_string` for better error reporting.
3. **Monitor ECC Errors**: Regularly check ECC error counts, especially in production environments.
4. **Handle Uncorrectable Errors**: Uncorrectable ECC errors may indicate serious hardware issues.
5. **Log Error Information**: Maintain logs of error counts and types for trend analysis.
6. **Reset XGMI Errors**: Clear XGMI errors after handling to prevent false positives.

## Important Notes

1. **ECC Availability**: ECC features depend on GPU model and may not be available on all devices.
2. **Error Types**:
   - **Correctable Errors**: Automatically fixed by ECC, but indicate potential issues
   - **Uncorrectable Errors**: Cannot be corrected, may cause data corruption or crashes
3. **GPU Blocks**: Different GPU blocks have different error reporting capabilities.
4. **RAS Features**: Reliability, Availability, and Serviceability features provide enterprise-grade error management.
5. **XGMI Errors**: Only relevant in multi-GPU systems with Infinity Fabric connections.
6. **Virtual Machine Limitations**: Some RAS features may not be available in virtualized environments.
7. **Production Monitoring**: ECC error monitoring is critical for data center and HPC environments.
8. **Error Thresholds**: Establish monitoring thresholds for correctable errors to predict hardware issues.
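
A threshold check of the kind described in the last note might compare two snapshots of the ECC counters. The sketch below is a minimal illustration under stated assumptions: `EccSample` and `evaluate` are hypothetical names, the default of 10 correctable errors per hour is an arbitrary placeholder rather than a recommended value, and the counter values would in practice come from the ECC count queries documented above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EccSample:
    """One snapshot of ECC counters (hypothetical container)."""
    timestamp: float      # Seconds since an arbitrary epoch
    correctable: int
    uncorrectable: int


def evaluate(prev: EccSample, curr: EccSample,
             max_correctable_per_hour: float = 10.0) -> List[str]:
    """Return alert messages for the interval between two snapshots."""
    new_ce = curr.correctable - prev.correctable
    new_ue = curr.uncorrectable - prev.uncorrectable

    # Counters going backwards suggests a reset (for example a driver reload).
    if new_ce < 0 or new_ue < 0:
        return ["counter reset detected; re-baseline"]

    alerts = []
    # Any new uncorrectable error is reported immediately.
    if new_ue > 0:
        alerts.append(f"{new_ue} new uncorrectable error(s)")

    # Correctable errors are judged by rate, not by absolute count.
    elapsed_h = (curr.timestamp - prev.timestamp) / 3600.0
    if elapsed_h > 0:
        rate = new_ce / elapsed_h
        if rate > max_correctable_per_hour:
            alerts.append(
                f"correctable rate {rate:.1f}/h exceeds {max_correctable_per_hour:.1f}/h"
            )
    return alerts


if __name__ == "__main__":
    before = EccSample(timestamp=0.0, correctable=5, uncorrectable=0)
    after = EccSample(timestamp=3600.0, correctable=30, uncorrectable=1)
    for message in evaluate(before, after):
        print(message)
```

Judging correctable errors by rate while flagging every uncorrectable error reflects the distinction drawn in the Error Types note: the former are a trend signal, the latter may already have caused data corruption.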