# Error Handling and RAS

This section covers error detection, RAS (Reliability, Availability, Serviceability) features, ECC error monitoring, and error reporting for GPU reliability management.

## Capabilities

### ECC Error Counting

Get ECC (Error Correcting Code) error counts for a specific GPU block, or the total across all blocks.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec);
amdsmi_status_t amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `block`: GPU block to query (UMC, SDMA, GFX, etc.)
- `ec`: Pointer to receive error count information

**Returns:** `amdsmi_status_t` - `AMDSMI_STATUS_SUCCESS` on success, error code on failure

**Usage Example:**

```c
// Get ECC errors for the UMC (memory controller) block
amdsmi_error_count_t ecc_count;
amdsmi_status_t ret = amdsmi_get_gpu_ecc_count(processor, AMDSMI_GPU_BLOCK_UMC, &ecc_count);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("UMC ECC Errors:\n");
    // Cast uint64_t so it matches the %llu conversion on every platform
    printf("  Correctable: %llu\n", (unsigned long long)ecc_count.correctable_err);
    printf("  Uncorrectable: %llu\n", (unsigned long long)ecc_count.uncorrectable_err);
}

// Get total ECC errors across all blocks
amdsmi_error_count_t total_ecc;
ret = amdsmi_get_gpu_total_ecc_count(processor, &total_ecc);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Total ECC Errors:\n");
    printf("  Correctable: %llu\n", (unsigned long long)total_ecc.correctable_err);
    printf("  Uncorrectable: %llu\n", (unsigned long long)total_ecc.uncorrectable_err);
}
```

### ECC Configuration Status

Check which GPU blocks have ECC enabled and get the ECC status of a specific block.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ecc_enabled(amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks);
amdsmi_status_t amdsmi_get_gpu_ecc_status(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_ras_err_state_t *state);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `enabled_blocks`: Pointer to receive bitmask of ECC-enabled blocks
- `block`: GPU block to query for ECC status
- `state`: Pointer to receive ECC/RAS error state

**Returns:** `amdsmi_status_t` - `AMDSMI_STATUS_SUCCESS` on success, error code on failure

**Usage Example:**

```c
// Check which blocks have ECC enabled
uint64_t enabled_blocks;
amdsmi_status_t ret = amdsmi_get_gpu_ecc_enabled(processor, &enabled_blocks);
if (ret == AMDSMI_STATUS_SUCCESS) {
    // Cast uint64_t so it matches the %llx conversion on every platform
    printf("ECC Enabled Blocks: 0x%llx\n", (unsigned long long)enabled_blocks);

    if (enabled_blocks & AMDSMI_GPU_BLOCK_UMC) {
        printf("  UMC (Memory Controller) ECC enabled\n");
    }
    if (enabled_blocks & AMDSMI_GPU_BLOCK_SDMA) {
        printf("  SDMA (System DMA) ECC enabled\n");
    }
    if (enabled_blocks & AMDSMI_GPU_BLOCK_GFX) {
        printf("  GFX (Graphics) ECC enabled\n");
    }
}

// Get ECC status for a specific block
amdsmi_ras_err_state_t ecc_state;
ret = amdsmi_get_gpu_ecc_status(processor, AMDSMI_GPU_BLOCK_UMC, &ecc_state);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("UMC ECC Status: %d\n", (int)ecc_state);
}
```

### RAS Block Features

Check if RAS features are enabled for specific GPU blocks.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_ras_block_features_enabled(amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_ras_err_state_t *state);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `block`: GPU block to query
- `state`: Pointer to receive RAS error state

**Returns:** `amdsmi_status_t` - `AMDSMI_STATUS_SUCCESS` on success, error code on failure

### Status Code Translation

Convert AMD SMI status codes to human-readable strings.

```c { .api }
amdsmi_status_t amdsmi_status_code_to_string(amdsmi_status_t status, const char **status_string);
```

**Parameters:**
- `status`: Status code to convert
- `status_string`: Pointer to receive status string (library-owned memory; do not free it)

**Returns:** `amdsmi_status_t` - `AMDSMI_STATUS_SUCCESS` on success, error code on failure

**Usage Example:**

```c
// Any AMD SMI call can serve as the source of a status code
amdsmi_engine_usage_t activity;
amdsmi_status_t result = amdsmi_get_gpu_activity(processor, &activity);
if (result != AMDSMI_STATUS_SUCCESS) {
    const char *error_string;
    amdsmi_status_t str_ret = amdsmi_status_code_to_string(result, &error_string);
    if (str_ret == AMDSMI_STATUS_SUCCESS) {
        printf("Error getting GPU activity: %s\n", error_string);
    } else {
        // Fall back to the numeric code if translation itself fails
        printf("Error getting GPU activity: Status code %d\n", (int)result);
    }
}
```

### XGMI Error Management

Get and reset XGMI (Infinity Fabric) error status.

```c { .api }
amdsmi_status_t amdsmi_gpu_xgmi_error_status(amdsmi_processor_handle processor_handle, amdsmi_xgmi_status_t *status);
amdsmi_status_t amdsmi_reset_gpu_xgmi_error(amdsmi_processor_handle processor_handle);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `status`: Pointer to receive XGMI error status

**Returns:** `amdsmi_status_t` - `AMDSMI_STATUS_SUCCESS` on success, error code on failure

**Usage Example:**

```c
// Check XGMI error status
amdsmi_xgmi_status_t xgmi_status;
amdsmi_status_t ret = amdsmi_gpu_xgmi_error_status(processor, &xgmi_status);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("XGMI Status: %d\n", (int)xgmi_status);

    // If errors are present, reset them
    if (xgmi_status != AMDSMI_XGMI_STATUS_NO_ERRORS) {
        ret = amdsmi_reset_gpu_xgmi_error(processor);
        if (ret == AMDSMI_STATUS_SUCCESS) {
            printf("XGMI errors reset\n");
        }
    }
}
```

## Python API

### ECC Error Information

```python { .api }
def amdsmi_get_gpu_ecc_count(processor_handle, block):
    """
    Get ECC error count for a specific GPU block.

    Args:
        processor_handle: GPU processor handle
        block (AmdSmiGpuBlock): GPU block to query

    Returns:
        dict: Error counts with keys 'correctable_err', 'uncorrectable_err'

    Raises:
        AmdSmiException: If ECC count query fails
    """

def amdsmi_get_gpu_total_ecc_count(processor_handle):
    """
    Get total ECC error count across all GPU blocks.

    Args:
        processor_handle: GPU processor handle

    Returns:
        dict: Total error counts with keys 'correctable_err', 'uncorrectable_err'

    Raises:
        AmdSmiException: If total ECC count query fails
    """
```

### ECC Status

```python { .api }
def amdsmi_get_gpu_ecc_enabled(processor_handle):
    """
    Get bitmask of ECC-enabled GPU blocks.

    Args:
        processor_handle: GPU processor handle

    Returns:
        int: Bitmask of enabled blocks

    Raises:
        AmdSmiException: If ECC enabled query fails
    """

def amdsmi_get_gpu_ecc_status(processor_handle, block):
    """
    Get ECC status for a specific GPU block.

    Args:
        processor_handle: GPU processor handle
        block (AmdSmiGpuBlock): GPU block to query

    Returns:
        AmdSmiRasErrState: ECC/RAS error state

    Raises:
        AmdSmiException: If ECC status query fails
    """
```

### Error Handling

```python { .api }
def amdsmi_status_code_to_string(status):
    """
    Convert status code to human-readable string.

    Args:
        status (AmdSmiRetCode): Status code to convert

    Returns:
        str: Human-readable error description

    Raises:
        AmdSmiException: If status code conversion fails
    """
```

**Python Usage Example:**

```python
import amdsmi
from amdsmi import AmdSmiGpuBlock, AmdSmiRasErrState

# Initialize and get GPU handle
amdsmi.amdsmi_init()

try:
    sockets = amdsmi.amdsmi_get_socket_handles()
    processors = amdsmi.amdsmi_get_processor_handles(sockets[0])
    gpu = processors[0]

    # Check which blocks have ECC enabled
    enabled_blocks = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)
    print(f"ECC Enabled Blocks: 0x{enabled_blocks:x}")

    # Check specific blocks
    if enabled_blocks & AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC:
        print("UMC (Memory Controller) has ECC enabled")

        # Get UMC ECC error count
        umc_errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC)
        print(f"UMC ECC Errors: {umc_errors['correctable_err']} correctable, "
              f"{umc_errors['uncorrectable_err']} uncorrectable")

        # Get UMC ECC status
        umc_status = amdsmi.amdsmi_get_gpu_ecc_status(gpu, AmdSmiGpuBlock.AMDSMI_GPU_BLOCK_UMC)
        print(f"UMC ECC Status: {umc_status}")

    # Get total ECC errors
    total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
    print(f"Total ECC Errors: {total_errors['correctable_err']} correctable, "
          f"{total_errors['uncorrectable_err']} uncorrectable")

    # Check XGMI errors (if applicable)
    try:
        xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
        print(f"XGMI Status: {xgmi_status}")
    except amdsmi.AmdSmiException as e:
        print(f"XGMI status not available: {e}")

except amdsmi.AmdSmiException as e:
    error_string = amdsmi.amdsmi_status_code_to_string(e.get_status_code())
    print(f"Error: {error_string}")

finally:
    amdsmi.amdsmi_shut_down()
```

## Types

### Error Count Structure

```c { .api }
typedef struct {
    uint64_t correctable_err;    // Number of correctable errors
    uint64_t uncorrectable_err;  // Number of uncorrectable errors
    uint64_t reserved[2];        // Reserved for future use
} amdsmi_error_count_t;
```

### RAS Error States

```c { .api }
typedef enum {
    AMDSMI_RAS_ERR_STATE_NONE = 0,   // No current errors
    AMDSMI_RAS_ERR_STATE_DISABLED,   // ECC/RAS is disabled
    AMDSMI_RAS_ERR_STATE_PARITY,     // ECC errors present, type unknown
    AMDSMI_RAS_ERR_STATE_SING_C,     // Single correctable error
    AMDSMI_RAS_ERR_STATE_MULT_UC,    // Multiple uncorrectable errors
    AMDSMI_RAS_ERR_STATE_POISON,     // Firmware detected error, page isolated
    AMDSMI_RAS_ERR_STATE_ENABLED     // ECC/RAS is enabled
} amdsmi_ras_err_state_t;
```
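
The state values above are easier to act on once they are mapped to a description and an "is this an error" flag. The following is a minimal, self-contained sketch that does not import `amdsmi`: the `RasErrState` class is a local mirror of the enum above, and both `describe_ras_state` and the choice of which states count as errors are assumptions made for illustration, not part of the library.

```python
from enum import IntEnum


class RasErrState(IntEnum):
    """Local mirror of amdsmi_ras_err_state_t as listed above (illustrative only)."""
    NONE = 0
    DISABLED = 1
    PARITY = 2
    SING_C = 3
    MULT_UC = 4
    POISON = 5
    ENABLED = 6


# Descriptions copied from the comments in the C enum above.
DESCRIPTIONS = {
    RasErrState.NONE: "No current errors",
    RasErrState.DISABLED: "ECC/RAS is disabled",
    RasErrState.PARITY: "ECC errors present, type unknown",
    RasErrState.SING_C: "Single correctable error",
    RasErrState.MULT_UC: "Multiple uncorrectable errors",
    RasErrState.POISON: "Firmware detected error, page isolated",
    RasErrState.ENABLED: "ECC/RAS is enabled",
}

# Assumption: these are the states a monitoring script would flag.
ERROR_STATES = {
    RasErrState.PARITY,
    RasErrState.SING_C,
    RasErrState.MULT_UC,
    RasErrState.POISON,
}


def describe_ras_state(raw_value):
    """Return (description, is_error) for a raw integer state value."""
    try:
        state = RasErrState(raw_value)
    except ValueError:
        # Values outside the enum are reported rather than raised.
        return (f"Unknown state {raw_value}", False)
    return (DESCRIPTIONS[state], state in ERROR_STATES)


if __name__ == "__main__":
    for value in range(7):
        print(value, describe_ras_state(value))
```

A caller would pass the integer value of whatever state the status query returned, then decide from the flag whether to log or alert.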

### GPU Blocks

```c { .api }
typedef enum {
    AMDSMI_GPU_BLOCK_UMC       = 0x0000000000000001,  // UMC (Unified Memory Controller)
    AMDSMI_GPU_BLOCK_SDMA      = 0x0000000000000002,  // SDMA (System DMA)
    AMDSMI_GPU_BLOCK_GFX       = 0x0000000000000004,  // GFX (Graphics)
    AMDSMI_GPU_BLOCK_MMHUB     = 0x0000000000000008,  // MMHUB (Multimedia Hub)
    AMDSMI_GPU_BLOCK_ATHUB     = 0x0000000000000010,  // ATHUB (Address Translation Hub)
    AMDSMI_GPU_BLOCK_PCIE_BIF  = 0x0000000000000020,  // PCIe BIF
    AMDSMI_GPU_BLOCK_HDP       = 0x0000000000000040,  // HDP (Host Data Path)
    AMDSMI_GPU_BLOCK_XGMI_WAFL = 0x0000000000000080,  // XGMI
    AMDSMI_GPU_BLOCK_DF        = 0x0000000000000100,  // Data Fabric
    AMDSMI_GPU_BLOCK_SMN       = 0x0000000000000200,  // System Management Network
    AMDSMI_GPU_BLOCK_SEM       = 0x0000000000000400,  // SEM
    AMDSMI_GPU_BLOCK_MP0       = 0x0000000000000800,  // MP0 (Microprocessor 0)
    AMDSMI_GPU_BLOCK_MP1       = 0x0000000000001000,  // MP1 (Microprocessor 1)
    AMDSMI_GPU_BLOCK_FUSE      = 0x0000000000002000   // Fuse
} amdsmi_gpu_block_t;
```
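
Because each block occupies one bit, the mask returned by the ECC-enabled query can be decoded by testing every bit in turn. The sketch below shows one way to do that without importing `amdsmi`: the `GPU_BLOCKS` table is a local copy of the values listed above, and `decode_enabled_blocks` is a hypothetical helper name, not a library function.

```python
# Local copy of the block bit values listed in the enum above (illustrative only).
GPU_BLOCKS = {
    "UMC": 0x0001,
    "SDMA": 0x0002,
    "GFX": 0x0004,
    "MMHUB": 0x0008,
    "ATHUB": 0x0010,
    "PCIE_BIF": 0x0020,
    "HDP": 0x0040,
    "XGMI_WAFL": 0x0080,
    "DF": 0x0100,
    "SMN": 0x0200,
    "SEM": 0x0400,
    "MP0": 0x0800,
    "MP1": 0x1000,
    "FUSE": 0x2000,
}


def decode_enabled_blocks(mask):
    """Return the names of all blocks whose bit is set in mask, in table order."""
    return [name for name, bit in GPU_BLOCKS.items() if mask & bit]


if __name__ == "__main__":
    # 0x5 has the UMC bit (0x1) and the GFX bit (0x4) set.
    print(decode_enabled_blocks(0x5))
```

Keeping the table in one place avoids repeating an `if mask & ...` test per block, at the cost of having to update the copy if the library's enum changes.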
350
351
### XGMI Status
352
353
```c { .api }
354
typedef enum {
355
AMDSMI_XGMI_STATUS_NO_ERRORS = 0, // No XGMI errors
356
AMDSMI_XGMI_STATUS_ERROR, // XGMI errors present
357
AMDSMI_XGMI_STATUS_MULTIPLE_ERRORS // Multiple XGMI errors
358
} amdsmi_xgmi_status_t;
359
```
360
361
### Status Codes (Key Error Codes)

```c { .api }
typedef enum {
    AMDSMI_STATUS_SUCCESS = 0,               // Call succeeded
    AMDSMI_STATUS_INVAL = 1,                 // Invalid parameters
    AMDSMI_STATUS_NOT_SUPPORTED = 2,         // Command not supported
    AMDSMI_STATUS_NOT_YET_IMPLEMENTED = 3,   // Not implemented yet
    AMDSMI_STATUS_FAIL_LOAD_MODULE = 4,      // Failed to load library
    AMDSMI_STATUS_FAIL_LOAD_SYMBOL = 5,      // Failed to load symbol
    AMDSMI_STATUS_DRM_ERROR = 6,             // Error when calling libdrm
    AMDSMI_STATUS_API_FAILED = 7,            // API call failed
    AMDSMI_STATUS_TIMEOUT = 8,               // Timeout in API call
    AMDSMI_STATUS_RETRY = 9,                 // Retry operation
    AMDSMI_STATUS_NO_PERM = 10,              // Permission denied
    AMDSMI_STATUS_INTERRUPT = 11,            // Interrupt occurred during execution
    AMDSMI_STATUS_IO = 12,                   // I/O Error
    AMDSMI_STATUS_ADDRESS_FAULT = 13,        // Bad address
    AMDSMI_STATUS_FILE_ERROR = 14,           // Problem accessing a file
    AMDSMI_STATUS_OUT_OF_RESOURCES = 15,     // Not enough memory
    AMDSMI_STATUS_INTERNAL_EXCEPTION = 16,   // Internal exception caught
    AMDSMI_STATUS_INPUT_OUT_OF_BOUNDS = 17,  // Input out of allowable range
    AMDSMI_STATUS_INIT_ERROR = 18,           // Error during initialization
    AMDSMI_STATUS_REFCOUNT_OVERFLOW = 19,    // Reference counter overflow
    AMDSMI_STATUS_BUSY = 30,                 // Device busy
    AMDSMI_STATUS_NOT_FOUND = 31,            // Device not found
    AMDSMI_STATUS_NOT_INIT = 32,             // Device not initialized
    AMDSMI_STATUS_NO_SLOT = 33,              // No more free slot
    AMDSMI_STATUS_NO_DATA = 40,              // No data found
    AMDSMI_STATUS_INSUFFICIENT_SIZE = 41,    // Not enough resources
    AMDSMI_STATUS_UNEXPECTED_SIZE = 42,      // Unexpected amount of data
    AMDSMI_STATUS_UNEXPECTED_DATA = 43,      // Unexpected data
    AMDSMI_STATUS_MAP_ERROR = 0xFFFFFFFE,    // Internal library error mapping failed
    AMDSMI_STATUS_UNKNOWN_ERROR = 0xFFFFFFFF // Unknown error occurred
} amdsmi_status_t;
```
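
Some of the codes above (`AMDSMI_STATUS_TIMEOUT`, `AMDSMI_STATUS_RETRY`, `AMDSMI_STATUS_BUSY`) suggest that repeating the call may succeed. The sketch below illustrates a bounded retry around any callable that returns a numeric status. It is written under assumptions: treating exactly these three codes as transient is an illustrative choice rather than documented library behavior, and `call_with_retry` is a hypothetical helper that takes the status-returning callable as an argument instead of calling `amdsmi` itself.

```python
import time

# Numeric values taken from the status code enum above.
STATUS_SUCCESS = 0
STATUS_TIMEOUT = 8
STATUS_RETRY = 9
STATUS_BUSY = 30

# Assumption: these are the codes worth retrying; adjust for your deployment.
TRANSIENT_STATUSES = {STATUS_TIMEOUT, STATUS_RETRY, STATUS_BUSY}


def call_with_retry(func, attempts=3, delay_s=0.0):
    """Call func() up to `attempts` times while it returns a transient status.

    Returns the last status observed, which may still be a transient one
    if every attempt failed.
    """
    status = func()
    for _ in range(attempts - 1):
        if status not in TRANSIENT_STATUSES:
            break
        time.sleep(delay_s)
        status = func()
    return status


if __name__ == "__main__":
    # Simulated call that is busy once, asks for a retry, then succeeds.
    outcomes = iter([STATUS_BUSY, STATUS_RETRY, STATUS_SUCCESS])
    print(call_with_retry(lambda: next(outcomes)))
```

Non-transient codes such as `AMDSMI_STATUS_INVAL` are returned immediately, since repeating a call with the same invalid arguments would not help.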

## Error Handling Best Practices

1. **Always Check Return Values**: Every AMD SMI function returns a status code; check it before using any output parameter.

2. **Use Status Code Translation**: Convert error codes to strings with `amdsmi_status_code_to_string` for clearer error reports.

3. **Monitor ECC Errors**: Check ECC error counts regularly, especially in production environments.

4. **Handle Uncorrectable Errors**: Uncorrectable ECC errors may indicate serious hardware issues, so treat any nonzero count as a reason to investigate the device.

5. **Log Error Information**: Keep logs of error counts and types for trend analysis.

6. **Reset XGMI Errors**: Clear XGMI errors after handling them so that later status checks do not report stale errors.
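
Practices 3 and 5 above usually come down to comparing successive readings, because the raw counts alone do not show how fast errors are accumulating. The sketch below computes the increase between two count dictionaries shaped like the ones the Python ECC count functions are documented to return. The helper name `ecc_delta` is hypothetical, and treating a decreasing counter as a reset (for example after a driver reload) is an assumption about how the counters behave, not documented behavior.

```python
ECC_KEYS = ("correctable_err", "uncorrectable_err")


def ecc_delta(previous, current):
    """Return the per-key increase between two ECC count readings.

    Assumption: if a counter went down, it was reset between readings,
    so the current value is reported as the increase.
    """
    delta = {}
    for key in ECC_KEYS:
        before = previous.get(key, 0)
        now = current.get(key, 0)
        delta[key] = now - before if now >= before else now
    return delta


def format_log_line(device_label, delta):
    """Build a one-line log message suitable for trend analysis."""
    return (f"{device_label}: +{delta['correctable_err']} correctable, "
            f"+{delta['uncorrectable_err']} uncorrectable")


if __name__ == "__main__":
    earlier = {"correctable_err": 10, "uncorrectable_err": 0}
    later = {"correctable_err": 14, "uncorrectable_err": 1}
    print(format_log_line("gpu0", ecc_delta(earlier, later)))
```

In a real monitor the two dictionaries would come from consecutive polls of the same GPU, with the previous reading stored between polls.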

## Important Notes

1. **ECC Availability**: ECC features depend on the GPU model and may not be available on all devices.

2. **Error Types**:
   - **Correctable Errors**: Automatically fixed by ECC, but a rising count can indicate a developing hardware problem
   - **Uncorrectable Errors**: Cannot be corrected and may cause data corruption or crashes

3. **GPU Blocks**: Different GPU blocks have different error reporting capabilities.

4. **RAS Features**: Reliability, Availability, and Serviceability features provide enterprise-grade error management.

5. **XGMI Errors**: Only relevant in multi-GPU systems with Infinity Fabric connections.

6. **Virtual Machine Limitations**: Some RAS features may not be available in virtualized environments.

7. **Production Monitoring**: ECC error monitoring is critical for data center and HPC environments.

8. **Error Thresholds**: Establish monitoring thresholds for correctable errors to predict hardware issues.
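
Notes 2 and 8 can be combined into a simple severity rule: any uncorrectable error is critical, and correctable errors become a warning once they pass a chosen threshold. The sketch below illustrates that rule on a count dictionary with the documented keys. The function name `evaluate_ecc_health` and the default threshold of 100 are placeholders for illustration; an appropriate threshold depends on the hardware and workload.

```python
def evaluate_ecc_health(counts, correctable_warn=100):
    """Classify an ECC count reading as 'ok', 'warning', or 'critical'.

    counts: dict with 'correctable_err' and 'uncorrectable_err' keys.
    correctable_warn: placeholder threshold, not a vendor recommendation.
    """
    if counts["uncorrectable_err"] > 0:
        # Uncorrectable errors may cause data corruption, so escalate at once.
        return "critical"
    if counts["correctable_err"] >= correctable_warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(evaluate_ecc_health({"correctable_err": 3, "uncorrectable_err": 0}))
    print(evaluate_ecc_health({"correctable_err": 250, "uncorrectable_err": 0}))
    print(evaluate_ecc_health({"correctable_err": 0, "uncorrectable_err": 2}))
```

Checking uncorrectable errors first ensures a device with both kinds of error is never reported as a mere warning.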