Functions for monitoring GPU errors, ECC (Error Correcting Code) status, and RAS (Reliability, Availability, and Serviceability) features. These functions provide access to hardware error counters, ECC memory protection status, RAS feature configuration, and CPER (Common Platform Error Record) entries for comprehensive GPU health monitoring and reliability analysis.
Query total ECC error counts across all GPU blocks.
def amdsmi_get_gpu_total_ecc_count(
processor_handle: processor_handle
) -> Dict[str, int]:
"""
Get total ECC error count across all GPU blocks.
Retrieves cumulative ECC error counts for the entire GPU, including correctable,
uncorrectable, and deferred errors. This provides an overall view of memory
reliability and error rates across all ECC-protected blocks.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
Returns:
- Dict[str, int]: Dictionary containing total error counts with keys:
- 'correctable_count' (int): Total number of correctable ECC errors detected
and fixed by hardware. These errors don't affect data integrity.
- 'uncorrectable_count' (int): Total number of uncorrectable ECC errors
detected. These errors indicate data corruption and potential reliability issues.
- 'deferred_count' (int): Total number of deferred errors. These are
uncorrectable errors that haven't yet caused an immediate failure.
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve ECC counts or ECC not supported
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Get total ECC error counts
ecc_counts = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
print(f"Total ECC Errors:")
print(f" Correctable: {ecc_counts['correctable_count']}")
print(f" Uncorrectable: {ecc_counts['uncorrectable_count']}")
print(f" Deferred: {ecc_counts['deferred_count']}")
# Check for reliability issues
if ecc_counts['uncorrectable_count'] > 0:
print("WARNING: Uncorrectable ECC errors detected!")
if ecc_counts['correctable_count'] > 1000:
print("NOTICE: High correctable error rate may indicate degradation")
finally:
amdsmi.amdsmi_shut_down()
```
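Because the reported totals are cumulative, single readings are less informative than their growth over time. The sketch below is illustrative and not part of the amdsmi API; it appends a timestamped snapshot per GPU to a CSV file (the file name and cron-style usage are assumptions) so the counters can be trended externally:
```python
# Illustrative sketch, not part of the amdsmi API: run periodically (e.g. from
# cron) to append a timestamped snapshot of each GPU's ECC totals to a CSV file
# so error growth can be trended over time. The file name is an example only.
import csv
import time

import amdsmi

def append_ecc_snapshot(csv_path: str = "ecc_totals.csv") -> None:
    amdsmi.amdsmi_init()
    try:
        with open(csv_path, "a", newline="") as f:
            writer = csv.writer(f)
            for idx, gpu in enumerate(amdsmi.amdsmi_get_processor_handles()):
                counts = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
                writer.writerow([
                    int(time.time()),                 # sample timestamp
                    idx,                              # GPU index
                    counts['correctable_count'],
                    counts['uncorrectable_count'],
                    counts['deferred_count'],
                ])
    finally:
        amdsmi.amdsmi_shut_down()

if __name__ == "__main__":
    append_ecc_snapshot()
```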
"""Query ECC error counts for a specific GPU block.
def amdsmi_get_gpu_ecc_count(
processor_handle: processor_handle,
block: AmdSmiGpuBlock
) -> Dict[str, int]:
"""
Get ECC error count for a specific GPU block.
Retrieves ECC error counts for a particular hardware block on the GPU,
allowing pinpointing of which component is experiencing errors. This is
useful for diagnosing specific hardware issues.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
- block (AmdSmiGpuBlock): The GPU block to query for errors. Common blocks:
- UMC: Unified Memory Controller (memory interface)
- SDMA: System DMA engine
- GFX: Graphics engine
- MMHUB: Multimedia hub
- HDP: Host Data Path
- XGMI_WAFL: XGMI interconnect
- And other blocks (see AmdSmiGpuBlock enum)
Returns:
- Dict[str, int]: Dictionary containing error counts for the specified block:
- 'correctable_count' (int): Correctable errors in this block
- 'uncorrectable_count' (int): Uncorrectable errors in this block
- 'deferred_count' (int): Deferred errors in this block
Raises:
- AmdSmiParameterException: If processor_handle or block is invalid
- AmdSmiLibraryException: If unable to retrieve ECC counts for the block
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Check UMC (memory controller) for errors
umc_errors = amdsmi.amdsmi_get_gpu_ecc_count(
gpu,
amdsmi.AmdSmiGpuBlock.UMC
)
print(f"UMC (Memory Controller) Errors:")
print(f" Correctable: {umc_errors['correctable_count']}")
print(f" Uncorrectable: {umc_errors['uncorrectable_count']}")
# Check graphics engine for errors
gfx_errors = amdsmi.amdsmi_get_gpu_ecc_count(
gpu,
amdsmi.AmdSmiGpuBlock.GFX
)
print(f"\nGFX (Graphics Engine) Errors:")
print(f" Correctable: {gfx_errors['correctable_count']}")
print(f" Uncorrectable: {gfx_errors['uncorrectable_count']}")
# Scan all blocks for errors
print("\nAll GPU Block Error Summary:")
for block in amdsmi.AmdSmiGpuBlock:
if block.name in ["INVALID", "RESERVED"]:
continue
try:
errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, block)
total = (errors['correctable_count'] +
errors['uncorrectable_count'] +
errors['deferred_count'])
if total > 0:
print(f" {block.name}: {total} errors")
except amdsmi.AmdSmiLibraryException:
pass # Block not available or not monitored
finally:
amdsmi.amdsmi_shut_down()
```
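When a GPU is accumulating errors, it often helps to diff two per-block snapshots taken some time apart rather than read lifetime totals. The helper names, snapshot format, and polling interval below are illustrative assumptions, not part of the amdsmi API:
```python
# Illustrative sketch, not part of the amdsmi API: snapshot per-block ECC counts
# twice and report which blocks accumulated new errors in between. The polling
# interval is an arbitrary example value.
import time
from typing import Dict

import amdsmi

COUNT_KEYS = ('correctable_count', 'uncorrectable_count', 'deferred_count')

def snapshot_block_errors(gpu_handle) -> Dict[str, Dict[str, int]]:
    """Collect ECC counts for every monitorable block on one GPU."""
    snapshot = {}
    for block in amdsmi.AmdSmiGpuBlock:
        if block.name in ("INVALID", "RESERVED"):
            continue
        try:
            snapshot[block.name] = amdsmi.amdsmi_get_gpu_ecc_count(gpu_handle, block)
        except amdsmi.AmdSmiLibraryException:
            pass  # block not available or not monitored on this GPU
    return snapshot

def report_growth(before: Dict[str, Dict[str, int]], after: Dict[str, Dict[str, int]]) -> None:
    """Print only the blocks whose counters increased between snapshots."""
    for name, new_counts in after.items():
        old_counts = before.get(name, {})
        for key in COUNT_KEYS:
            delta = new_counts[key] - old_counts.get(key, 0)
            if delta > 0:
                print(f"{name}: +{delta} {key}")

if __name__ == "__main__":
    amdsmi.amdsmi_init()
    try:
        gpu = amdsmi.amdsmi_get_processor_handles()[0]
        first = snapshot_block_errors(gpu)
        time.sleep(300)  # example interval; tune to your monitoring cadence
        report_growth(first, snapshot_block_errors(gpu))
    finally:
        amdsmi.amdsmi_shut_down()
```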
"""Check which GPU blocks have ECC protection enabled.
def amdsmi_get_gpu_ecc_enabled(
processor_handle: processor_handle
) -> int:
"""
Get bitmask indicating which GPU blocks have ECC enabled.
Returns a 64-bit bitmask where each bit represents whether ECC protection
is enabled for a corresponding GPU block. This allows checking the ECC
configuration across all GPU components.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
Returns:
- int: 64-bit bitmask where each bit corresponds to a GPU block.
  A bit value of 1 indicates ECC is enabled for that block. The bits
  match the AmdSmiGpuBlock enum members, whose values are themselves
  bit flags, so a block can be tested with `bitmask & block.value`.
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve ECC enabled status
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Get ECC enabled bitmask
ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)
print(f"ECC Enabled Bitmask: 0x{ecc_bitmask:016X}")
# Check if specific blocks have ECC enabled
print("\nECC Status by Block:")
for block in amdsmi.AmdSmiGpuBlock:
if block.name in ["INVALID", "RESERVED"]:
continue
        # Check whether the bit flag for this block is set (enum values are bit flags)
        if ecc_bitmask & block.value:
print(f" {block.name}: ENABLED")
else:
print(f" {block.name}: DISABLED")
# Check if any ECC is enabled
if ecc_bitmask == 0:
print("\nWARNING: No ECC protection enabled on this GPU")
else:
print(f"\nECC protection is enabled on {bin(ecc_bitmask).count('1')} blocks")
finally:
amdsmi.amdsmi_shut_down()
```
"""Query ECC status/state for a specific GPU block.
def amdsmi_get_gpu_ecc_status(
processor_handle: processor_handle,
block: AmdSmiGpuBlock
) -> AmdSmiRasErrState:
"""
Get ECC status for a specific GPU block.
Retrieves the current RAS error state for a particular GPU block, indicating
what type of error correction is active or if errors have been detected.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
- block (AmdSmiGpuBlock): The GPU block to query for ECC status
Returns:
- AmdSmiRasErrState: The current RAS error state for the block:
- NONE: No RAS support or no errors
- DISABLED: RAS/ECC is disabled for this block
- PARITY: Parity checking is enabled
- SING_C: Single-bit correction enabled
- MULT_UC: Multi-bit uncorrectable error detected
- POISON: Poison bit error correction enabled
- ENABLED: RAS/ECC is enabled (general)
- INVALID: Invalid state
Raises:
- AmdSmiParameterException: If processor_handle or block is invalid
- AmdSmiLibraryException: If unable to retrieve ECC status
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Check ECC status for memory controller
umc_status = amdsmi.amdsmi_get_gpu_ecc_status(
gpu,
amdsmi.AmdSmiGpuBlock.UMC
)
print(f"UMC ECC Status: {umc_status.name}")
# Check status across all blocks
print("\nECC Status by Block:")
for block in amdsmi.AmdSmiGpuBlock:
if block.name in ["INVALID", "RESERVED"]:
continue
try:
status = amdsmi.amdsmi_get_gpu_ecc_status(gpu, block)
if status != amdsmi.AmdSmiRasErrState.NONE:
print(f" {block.name}: {status.name}")
# Highlight concerning states
if status == amdsmi.AmdSmiRasErrState.MULT_UC:
print(f" *** CRITICAL: Multi-bit uncorrectable error! ***")
elif status == amdsmi.AmdSmiRasErrState.DISABLED:
print(f" Warning: ECC disabled")
except amdsmi.AmdSmiLibraryException:
pass # Block not available
finally:
amdsmi.amdsmi_shut_down()
```
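Alerting pipelines usually want the affected block names as data rather than printed text. The helper below is an illustrative sketch (not part of the amdsmi API) that collects the blocks currently reporting a given RAS state:
```python
# Illustrative helper, not part of the amdsmi API: collect the blocks whose
# current RAS/ECC state matches a state of interest (e.g. MULT_UC).
from typing import List

import amdsmi

def blocks_in_state(gpu_handle, wanted_state) -> List[str]:
    matches = []
    for block in amdsmi.AmdSmiGpuBlock:
        if block.name in ("INVALID", "RESERVED"):
            continue
        try:
            state = amdsmi.amdsmi_get_gpu_ecc_status(gpu_handle, block)
        except amdsmi.AmdSmiLibraryException:
            continue  # block not available on this GPU
        if state == wanted_state:
            matches.append(block.name)
    return matches

if __name__ == "__main__":
    amdsmi.amdsmi_init()
    try:
        gpu = amdsmi.amdsmi_get_processor_handles()[0]
        bad = blocks_in_state(gpu, amdsmi.AmdSmiRasErrState.MULT_UC)
        if bad:
            print(f"Multi-bit uncorrectable state on: {', '.join(bad)}")
    finally:
        amdsmi.amdsmi_shut_down()
```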
"""Query RAS feature configuration and capabilities.
def amdsmi_get_gpu_ras_feature_info(
processor_handle: processor_handle
) -> Dict[str, Any]:
"""
Get RAS (Reliability, Availability, Serviceability) feature information.
Retrieves information about the GPU's RAS capabilities, including the
EEPROM version and supported error correction schemas. This provides
insight into the GPU's error detection and correction capabilities.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
Returns:
- Dict[str, Any]: Dictionary containing RAS feature information:
- 'eeprom_version' (str): RAS EEPROM version as hex string
- 'parity_schema' (bool): True if parity checking is supported
- 'single_bit_schema' (bool): True if single-bit error correction supported
- 'double_bit_schema' (bool): True if double-bit error detection supported
- 'poison_schema' (bool): True if poison bit error handling supported
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve RAS feature info
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Get RAS feature information
ras_info = amdsmi.amdsmi_get_gpu_ras_feature_info(gpu)
print("RAS Feature Information:")
print(f" EEPROM Version: {ras_info['eeprom_version']}")
print(f"\nSupported Error Correction Schemas:")
print(f" Parity Checking: {ras_info['parity_schema']}")
print(f" Single-bit Correction: {ras_info['single_bit_schema']}")
print(f" Double-bit Detection: {ras_info['double_bit_schema']}")
print(f" Poison Bit Handling: {ras_info['poison_schema']}")
# Determine overall RAS capability level
if ras_info['double_bit_schema'] and ras_info['single_bit_schema']:
print("\nRAS Capability: Advanced (SECDED - Single Error Correction, Double Error Detection)")
elif ras_info['single_bit_schema']:
print("\nRAS Capability: Basic (Single-bit correction)")
elif ras_info['parity_schema']:
print("\nRAS Capability: Minimal (Parity only)")
else:
print("\nRAS Capability: Limited or not configured")
finally:
amdsmi.amdsmi_shut_down()
```
"""Query which RAS features are enabled for each GPU block.
def amdsmi_get_gpu_ras_block_features_enabled(
processor_handle: processor_handle
) -> List[Dict[str, Any]]:
"""
Get RAS features enabled status for all GPU blocks.
Retrieves the RAS error state for each GPU block, providing a comprehensive
view of which blocks have error protection enabled and their current status.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
Returns:
- List[Dict[str, Any]]: List of dictionaries, one per GPU block, containing:
- 'block' (str): Name of the GPU block (e.g., "UMC", "SDMA", "GFX")
- 'status' (str): RAS error state name (e.g., "ENABLED", "DISABLED", "SING_C")
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve RAS block features
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Get RAS features for all blocks
ras_blocks = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(gpu)
print("RAS Features by GPU Block:")
print(f"{'Block':<15} {'Status':<20}")
print("-" * 35)
enabled_count = 0
disabled_count = 0
for block_info in ras_blocks:
block_name = block_info['block']
status = block_info['status']
print(f"{block_name:<15} {status:<20}")
if status == "ENABLED" or status == "SING_C":
enabled_count += 1
elif status == "DISABLED":
disabled_count += 1
print("\nSummary:")
print(f" Blocks with RAS enabled: {enabled_count}")
print(f" Blocks with RAS disabled: {disabled_count}")
print(f" Total blocks: {len(ras_blocks)}")
# Filter for critical blocks with issues
print("\nCritical Blocks Status:")
critical_blocks = ["UMC", "GFX", "SDMA", "HDP"]
for block_info in ras_blocks:
if block_info['block'] in critical_blocks:
status = block_info['status']
status_indicator = "✓" if status != "DISABLED" else "✗"
print(f" {status_indicator} {block_info['block']}: {status}")
finally:
amdsmi.amdsmi_shut_down()
```
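For programmatic checks it can be convenient to fold the returned list into a name-to-state mapping. The helpers below are illustrative (not part of the amdsmi API), and the "critical" block set is an example policy drawn from the blocks highlighted above:
```python
# Illustrative helpers, not part of the amdsmi API: fold the per-block list into
# a name -> state mapping and flag "critical" blocks without protection. The
# critical-block set below is an example policy, not an amdsmi definition.
from typing import Dict, List

import amdsmi

CRITICAL_BLOCKS = ("UMC", "GFX", "SDMA", "HDP")

def ras_state_by_block(gpu_handle) -> Dict[str, str]:
    entries = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(gpu_handle)
    return {entry['block']: entry['status'] for entry in entries}

def unprotected_critical_blocks(gpu_handle) -> List[str]:
    states = ras_state_by_block(gpu_handle)
    return [name for name in CRITICAL_BLOCKS if states.get(name) == "DISABLED"]

if __name__ == "__main__":
    amdsmi.amdsmi_init()
    try:
        gpu = amdsmi.amdsmi_get_processor_handles()[0]
        missing = unprotected_critical_blocks(gpu)
        if missing:
            print(f"RAS disabled on critical blocks: {', '.join(missing)}")
        else:
            print("All critical blocks report RAS protection")
    finally:
        amdsmi.amdsmi_shut_down()
```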
"""Retrieve CPER (Common Platform Error Record) entries.
def amdsmi_get_gpu_cper_entries(
processor_handle: processor_handle,
severity_mask: int,
buffer_size: int = 4 * 1048576,
cursor: int = 0
) -> Tuple[Dict[str, Any], int, List[Dict[str, Any]], int]:
"""
Get CPER (Common Platform Error Record) entries from the GPU.
CPER is a standardized error record format used in UEFI and ACPI for
reporting hardware errors. This function retrieves raw CPER records
from the GPU, which can be parsed to extract detailed error information.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
- severity_mask (int): Bitmask to filter CPER entries by severity level.
Combine severity levels using bitwise OR:
- 0x01: Recoverable errors
- 0x02: Fatal errors
- 0x04: Corrected errors
- 0x08: Informational records
- buffer_size (int, optional): Size of buffer for CPER data in bytes.
Default is 4MB (4 * 1048576). Increase if expecting many records.
- cursor (int, optional): Starting position for reading records, used for
pagination. Default is 0 (start from beginning).
Returns:
- Tuple containing:
1. entries (Dict[str, Any]): Placeholder dictionary for entry metadata
2. buffer_size_used (int): Actual number of bytes used in buffer
3. cper_data (List[Dict[str, Any]]): List of CPER records, each containing:
- 'bytes' (List[int]): Raw CPER record as list of bytes
- 'size' (int): Size of this CPER record in bytes
4. next_cursor (int): Cursor position for next read (for pagination)
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve CPER entries
Note:
- Returns AMDSMI_STATUS_SUCCESS when all entries retrieved
- Returns AMDSMI_STATUS_MORE_DATA if more entries available (use cursor for pagination)
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
# Get all CPER entries (all severity levels)
severity_mask = 0x0F # All severities
entries, buf_size, cper_data, next_cursor = amdsmi.amdsmi_get_gpu_cper_entries(
gpu,
severity_mask
)
print(f"Retrieved {len(cper_data)} CPER entries")
print(f"Buffer size used: {buf_size} bytes")
# Display CPER record information
for i, record in enumerate(cper_data):
print(f"\nCPER Record {i}:")
print(f" Size: {record['size']} bytes")
print(f" Data: {len(record['bytes'])} bytes")
# First few bytes often contain header info
if len(record['bytes']) >= 16:
header_bytes = record['bytes'][:16]
print(f" Header: {' '.join(f'{b:02x}' for b in header_bytes)}")
# Get only corrected errors
print("\n--- Corrected Errors Only ---")
corrected_mask = 0x04
entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
gpu,
corrected_mask
)
print(f"Corrected error records: {len(cper_data)}")
# Get only fatal errors
print("\n--- Fatal Errors Only ---")
fatal_mask = 0x02
entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
gpu,
fatal_mask
)
print(f"Fatal error records: {len(cper_data)}")
if len(cper_data) > 0:
print("WARNING: Fatal errors detected!")
finally:
amdsmi.amdsmi_shut_down()
```
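The raw bytes above follow the UEFI Common Platform Error Record layout, so the fixed record header can be decoded with the standard struct module. The sketch below is illustrative, not part of the amdsmi API; the field offsets (signature at byte 0, severity at 12, record length at 20) come from the UEFI CPER specification and should be verified against the revision your platform reports:
```python
# Illustrative sketch, not part of the amdsmi API: decode the fixed record
# header that the UEFI Common Platform Error Record format places at the start
# of each record. Offsets follow the UEFI CPER layout; confirm them against the
# spec revision your firmware uses before relying on them.
import struct

import amdsmi

SEVERITY_NAMES = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}

def parse_cper_header(raw: bytes):
    """Return (signature, severity_name, record_length) or None if too short."""
    if len(raw) < 24:
        return None
    signature = raw[0:4].decode("ascii", errors="replace")   # expected "CPER"
    severity, = struct.unpack_from("<I", raw, 12)             # Error Severity field
    record_length, = struct.unpack_from("<I", raw, 20)        # Record Length field
    return signature, SEVERITY_NAMES.get(severity, f"Unknown({severity})"), record_length

if __name__ == "__main__":
    amdsmi.amdsmi_init()
    try:
        gpu = amdsmi.amdsmi_get_processor_handles()[0]
        _, _, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(gpu, 0x0F)
        for record in cper_data:
            parsed = parse_cper_header(bytes(record['bytes']))
            if parsed:
                sig, sev, length = parsed
                print(f"{sig} record: severity={sev}, length={length} bytes")
    finally:
        amdsmi.amdsmi_shut_down()
```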
"""class AmdSmiGpuBlock(IntEnum):
"""
GPU hardware block identifiers for error reporting and monitoring.
These enums identify different functional blocks within the GPU that
can be monitored for errors, ECC status, and RAS features.
"""
INVALID = ... # Invalid block identifier
UMC = ... # Unified Memory Controller (memory interface)
SDMA = ... # System DMA engine
GFX = ... # Graphics engine (compute/graphics shader cores)
MMHUB = ... # Multimedia hub
ATHUB = ... # Address Translation Hub
PCIE_BIF = ... # PCIe Bus Interface
HDP = ... # Host Data Path
XGMI_WAFL = ... # XGMI interconnect (GPU-to-GPU communication)
DF = ... # Data Fabric
SMN = ... # System Management Network
SEM = ... # Sensor Management
MP0 = ... # Management Processor 0
MP1 = ... # Management Processor 1
FUSE = ... # Fuse controller
MCA = ... # Machine Check Architecture
VCN = ... # Video Core Next (video encode/decode)
JPEG = ... # JPEG engine
IH = ... # Interrupt Handler
MPIO = ... # Multi-Purpose I/O
    RESERVED = ...     # Reserved block

class AmdSmiRasErrState(IntEnum):
"""
RAS (Reliability, Availability, Serviceability) error states.
These states indicate the current error detection and correction
status for a GPU block.
"""
NONE = ... # No RAS support or no errors detected
DISABLED = ... # RAS/ECC is disabled for this block
PARITY = ... # Parity checking is enabled (basic error detection)
SING_C = ... # Single-bit error correction enabled (SEC)
MULT_UC = ... # Multi-bit uncorrectable error detected (critical)
POISON = ... # Poison bit error correction enabled
ENABLED = ... # RAS/ECC is enabled (general state)
    INVALID = ...      # Invalid state

Monitor ECC error counts across all GPU blocks:

```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
for i, gpu in enumerate(devices):
print(f"\nGPU {i} - ECC Error Summary:")
# Get total error counts
try:
total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
print(f" Total Correctable Errors: {total_errors['correctable_count']}")
print(f" Total Uncorrectable Errors: {total_errors['uncorrectable_count']}")
print(f" Total Deferred Errors: {total_errors['deferred_count']}")
# Check for concerning error levels
if total_errors['uncorrectable_count'] > 0:
print(" *** WARNING: Uncorrectable errors detected! ***")
if total_errors['correctable_count'] > 10000:
print(" *** NOTICE: High correctable error rate ***")
except amdsmi.AmdSmiLibraryException as e:
print(f" ECC not supported or error retrieving counts: {e}")
finally:
    amdsmi.amdsmi_shut_down()
```

Scan all GPU blocks for errors and identify problem areas:

```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
print("GPU Block Error Analysis:")
print(f"{'Block':<15} {'Correctable':<12} {'Uncorrectable':<15} {'Deferred':<10} {'Total':<10}")
print("-" * 70)
error_blocks = []
for block in amdsmi.AmdSmiGpuBlock:
if block.name in ["INVALID", "RESERVED"]:
continue
try:
errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, block)
corr = errors['correctable_count']
uncorr = errors['uncorrectable_count']
defer = errors['deferred_count']
total = corr + uncorr + defer
# Only display blocks with errors
if total > 0:
print(f"{block.name:<15} {corr:<12} {uncorr:<15} {defer:<10} {total:<10}")
if uncorr > 0:
error_blocks.append((block.name, uncorr))
except amdsmi.AmdSmiLibraryException:
pass # Block not available
# Highlight critical blocks with uncorrectable errors
if error_blocks:
print("\nCRITICAL: Blocks with uncorrectable errors:")
for block_name, count in sorted(error_blocks, key=lambda x: x[1], reverse=True):
print(f" {block_name}: {count} uncorrectable errors")
else:
print("\nNo uncorrectable errors detected.")
finally:
    amdsmi.amdsmi_shut_down()
```

Check RAS capabilities and configuration:

```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
print("=== RAS Feature Information ===\n")
# Get RAS feature info
ras_info = amdsmi.amdsmi_get_gpu_ras_feature_info(gpu)
print(f"EEPROM Version: {ras_info['eeprom_version']}")
print("\nSupported Error Correction Schemas:")
print(f" Parity: {'Yes' if ras_info['parity_schema'] else 'No'}")
print(f" Single-bit (SEC): {'Yes' if ras_info['single_bit_schema'] else 'No'}")
print(f" Double-bit (DED): {'Yes' if ras_info['double_bit_schema'] else 'No'}")
print(f" Poison: {'Yes' if ras_info['poison_schema'] else 'No'}")
# Determine ECC capability level
if ras_info['single_bit_schema'] and ras_info['double_bit_schema']:
capability = "SECDED (Single Error Correction, Double Error Detection)"
elif ras_info['single_bit_schema']:
capability = "SEC (Single Error Correction)"
elif ras_info['parity_schema']:
capability = "Parity Only"
else:
capability = "None or Unknown"
print(f"\nECC Capability Level: {capability}")
# Get which blocks have ECC enabled
print("\n=== ECC Enabled Status ===\n")
ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)
enabled_blocks = []
for block in amdsmi.AmdSmiGpuBlock:
if block.name in ["INVALID", "RESERVED"]:
continue
        if ecc_bitmask & block.value:  # enum values are bit flags
enabled_blocks.append(block.name)
if enabled_blocks:
print(f"ECC Enabled on {len(enabled_blocks)} blocks:")
for block in enabled_blocks:
print(f" - {block}")
else:
print("No blocks have ECC enabled")
# Get detailed RAS status per block
print("\n=== RAS Block Features ===\n")
ras_blocks = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(gpu)
for block_info in ras_blocks:
block = block_info['block']
status = block_info['status']
# Highlight important blocks
if block in ["UMC", "GFX", "SDMA", "HDP"]:
marker = "***" if status == "DISABLED" else " "
print(f"{marker} {block:<15} {status}")
finally:
    amdsmi.amdsmi_shut_down()
```

Comprehensive ECC health monitoring function:

```python
import amdsmi
def check_ecc_health(gpu_handle, gpu_id):
"""Perform comprehensive ECC health check."""
print(f"\n{'='*60}")
print(f"GPU {gpu_id} - ECC Health Check")
print(f"{'='*60}\n")
health_ok = True
# Check if ECC is enabled
try:
ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu_handle)
if ecc_bitmask == 0:
print("⚠ WARNING: ECC is not enabled on any GPU blocks")
health_ok = False
else:
enabled_count = bin(ecc_bitmask).count('1')
print(f"✓ ECC enabled on {enabled_count} blocks")
except amdsmi.AmdSmiLibraryException as e:
print(f"✗ Unable to check ECC enabled status: {e}")
health_ok = False
# Check total error counts
try:
total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu_handle)
print(f"\nTotal Error Counts:")
print(f" Correctable: {total_errors['correctable_count']}")
print(f" Uncorrectable: {total_errors['uncorrectable_count']}")
print(f" Deferred: {total_errors['deferred_count']}")
# Check for critical issues
if total_errors['uncorrectable_count'] > 0:
print(f"\n✗ CRITICAL: {total_errors['uncorrectable_count']} uncorrectable errors detected!")
health_ok = False
# Check for high correctable error rate
if total_errors['correctable_count'] > 10000:
print(f"\n⚠ WARNING: High correctable error rate ({total_errors['correctable_count']})")
print(" This may indicate memory degradation")
health_ok = False
elif total_errors['correctable_count'] > 0:
print(f"\nℹ {total_errors['correctable_count']} correctable errors (within normal range)")
except amdsmi.AmdSmiLibraryException as e:
print(f"✗ Unable to retrieve error counts: {e}")
health_ok = False
# Check critical blocks
print(f"\nCritical Block Status:")
critical_blocks = [
amdsmi.AmdSmiGpuBlock.UMC, # Memory controller
amdsmi.AmdSmiGpuBlock.GFX, # Graphics engine
amdsmi.AmdSmiGpuBlock.HDP, # Host data path
amdsmi.AmdSmiGpuBlock.SDMA, # System DMA
]
for block in critical_blocks:
try:
# Check ECC status
status = amdsmi.amdsmi_get_gpu_ecc_status(gpu_handle, block)
# Check error counts
errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu_handle, block)
uncorr = errors['uncorrectable_count']
status_str = f"{block.name:<10} Status: {status.name:<10}"
if uncorr > 0:
print(f" ✗ {status_str} Uncorrectable: {uncorr}")
health_ok = False
elif status == amdsmi.AmdSmiRasErrState.DISABLED:
print(f" ⚠ {status_str} (ECC disabled)")
else:
print(f" ✓ {status_str}")
except amdsmi.AmdSmiLibraryException:
print(f" - {block.name:<10} Not available")
# Overall health verdict
print(f"\n{'='*60}")
if health_ok:
print("✓ ECC Health: GOOD")
else:
print("✗ ECC Health: ISSUES DETECTED - Investigation recommended")
print(f"{'='*60}")
return health_ok
# Example usage
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
all_healthy = True
for i, gpu in enumerate(devices):
is_healthy = check_ecc_health(gpu, i)
all_healthy = all_healthy and is_healthy
if not all_healthy:
print("\n*** Some GPUs have ECC issues - check logs above ***")
finally:
    amdsmi.amdsmi_shut_down()
```

Retrieve and analyze CPER error records:

```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
print("Retrieving CPER Error Records...\n")
# Define severity levels
severity_levels = {
"Recoverable": 0x01,
"Fatal": 0x02,
"Corrected": 0x04,
"Informational": 0x08,
}
# Get all error records
all_severity = 0x0F # All levels
try:
entries, buf_size, cper_data, cursor = amdsmi.amdsmi_get_gpu_cper_entries(
gpu,
all_severity
)
print(f"Total CPER records retrieved: {len(cper_data)}")
print(f"Buffer size used: {buf_size} bytes\n")
if len(cper_data) == 0:
print("No CPER error records found.")
else:
# Analyze each record
for i, record in enumerate(cper_data):
print(f"Record {i}:")
print(f" Size: {record['size']} bytes")
# Display first 32 bytes of record (header region)
if len(record['bytes']) >= 32:
header = record['bytes'][:32]
print(f" Header (first 32 bytes):")
for j in range(0, 32, 16):
hex_str = ' '.join(f'{b:02x}' for b in header[j:j+16])
print(f" {hex_str}")
print()
# Get records by severity level
print("\nRecords by Severity Level:")
for severity_name, severity_mask in severity_levels.items():
try:
entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
gpu,
severity_mask
)
print(f" {severity_name}: {len(cper_data)} records")
if severity_name == "Fatal" and len(cper_data) > 0:
print(" *** CRITICAL: Fatal errors present! ***")
except amdsmi.AmdSmiLibraryException:
print(f" {severity_name}: Unable to retrieve")
except amdsmi.AmdSmiLibraryException as e:
print(f"Error retrieving CPER entries: {e}")
finally:
    amdsmi.amdsmi_shut_down()
```

Monitor ECC errors over time to detect degradation:

```python
import amdsmi
import time
class ECCMonitor:
"""Monitor ECC errors over time."""
def __init__(self, gpu_handle):
self.gpu_handle = gpu_handle
self.history = []
self.baseline = None
def capture_baseline(self):
"""Capture initial error counts as baseline."""
try:
counts = amdsmi.amdsmi_get_gpu_total_ecc_count(self.gpu_handle)
self.baseline = counts.copy()
print(f"Baseline captured: {counts}")
except amdsmi.AmdSmiLibraryException as e:
print(f"Unable to capture baseline: {e}")
def check_errors(self):
"""Check current error counts and compare to baseline."""
try:
counts = amdsmi.amdsmi_get_gpu_total_ecc_count(self.gpu_handle)
if self.baseline:
delta = {
'correctable_count': counts['correctable_count'] - self.baseline['correctable_count'],
'uncorrectable_count': counts['uncorrectable_count'] - self.baseline['uncorrectable_count'],
'deferred_count': counts['deferred_count'] - self.baseline['deferred_count'],
}
else:
delta = None
self.history.append({
'timestamp': time.time(),
'counts': counts,
'delta': delta,
})
return counts, delta
except amdsmi.AmdSmiLibraryException as e:
print(f"Error checking ECC counts: {e}")
return None, None
def get_error_rate(self, window_seconds=60):
"""Calculate error rate over time window."""
now = time.time()
cutoff = now - window_seconds
recent = [h for h in self.history if h['timestamp'] >= cutoff]
if len(recent) < 2:
return None
first = recent[0]['counts']
last = recent[-1]['counts']
duration = recent[-1]['timestamp'] - recent[0]['timestamp']
if duration == 0:
return None
return {
'correctable_per_second': (last['correctable_count'] - first['correctable_count']) / duration,
'uncorrectable_per_second': (last['uncorrectable_count'] - first['uncorrectable_count']) / duration,
'duration': duration,
}
# Example: Monitor for 5 minutes
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
gpu = devices[0]
monitor = ECCMonitor(gpu)
monitor.capture_baseline()
print("\nMonitoring ECC errors for 5 minutes...")
print("Press Ctrl+C to stop early\n")
try:
for i in range(60): # 5 minutes with 5-second intervals
time.sleep(5)
counts, delta = monitor.check_errors()
if counts and delta:
print(f"[{i*5:3d}s] Correctable: +{delta['correctable_count']:3d} "
f"Uncorrectable: +{delta['uncorrectable_count']:3d} "
f"Deferred: +{delta['deferred_count']:3d}")
# Alert on new uncorrectable errors
if delta['uncorrectable_count'] > 0:
print(f" *** ALERT: New uncorrectable errors detected! ***")
# Show error rate every 60 seconds
if (i + 1) % 12 == 0: # Every 60 seconds
rate = monitor.get_error_rate(60)
if rate:
print(f"\n Error Rate (last 60s):")
print(f" Correctable: {rate['correctable_per_second']:.4f} errors/sec")
print(f" Uncorrectable: {rate['uncorrectable_per_second']:.4f} errors/sec\n")
except KeyboardInterrupt:
print("\nMonitoring stopped by user")
# Final summary
print("\n=== Monitoring Summary ===")
if monitor.baseline and len(monitor.history) > 0:
final = monitor.history[-1]['counts']
print(f"Initial errors: {monitor.baseline}")
print(f"Final errors: {final}")
print(f"\nTotal new errors:")
print(f" Correctable: {final['correctable_count'] - monitor.baseline['correctable_count']}")
print(f" Uncorrectable: {final['uncorrectable_count'] - monitor.baseline['uncorrectable_count']}")
print(f" Deferred: {final['deferred_count'] - monitor.baseline['deferred_count']}")
finally:
    amdsmi.amdsmi_shut_down()
```

Critical blocks to monitor:

- UMC: Unified Memory Controller (memory interface)
- GFX: Graphics engine (compute/graphics shader cores)
- SDMA: System DMA engine
- HDP: Host Data Path

Query and reset XGMI link error status.
def amdsmi_gpu_xgmi_error_status(processor_handle: processor_handle) -> AmdSmiXgmiStatus:
"""
Get XGMI error status for a GPU.
Retrieves the current XGMI (AMD Infinity Fabric) link error status. XGMI is
the high-speed interconnect used for GPU-to-GPU communication in multi-GPU
systems. This function helps detect link errors that may affect communication
between GPUs.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device to query
Returns:
- AmdSmiXgmiStatus: XGMI status enumeration value indicating error state
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to retrieve XGMI error status
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
# Check XGMI error status for each GPU
print("XGMI Error Status:")
for i, gpu in enumerate(devices):
try:
xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
print(f" GPU {i}: {xgmi_status}")
# Check for errors (non-zero status typically indicates errors)
if xgmi_status != 0:
print(f" WARNING: GPU {i} has XGMI errors!")
except amdsmi.AmdSmiLibraryException as e:
print(f" GPU {i}: XGMI status not available")
finally:
amdsmi.amdsmi_shut_down()
```
Note:
- Only applicable to systems with multiple GPUs connected via XGMI
- Returns error status for the XGMI links connected to this GPU
- Non-zero values typically indicate link errors or degraded performance
"""
def amdsmi_reset_gpu_xgmi_error(processor_handle: processor_handle) -> None:
"""
Reset XGMI error counters for a GPU.
Clears accumulated XGMI error counters, allowing fresh error monitoring.
This is useful after diagnosing and addressing XGMI link issues, or for
establishing a clean baseline for error monitoring.
Parameters:
- processor_handle (processor_handle): Handle for the GPU device
Returns:
- None
Raises:
- AmdSmiParameterException: If processor_handle is invalid
- AmdSmiLibraryException: If unable to reset XGMI errors (may require root privileges)
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
devices = amdsmi.amdsmi_get_processor_handles()
for i, gpu in enumerate(devices):
try:
# Check current XGMI error status
xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
print(f"GPU {i} XGMI status: {xgmi_status}")
if xgmi_status != 0:
print(f" Resetting XGMI errors on GPU {i}...")
amdsmi.amdsmi_reset_gpu_xgmi_error(gpu)
# Verify errors were cleared
new_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
print(f" After reset: {new_status}")
except amdsmi.AmdSmiLibraryException as e:
print(f" GPU {i}: Cannot reset XGMI errors")
finally:
amdsmi.amdsmi_shut_down()
```
Note:
- Typically requires root/administrator privileges
- Only applicable to systems with XGMI-connected GPUs
- Resets error counters but does not fix underlying hardware issues
- Use after addressing physical link problems or for baseline establishment
"""amdsmi_get_gpu_memory_reserved_pages(): Shows memory pages retired due to errorsamdsmi_get_gpu_bad_page_info(): Lists specific bad memory pagesamdsmi_get_gpu_metrics_info(): May include error-related metricsamdsmi_get_violation_status(): Reports thermal or power violations that may correlate with errorsThese functions are particularly important for: