GPU Error and RAS Monitoring

Functions for monitoring GPU errors, ECC (Error Correcting Code) status, and RAS (Reliability, Availability, and Serviceability) features. These functions provide access to hardware error counters, ECC memory protection status, RAS feature configuration, and CPER (Common Platform Error Record) entries for comprehensive GPU health monitoring and reliability analysis.

Capabilities

Get GPU Total ECC Count

Query total ECC error counts across all GPU blocks.

def amdsmi_get_gpu_total_ecc_count(
    processor_handle: processor_handle
) -> Dict[str, int]:
    """
    Get total ECC error count across all GPU blocks.

    Retrieves cumulative ECC error counts for the entire GPU, including correctable,
    uncorrectable, and deferred errors. This provides an overall view of memory
    reliability and error rates across all ECC-protected blocks.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query

    Returns:
    - Dict[str, int]: Dictionary containing total error counts with keys:
        - 'correctable_count' (int): Total number of correctable ECC errors detected
          and fixed by hardware. These errors don't affect data integrity.
        - 'uncorrectable_count' (int): Total number of uncorrectable ECC errors
          detected. These errors indicate data corruption and potential reliability issues.
        - 'deferred_count' (int): Total number of deferred errors. These are
          uncorrectable errors that haven't yet caused an immediate failure.

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve ECC counts or ECC not supported

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get total ECC error counts
        ecc_counts = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
        print(f"Total ECC Errors:")
        print(f"  Correctable: {ecc_counts['correctable_count']}")
        print(f"  Uncorrectable: {ecc_counts['uncorrectable_count']}")
        print(f"  Deferred: {ecc_counts['deferred_count']}")

        # Check for reliability issues
        if ecc_counts['uncorrectable_count'] > 0:
            print("WARNING: Uncorrectable ECC errors detected!")

        if ecc_counts['correctable_count'] > 1000:
            print("NOTICE: High correctable error rate may indicate degradation")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU ECC Count for Block

Query ECC error counts for a specific GPU block.

def amdsmi_get_gpu_ecc_count(
    processor_handle: processor_handle,
    block: AmdSmiGpuBlock
) -> Dict[str, int]:
    """
    Get ECC error count for a specific GPU block.

    Retrieves ECC error counts for a particular hardware block on the GPU,
    allowing pinpointing of which component is experiencing errors. This is
    useful for diagnosing specific hardware issues.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query
    - block (AmdSmiGpuBlock): The GPU block to query for errors. Common blocks:
        - UMC: Unified Memory Controller (memory interface)
        - SDMA: System DMA engine
        - GFX: Graphics engine
        - MMHUB: Multimedia hub
        - HDP: Host Data Path
        - XGMI_WAFL: XGMI interconnect
        - And other blocks (see AmdSmiGpuBlock enum)

    Returns:
    - Dict[str, int]: Dictionary containing error counts for the specified block:
        - 'correctable_count' (int): Correctable errors in this block
        - 'uncorrectable_count' (int): Uncorrectable errors in this block
        - 'deferred_count' (int): Deferred errors in this block

    Raises:
    - AmdSmiParameterException: If processor_handle or block is invalid
    - AmdSmiLibraryException: If unable to retrieve ECC counts for the block

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Check UMC (memory controller) for errors
        umc_errors = amdsmi.amdsmi_get_gpu_ecc_count(
            gpu,
            amdsmi.AmdSmiGpuBlock.UMC
        )
        print(f"UMC (Memory Controller) Errors:")
        print(f"  Correctable: {umc_errors['correctable_count']}")
        print(f"  Uncorrectable: {umc_errors['uncorrectable_count']}")

        # Check graphics engine for errors
        gfx_errors = amdsmi.amdsmi_get_gpu_ecc_count(
            gpu,
            amdsmi.AmdSmiGpuBlock.GFX
        )
        print(f"\nGFX (Graphics Engine) Errors:")
        print(f"  Correctable: {gfx_errors['correctable_count']}")
        print(f"  Uncorrectable: {gfx_errors['uncorrectable_count']}")

        # Scan all blocks for errors
        print("\nAll GPU Block Error Summary:")
        for block in amdsmi.AmdSmiGpuBlock:
            if block.name in ["INVALID", "RESERVED"]:
                continue
            try:
                errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, block)
                total = (errors['correctable_count'] +
                        errors['uncorrectable_count'] +
                        errors['deferred_count'])
                if total > 0:
                    print(f"  {block.name}: {total} errors")
            except amdsmi.AmdSmiLibraryException:
                pass  # Block not available or not monitored

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU ECC Enabled

Check which GPU blocks have ECC protection enabled.

def amdsmi_get_gpu_ecc_enabled(
    processor_handle: processor_handle
) -> int:
    """
    Get bitmask indicating which GPU blocks have ECC enabled.

    Returns a 64-bit bitmask where each bit represents whether ECC protection
    is enabled for a corresponding GPU block. This allows checking the ECC
    configuration across all GPU components.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query

    Returns:
    - int: 64-bit bitmask where each bit corresponds to a GPU block.
      A bit value of 1 indicates ECC is enabled for that block.
      The AmdSmiGpuBlock enum values are themselves single-bit flags, so a block's
      status can be tested with ecc_bitmask & block.value.

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve ECC enabled status

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get ECC enabled bitmask
        ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)
        print(f"ECC Enabled Bitmask: 0x{ecc_bitmask:016X}")

        # Check if specific blocks have ECC enabled
        print("\nECC Status by Block:")
        for block in amdsmi.AmdSmiGpuBlock:
            if block.name in ["INVALID", "RESERVED"]:
                continue

            # AmdSmiGpuBlock values are single-bit flags, so AND them directly with the mask
            if ecc_bitmask & block.value:
                print(f"  {block.name}: ENABLED")
            else:
                print(f"  {block.name}: DISABLED")

        # Check if any ECC is enabled
        if ecc_bitmask == 0:
            print("\nWARNING: No ECC protection enabled on this GPU")
        else:
            print(f"\nECC protection is enabled on {bin(ecc_bitmask).count('1')} blocks")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU ECC Status

Query ECC status/state for a specific GPU block.

def amdsmi_get_gpu_ecc_status(
    processor_handle: processor_handle,
    block: AmdSmiGpuBlock
) -> AmdSmiRasErrState:
    """
    Get ECC status for a specific GPU block.

    Retrieves the current RAS error state for a particular GPU block, indicating
    what type of error correction is active or if errors have been detected.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query
    - block (AmdSmiGpuBlock): The GPU block to query for ECC status

    Returns:
    - AmdSmiRasErrState: The current RAS error state for the block:
        - NONE: No RAS support or no errors
        - DISABLED: RAS/ECC is disabled for this block
        - PARITY: Parity checking is enabled
        - SING_C: Single-bit correction enabled
        - MULT_UC: Multi-bit uncorrectable error detected
        - POISON: Poison mode enabled (corrupted data is marked to prevent error propagation)
        - ENABLED: RAS/ECC is enabled (general)
        - INVALID: Invalid state

    Raises:
    - AmdSmiParameterException: If processor_handle or block is invalid
    - AmdSmiLibraryException: If unable to retrieve ECC status

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Check ECC status for memory controller
        umc_status = amdsmi.amdsmi_get_gpu_ecc_status(
            gpu,
            amdsmi.AmdSmiGpuBlock.UMC
        )
        print(f"UMC ECC Status: {umc_status.name}")

        # Check status across all blocks
        print("\nECC Status by Block:")
        for block in amdsmi.AmdSmiGpuBlock:
            if block.name in ["INVALID", "RESERVED"]:
                continue

            try:
                status = amdsmi.amdsmi_get_gpu_ecc_status(gpu, block)
                if status != amdsmi.AmdSmiRasErrState.NONE:
                    print(f"  {block.name}: {status.name}")

                    # Highlight concerning states
                    if status == amdsmi.AmdSmiRasErrState.MULT_UC:
                        print(f"    *** CRITICAL: Multi-bit uncorrectable error! ***")
                    elif status == amdsmi.AmdSmiRasErrState.DISABLED:
                        print(f"    Warning: ECC disabled")
            except amdsmi.AmdSmiLibraryException:
                pass  # Block not available

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU RAS Feature Info

Query RAS feature configuration and capabilities.

def amdsmi_get_gpu_ras_feature_info(
    processor_handle: processor_handle
) -> Dict[str, Any]:
    """
    Get RAS (Reliability, Availability, Serviceability) feature information.

    Retrieves information about the GPU's RAS capabilities, including the
    EEPROM version and supported error correction schemas. This provides
    insight into the GPU's error detection and correction capabilities.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query

    Returns:
    - Dict[str, Any]: Dictionary containing RAS feature information:
        - 'eeprom_version' (str): RAS EEPROM version as hex string
        - 'parity_schema' (bool): True if parity checking is supported
        - 'single_bit_schema' (bool): True if single-bit error correction supported
        - 'double_bit_schema' (bool): True if double-bit error detection supported
        - 'poison_schema' (bool): True if poison bit error handling supported

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve RAS feature info

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get RAS feature information
        ras_info = amdsmi.amdsmi_get_gpu_ras_feature_info(gpu)

        print("RAS Feature Information:")
        print(f"  EEPROM Version: {ras_info['eeprom_version']}")
        print(f"\nSupported Error Correction Schemas:")
        print(f"  Parity Checking: {ras_info['parity_schema']}")
        print(f"  Single-bit Correction: {ras_info['single_bit_schema']}")
        print(f"  Double-bit Detection: {ras_info['double_bit_schema']}")
        print(f"  Poison Bit Handling: {ras_info['poison_schema']}")

        # Determine overall RAS capability level
        if ras_info['double_bit_schema'] and ras_info['single_bit_schema']:
            print("\nRAS Capability: Advanced (SECDED - Single Error Correction, Double Error Detection)")
        elif ras_info['single_bit_schema']:
            print("\nRAS Capability: Basic (Single-bit correction)")
        elif ras_info['parity_schema']:
            print("\nRAS Capability: Minimal (Parity only)")
        else:
            print("\nRAS Capability: Limited or not configured")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU RAS Block Features Enabled

Query which RAS features are enabled for each GPU block.

def amdsmi_get_gpu_ras_block_features_enabled(
    processor_handle: processor_handle
) -> List[Dict[str, Any]]:
    """
    Get RAS features enabled status for all GPU blocks.

    Retrieves the RAS error state for each GPU block, providing a comprehensive
    view of which blocks have error protection enabled and their current status.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query

    Returns:
    - List[Dict[str, Any]]: List of dictionaries, one per GPU block, containing:
        - 'block' (str): Name of the GPU block (e.g., "UMC", "SDMA", "GFX")
        - 'status' (str): RAS error state name (e.g., "ENABLED", "DISABLED", "SING_C")

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve RAS block features

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get RAS features for all blocks
        ras_blocks = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(gpu)

        print("RAS Features by GPU Block:")
        print(f"{'Block':<15} {'Status':<20}")
        print("-" * 35)

        enabled_count = 0
        disabled_count = 0

        for block_info in ras_blocks:
            block_name = block_info['block']
            status = block_info['status']

            print(f"{block_name:<15} {status:<20}")

            if status == "ENABLED" or status == "SING_C":
                enabled_count += 1
            elif status == "DISABLED":
                disabled_count += 1

        print("\nSummary:")
        print(f"  Blocks with RAS enabled: {enabled_count}")
        print(f"  Blocks with RAS disabled: {disabled_count}")
        print(f"  Total blocks: {len(ras_blocks)}")

        # Filter for critical blocks with issues
        print("\nCritical Blocks Status:")
        critical_blocks = ["UMC", "GFX", "SDMA", "HDP"]
        for block_info in ras_blocks:
            if block_info['block'] in critical_blocks:
                status = block_info['status']
                status_indicator = "✓" if status != "DISABLED" else "✗"
                print(f"  {status_indicator} {block_info['block']}: {status}")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU CPER Entries

Retrieve CPER (Common Platform Error Record) entries.

def amdsmi_get_gpu_cper_entries(
    processor_handle: processor_handle,
    severity_mask: int,
    buffer_size: int = 4 * 1048576,
    cursor: int = 0
) -> Tuple[Dict[str, Any], int, List[Dict[str, Any]], int]:
    """
    Get CPER (Common Platform Error Record) entries from the GPU.

    CPER is a standardized error record format used in UEFI and ACPI for
    reporting hardware errors. This function retrieves raw CPER records
    from the GPU, which can be parsed to extract detailed error information.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query
    - severity_mask (int): Bitmask to filter CPER entries by severity level.
      Combine severity levels using bitwise OR:
        - 0x01: Recoverable errors
        - 0x02: Fatal errors
        - 0x04: Corrected errors
        - 0x08: Informational records
    - buffer_size (int, optional): Size of buffer for CPER data in bytes.
      Default is 4MB (4 * 1048576). Increase if expecting many records.
    - cursor (int, optional): Starting position for reading records, used for
      pagination. Default is 0 (start from beginning).

    Returns:
    - Tuple containing:
        1. entries (Dict[str, Any]): Placeholder dictionary for entry metadata
        2. buffer_size_used (int): Actual number of bytes used in buffer
        3. cper_data (List[Dict[str, Any]]): List of CPER records, each containing:
            - 'bytes' (List[int]): Raw CPER record as list of bytes
            - 'size' (int): Size of this CPER record in bytes
        4. next_cursor (int): Cursor position for next read (for pagination)

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve CPER entries

    Note:
    - Returns AMDSMI_STATUS_SUCCESS when all entries retrieved
    - Returns AMDSMI_STATUS_MORE_DATA if more entries available (use cursor for pagination)

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get all CPER entries (all severity levels)
        severity_mask = 0x0F  # All severities

        entries, buf_size, cper_data, next_cursor = amdsmi.amdsmi_get_gpu_cper_entries(
            gpu,
            severity_mask
        )

        print(f"Retrieved {len(cper_data)} CPER entries")
        print(f"Buffer size used: {buf_size} bytes")

        # Display CPER record information
        for i, record in enumerate(cper_data):
            print(f"\nCPER Record {i}:")
            print(f"  Size: {record['size']} bytes")
            print(f"  Data: {len(record['bytes'])} bytes")

            # First few bytes often contain header info
            if len(record['bytes']) >= 16:
                header_bytes = record['bytes'][:16]
                print(f"  Header: {' '.join(f'{b:02x}' for b in header_bytes)}")

        # Get only corrected errors
        print("\n--- Corrected Errors Only ---")
        corrected_mask = 0x04
        entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
            gpu,
            corrected_mask
        )
        print(f"Corrected error records: {len(cper_data)}")

        # Get only fatal errors
        print("\n--- Fatal Errors Only ---")
        fatal_mask = 0x02
        entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
            gpu,
            fatal_mask
        )
        print(f"Fatal error records: {len(cper_data)}")

        if len(cper_data) > 0:
            print("WARNING: Fatal errors detected!")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Enumerations

GPU Block Types

class AmdSmiGpuBlock(IntEnum):
    """
    GPU hardware block identifiers for error reporting and monitoring.

    These enums identify different functional blocks within the GPU that
    can be monitored for errors, ECC status, and RAS features.
    """
    INVALID = ...                 # Invalid block identifier
    UMC = ...                     # Unified Memory Controller (memory interface)
    SDMA = ...                    # System DMA engine
    GFX = ...                     # Graphics engine (compute/graphics shader cores)
    MMHUB = ...                   # Multimedia hub
    ATHUB = ...                   # Address Translation Hub
    PCIE_BIF = ...                # PCIe Bus Interface
    HDP = ...                     # Host Data Path
    XGMI_WAFL = ...               # XGMI interconnect (GPU-to-GPU communication)
    DF = ...                      # Data Fabric
    SMN = ...                     # System Management Network
    SEM = ...                     # Sensor Management
    MP0 = ...                     # Management Processor 0
    MP1 = ...                     # Management Processor 1
    FUSE = ...                    # Fuse controller
    MCA = ...                     # Machine Check Architecture
    VCN = ...                     # Video Core Next (video encode/decode)
    JPEG = ...                    # JPEG engine
    IH = ...                      # Interrupt Handler
    MPIO = ...                    # Multi-Purpose I/O
    RESERVED = ...                # Reserved block

RAS Error States

class AmdSmiRasErrState(IntEnum):
    """
    RAS (Reliability, Availability, Serviceability) error states.

    These states indicate the current error detection and correction
    status for a GPU block.
    """
    NONE = ...                    # No RAS support or no errors detected
    DISABLED = ...                # RAS/ECC is disabled for this block
    PARITY = ...                  # Parity checking is enabled (basic error detection)
    SING_C = ...                  # Single-bit error correction enabled (SEC)
    MULT_UC = ...                 # Multi-bit uncorrectable error detected (critical)
    POISON = ...                  # Poison mode enabled (corrupted data is marked to prevent propagation)
    ENABLED = ...                 # RAS/ECC is enabled (general state)
    INVALID = ...                 # Invalid state
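
The error states above can be collapsed into a coarse health classification when building monitoring or alerting logic. The helper below is a minimal sketch that assumes only the AmdSmiRasErrState members listed above; classify_ras_state is an illustrative name, not part of the amdsmi API.

import amdsmi

def classify_ras_state(state):
    """Map an AmdSmiRasErrState value to a coarse health label (sketch only)."""
    if state == amdsmi.AmdSmiRasErrState.MULT_UC:
        return "CRITICAL"   # multi-bit uncorrectable error detected
    if state in (amdsmi.AmdSmiRasErrState.DISABLED, amdsmi.AmdSmiRasErrState.INVALID):
        return "WARNING"    # protection disabled or state could not be determined
    return "OK"             # NONE, PARITY, SING_C, POISON, ENABLED

# Example:
#   status = amdsmi.amdsmi_get_gpu_ecc_status(gpu, amdsmi.AmdSmiGpuBlock.UMC)
#   print(classify_ras_state(status))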

Usage Patterns

Basic ECC Error Monitoring

Monitor ECC error counts across all GPU blocks:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, gpu in enumerate(devices):
        print(f"\nGPU {i} - ECC Error Summary:")

        # Get total error counts
        try:
            total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu)
            print(f"  Total Correctable Errors: {total_errors['correctable_count']}")
            print(f"  Total Uncorrectable Errors: {total_errors['uncorrectable_count']}")
            print(f"  Total Deferred Errors: {total_errors['deferred_count']}")

            # Check for concerning error levels
            if total_errors['uncorrectable_count'] > 0:
                print("  *** WARNING: Uncorrectable errors detected! ***")

            if total_errors['correctable_count'] > 10000:
                print("  *** NOTICE: High correctable error rate ***")

        except amdsmi.AmdSmiLibraryException as e:
            print(f"  ECC not supported or error retrieving counts: {e}")

finally:
    amdsmi.amdsmi_shut_down()

Detailed Block-Level Error Analysis

Scan all GPU blocks for errors and identify problem areas:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    print("GPU Block Error Analysis:")
    print(f"{'Block':<15} {'Correctable':<12} {'Uncorrectable':<15} {'Deferred':<10} {'Total':<10}")
    print("-" * 70)

    error_blocks = []

    for block in amdsmi.AmdSmiGpuBlock:
        if block.name in ["INVALID", "RESERVED"]:
            continue

        try:
            errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu, block)

            corr = errors['correctable_count']
            uncorr = errors['uncorrectable_count']
            defer = errors['deferred_count']
            total = corr + uncorr + defer

            # Only display blocks with errors
            if total > 0:
                print(f"{block.name:<15} {corr:<12} {uncorr:<15} {defer:<10} {total:<10}")

                if uncorr > 0:
                    error_blocks.append((block.name, uncorr))

        except amdsmi.AmdSmiLibraryException:
            pass  # Block not available

    # Highlight critical blocks with uncorrectable errors
    if error_blocks:
        print("\nCRITICAL: Blocks with uncorrectable errors:")
        for block_name, count in sorted(error_blocks, key=lambda x: x[1], reverse=True):
            print(f"  {block_name}: {count} uncorrectable errors")
    else:
        print("\nNo uncorrectable errors detected.")

finally:
    amdsmi.amdsmi_shut_down()

RAS Feature Discovery

Check RAS capabilities and configuration:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    print("=== RAS Feature Information ===\n")

    # Get RAS feature info
    ras_info = amdsmi.amdsmi_get_gpu_ras_feature_info(gpu)

    print(f"EEPROM Version: {ras_info['eeprom_version']}")
    print("\nSupported Error Correction Schemas:")
    print(f"  Parity:            {'Yes' if ras_info['parity_schema'] else 'No'}")
    print(f"  Single-bit (SEC):  {'Yes' if ras_info['single_bit_schema'] else 'No'}")
    print(f"  Double-bit (DED):  {'Yes' if ras_info['double_bit_schema'] else 'No'}")
    print(f"  Poison:            {'Yes' if ras_info['poison_schema'] else 'No'}")

    # Determine ECC capability level
    if ras_info['single_bit_schema'] and ras_info['double_bit_schema']:
        capability = "SECDED (Single Error Correction, Double Error Detection)"
    elif ras_info['single_bit_schema']:
        capability = "SEC (Single Error Correction)"
    elif ras_info['parity_schema']:
        capability = "Parity Only"
    else:
        capability = "None or Unknown"

    print(f"\nECC Capability Level: {capability}")

    # Get which blocks have ECC enabled
    print("\n=== ECC Enabled Status ===\n")
    ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu)

    enabled_blocks = []
    for block in amdsmi.AmdSmiGpuBlock:
        if block.name in ["INVALID", "RESERVED"]:
            continue
        if ecc_bitmask & block.value:  # block values are single-bit flags
            enabled_blocks.append(block.name)

    if enabled_blocks:
        print(f"ECC Enabled on {len(enabled_blocks)} blocks:")
        for block in enabled_blocks:
            print(f"  - {block}")
    else:
        print("No blocks have ECC enabled")

    # Get detailed RAS status per block
    print("\n=== RAS Block Features ===\n")
    ras_blocks = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(gpu)

    for block_info in ras_blocks:
        block = block_info['block']
        status = block_info['status']

        # Highlight important blocks
        if block in ["UMC", "GFX", "SDMA", "HDP"]:
            marker = "***" if status == "DISABLED" else "   "
            print(f"{marker} {block:<15} {status}")

finally:
    amdsmi.amdsmi_shut_down()

ECC Health Check

Comprehensive ECC health monitoring function:

import amdsmi

def check_ecc_health(gpu_handle, gpu_id):
    """Perform comprehensive ECC health check."""
    print(f"\n{'='*60}")
    print(f"GPU {gpu_id} - ECC Health Check")
    print(f"{'='*60}\n")

    health_ok = True

    # Check if ECC is enabled
    try:
        ecc_bitmask = amdsmi.amdsmi_get_gpu_ecc_enabled(gpu_handle)
        if ecc_bitmask == 0:
            print("⚠ WARNING: ECC is not enabled on any GPU blocks")
            health_ok = False
        else:
            enabled_count = bin(ecc_bitmask).count('1')
            print(f"✓ ECC enabled on {enabled_count} blocks")
    except amdsmi.AmdSmiLibraryException as e:
        print(f"✗ Unable to check ECC enabled status: {e}")
        health_ok = False

    # Check total error counts
    try:
        total_errors = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu_handle)

        print(f"\nTotal Error Counts:")
        print(f"  Correctable:   {total_errors['correctable_count']}")
        print(f"  Uncorrectable: {total_errors['uncorrectable_count']}")
        print(f"  Deferred:      {total_errors['deferred_count']}")

        # Check for critical issues
        if total_errors['uncorrectable_count'] > 0:
            print(f"\n✗ CRITICAL: {total_errors['uncorrectable_count']} uncorrectable errors detected!")
            health_ok = False

        # Check for high correctable error rate
        if total_errors['correctable_count'] > 10000:
            print(f"\n⚠ WARNING: High correctable error rate ({total_errors['correctable_count']})")
            print("  This may indicate memory degradation")
            health_ok = False
        elif total_errors['correctable_count'] > 0:
            print(f"\nℹ {total_errors['correctable_count']} correctable errors (within normal range)")

    except amdsmi.AmdSmiLibraryException as e:
        print(f"✗ Unable to retrieve error counts: {e}")
        health_ok = False

    # Check critical blocks
    print(f"\nCritical Block Status:")
    critical_blocks = [
        amdsmi.AmdSmiGpuBlock.UMC,     # Memory controller
        amdsmi.AmdSmiGpuBlock.GFX,     # Graphics engine
        amdsmi.AmdSmiGpuBlock.HDP,     # Host data path
        amdsmi.AmdSmiGpuBlock.SDMA,    # System DMA
    ]

    for block in critical_blocks:
        try:
            # Check ECC status
            status = amdsmi.amdsmi_get_gpu_ecc_status(gpu_handle, block)

            # Check error counts
            errors = amdsmi.amdsmi_get_gpu_ecc_count(gpu_handle, block)
            uncorr = errors['uncorrectable_count']

            status_str = f"{block.name:<10} Status: {status.name:<10}"
            if uncorr > 0:
                print(f"  ✗ {status_str} Uncorrectable: {uncorr}")
                health_ok = False
            elif status == amdsmi.AmdSmiRasErrState.DISABLED:
                print(f"  ⚠ {status_str} (ECC disabled)")
            else:
                print(f"  ✓ {status_str}")

        except amdsmi.AmdSmiLibraryException:
            print(f"  - {block.name:<10} Not available")

    # Overall health verdict
    print(f"\n{'='*60}")
    if health_ok:
        print("✓ ECC Health: GOOD")
    else:
        print("✗ ECC Health: ISSUES DETECTED - Investigation recommended")
    print(f"{'='*60}")

    return health_ok

# Example usage
amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()

    all_healthy = True
    for i, gpu in enumerate(devices):
        is_healthy = check_ecc_health(gpu, i)
        all_healthy = all_healthy and is_healthy

    if not all_healthy:
        print("\n*** Some GPUs have ECC issues - check logs above ***")

finally:
    amdsmi.amdsmi_shut_down()

CPER Error Record Retrieval

Retrieve and analyze CPER error records:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    print("Retrieving CPER Error Records...\n")

    # Define severity levels
    severity_levels = {
        "Recoverable": 0x01,
        "Fatal": 0x02,
        "Corrected": 0x04,
        "Informational": 0x08,
    }

    # Get all error records
    all_severity = 0x0F  # All levels

    try:
        entries, buf_size, cper_data, cursor = amdsmi.amdsmi_get_gpu_cper_entries(
            gpu,
            all_severity
        )

        print(f"Total CPER records retrieved: {len(cper_data)}")
        print(f"Buffer size used: {buf_size} bytes\n")

        if len(cper_data) == 0:
            print("No CPER error records found.")
        else:
            # Analyze each record
            for i, record in enumerate(cper_data):
                print(f"Record {i}:")
                print(f"  Size: {record['size']} bytes")

                # Display first 32 bytes of record (header region)
                if len(record['bytes']) >= 32:
                    header = record['bytes'][:32]
                    print(f"  Header (first 32 bytes):")
                    for j in range(0, 32, 16):
                        hex_str = ' '.join(f'{b:02x}' for b in header[j:j+16])
                        print(f"    {hex_str}")
                print()

        # Get records by severity level
        print("\nRecords by Severity Level:")
        for severity_name, severity_mask in severity_levels.items():
            try:
                entries, buf_size, cper_data, _ = amdsmi.amdsmi_get_gpu_cper_entries(
                    gpu,
                    severity_mask
                )
                print(f"  {severity_name}: {len(cper_data)} records")

                if severity_name == "Fatal" and len(cper_data) > 0:
                    print("    *** CRITICAL: Fatal errors present! ***")

            except amdsmi.AmdSmiLibraryException:
                print(f"  {severity_name}: Unable to retrieve")

    except amdsmi.AmdSmiLibraryException as e:
        print(f"Error retrieving CPER entries: {e}")

finally:
    amdsmi.amdsmi_shut_down()

Continuous ECC Monitoring

Monitor ECC errors over time to detect degradation:

import amdsmi
import time
from collections import defaultdict

class ECCMonitor:
    """Monitor ECC errors over time."""

    def __init__(self, gpu_handle):
        self.gpu_handle = gpu_handle
        self.history = []
        self.baseline = None

    def capture_baseline(self):
        """Capture initial error counts as baseline."""
        try:
            counts = amdsmi.amdsmi_get_gpu_total_ecc_count(self.gpu_handle)
            self.baseline = counts.copy()
            print(f"Baseline captured: {counts}")
        except amdsmi.AmdSmiLibraryException as e:
            print(f"Unable to capture baseline: {e}")

    def check_errors(self):
        """Check current error counts and compare to baseline."""
        try:
            counts = amdsmi.amdsmi_get_gpu_total_ecc_count(self.gpu_handle)

            if self.baseline:
                delta = {
                    'correctable_count': counts['correctable_count'] - self.baseline['correctable_count'],
                    'uncorrectable_count': counts['uncorrectable_count'] - self.baseline['uncorrectable_count'],
                    'deferred_count': counts['deferred_count'] - self.baseline['deferred_count'],
                }
            else:
                delta = None

            self.history.append({
                'timestamp': time.time(),
                'counts': counts,
                'delta': delta,
            })

            return counts, delta

        except amdsmi.AmdSmiLibraryException as e:
            print(f"Error checking ECC counts: {e}")
            return None, None

    def get_error_rate(self, window_seconds=60):
        """Calculate error rate over time window."""
        now = time.time()
        cutoff = now - window_seconds

        recent = [h for h in self.history if h['timestamp'] >= cutoff]

        if len(recent) < 2:
            return None

        first = recent[0]['counts']
        last = recent[-1]['counts']
        duration = recent[-1]['timestamp'] - recent[0]['timestamp']

        if duration == 0:
            return None

        return {
            'correctable_per_second': (last['correctable_count'] - first['correctable_count']) / duration,
            'uncorrectable_per_second': (last['uncorrectable_count'] - first['uncorrectable_count']) / duration,
            'duration': duration,
        }

# Example: Monitor for 5 minutes
amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    monitor = ECCMonitor(gpu)
    monitor.capture_baseline()

    print("\nMonitoring ECC errors for 5 minutes...")
    print("Press Ctrl+C to stop early\n")

    try:
        for i in range(60):  # 5 minutes with 5-second intervals
            time.sleep(5)

            counts, delta = monitor.check_errors()

            if counts and delta:
                print(f"[{i*5:3d}s] Correctable: +{delta['correctable_count']:3d}  "
                      f"Uncorrectable: +{delta['uncorrectable_count']:3d}  "
                      f"Deferred: +{delta['deferred_count']:3d}")

                # Alert on new uncorrectable errors
                if delta['uncorrectable_count'] > 0:
                    print(f"       *** ALERT: New uncorrectable errors detected! ***")

            # Show error rate every 60 seconds
            if (i + 1) % 12 == 0:  # Every 60 seconds
                rate = monitor.get_error_rate(60)
                if rate:
                    print(f"\n  Error Rate (last 60s):")
                    print(f"    Correctable: {rate['correctable_per_second']:.4f} errors/sec")
                    print(f"    Uncorrectable: {rate['uncorrectable_per_second']:.4f} errors/sec\n")

    except KeyboardInterrupt:
        print("\nMonitoring stopped by user")

    # Final summary
    print("\n=== Monitoring Summary ===")
    if monitor.baseline and len(monitor.history) > 0:
        final = monitor.history[-1]['counts']
        print(f"Initial errors: {monitor.baseline}")
        print(f"Final errors:   {final}")
        print(f"\nTotal new errors:")
        print(f"  Correctable:   {final['correctable_count'] - monitor.baseline['correctable_count']}")
        print(f"  Uncorrectable: {final['uncorrectable_count'] - monitor.baseline['uncorrectable_count']}")
        print(f"  Deferred:      {final['deferred_count'] - monitor.baseline['deferred_count']}")

finally:
    amdsmi.amdsmi_shut_down()

Notes

ECC Overview

  • ECC (Error Correcting Code): Hardware mechanism for detecting and correcting memory errors
  • Correctable errors: Single-bit errors that ECC can fix automatically without data loss
  • Uncorrectable errors: Multi-bit errors that ECC cannot fix, indicating data corruption
  • Deferred errors: Uncorrectable errors that haven't caused immediate failure but may lead to issues

RAS Features

  • RAS (Reliability, Availability, Serviceability): Framework for error detection, correction, and reporting
  • Parity: Basic error detection (can detect errors but not correct them)
  • SEC (Single Error Correction): Can correct single-bit errors
  • SECDED (Single Error Correction, Double Error Detection): Can correct single-bit and detect double-bit errors
  • Poison: Marks corrupted data to prevent propagation of errors

GPU Blocks

Critical blocks to monitor:

  • UMC (Unified Memory Controller): Memory interface - errors here indicate VRAM issues
  • GFX (Graphics Engine): Compute and graphics cores - errors affect computation accuracy
  • SDMA (System DMA): Data transfer engine - errors affect data integrity during transfers
  • HDP (Host Data Path): CPU-GPU communication - errors affect host data transfers
  • XGMI_WAFL: GPU-to-GPU interconnect - errors affect multi-GPU communication

Error Interpretation

  • Low correctable error count (< 1000): Normal operation, ECC working as designed
  • High correctable error count (> 10,000): May indicate memory degradation, monitor closely
  • Any uncorrectable errors: Critical issue, indicates data corruption has occurred
  • Increasing error rate: Suggests progressive hardware failure (a triage sketch based on these thresholds follows below)
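
The thresholds above can be captured in a small triage helper. This is a rough sketch using the illustrative 1,000 and 10,000 correctable-error thresholds from this section; triage_ecc_counts is not part of amdsmi, and the thresholds should be tuned per fleet and GPU model.

def triage_ecc_counts(counts):
    """Classify a dict returned by amdsmi_get_gpu_total_ecc_count() (sketch only)."""
    if counts['uncorrectable_count'] > 0:
        return "CRITICAL"   # data corruption has occurred
    if counts['correctable_count'] > 10000:
        return "DEGRADING"  # possible memory degradation, monitor closely
    if counts['correctable_count'] > 1000:
        return "ELEVATED"   # above the illustrative 'normal' range
    return "NORMAL"         # ECC working as designed

# Example:
#   verdict = triage_ecc_counts(amdsmi.amdsmi_get_gpu_total_ecc_count(gpu))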

Get and Reset XGMI Errors

Query and reset XGMI link error status.

def amdsmi_gpu_xgmi_error_status(processor_handle: processor_handle) -> AmdSmiXgmiStatus:
    """
    Get XGMI error status for a GPU.

    Retrieves the current XGMI (AMD Infinity Fabric) link error status. XGMI is
    the high-speed interconnect used for GPU-to-GPU communication in multi-GPU
    systems. This function helps detect link errors that may affect communication
    between GPUs.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query

    Returns:
    - AmdSmiXgmiStatus: XGMI status enumeration value indicating error state

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to retrieve XGMI error status

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()

        # Check XGMI error status for each GPU
        print("XGMI Error Status:")
        for i, gpu in enumerate(devices):
            try:
                xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
                print(f"  GPU {i}: {xgmi_status}")

                # Check for errors (non-zero status typically indicates errors)
                if xgmi_status != 0:
                    print(f"    WARNING: GPU {i} has XGMI errors!")

            except amdsmi.AmdSmiLibraryException as e:
                print(f"  GPU {i}: XGMI status not available")

    finally:
        amdsmi.amdsmi_shut_down()
    ```

    Note:
    - Only applicable to systems with multiple GPUs connected via XGMI
    - Returns error status for the XGMI links connected to this GPU
    - Non-zero values typically indicate link errors or degraded performance
    """

def amdsmi_reset_gpu_xgmi_error(processor_handle: processor_handle) -> None:
    """
    Reset XGMI error counters for a GPU.

    Clears accumulated XGMI error counters, allowing fresh error monitoring.
    This is useful after diagnosing and addressing XGMI link issues, or for
    establishing a clean baseline for error monitoring.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device

    Returns:
    - None

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: If unable to reset XGMI errors (may require root privileges)

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()

        for i, gpu in enumerate(devices):
            try:
                # Check current XGMI error status
                xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
                print(f"GPU {i} XGMI status: {xgmi_status}")

                if xgmi_status != 0:
                    print(f"  Resetting XGMI errors on GPU {i}...")
                    amdsmi.amdsmi_reset_gpu_xgmi_error(gpu)

                    # Verify errors were cleared
                    new_status = amdsmi.amdsmi_gpu_xgmi_error_status(gpu)
                    print(f"  After reset: {new_status}")

            except amdsmi.AmdSmiLibraryException as e:
                print(f"  GPU {i}: Cannot reset XGMI errors")

    finally:
        amdsmi.amdsmi_shut_down()
    ```

    Note:
    - Typically requires root/administrator privileges
    - Only applicable to systems with XGMI-connected GPUs
    - Resets error counters but does not fix underlying hardware issues
    - Use after addressing physical link problems or for baseline establishment
    """

Related Concepts

CPER (Common Platform Error Record)

  • Standard error record format defined by UEFI specification
  • Used for hardware error reporting in enterprise systems
  • Records contain detailed error information including:
    • Error type and severity
    • Location (which GPU block)
    • Timestamp
    • Additional diagnostic data
  • Useful for RAS-capable enterprise GPUs and data center deployments (a header-parsing sketch follows below)
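
The raw bytes returned by amdsmi_get_gpu_cper_entries() follow the UEFI CPER record header layout. The parser below is a rough sketch based on the UEFI specification (Appendix N); the field offsets and severity encoding should be verified against the spec revision your platform implements, and the record dicts are assumed to carry the 'bytes' key described earlier in this document.

import struct

# Severity encoding from the UEFI CPER record header (0 = Recoverable, 1 = Fatal,
# 2 = Corrected, 3 = Informational); verify against your UEFI spec revision.
CPER_SEVERITY = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}

def parse_cper_header(record_bytes):
    """Extract a few fields from a raw CPER record header (sketch only)."""
    data = bytes(record_bytes)
    if len(data) < 24 or data[0:4] != b"CPER":
        return None  # too short or signature mismatch
    section_count, = struct.unpack_from("<H", data, 10)   # offset 10: section count
    severity, = struct.unpack_from("<I", data, 12)        # offset 12: error severity
    record_length, = struct.unpack_from("<I", data, 20)   # offset 20: record length
    return {
        "section_count": section_count,
        "severity": CPER_SEVERITY.get(severity, f"Unknown ({severity})"),
        "record_length": record_length,
    }

# Example with records from amdsmi_get_gpu_cper_entries():
#   for record in cper_data:
#       header = parse_cper_header(record['bytes'])
#       if header:
#           print(header)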

Best Practices

  1. Check ECC status at startup: Verify ECC is enabled before workload execution (see the pre-flight sketch after this list)
  2. Monitor critical blocks: Focus on UMC, GFX, and SDMA blocks
  3. Set thresholds: Alert on uncorrectable errors or high correctable error rates
  4. Baseline measurements: Capture initial error counts to detect new errors
  5. Periodic checks: Monitor errors at regular intervals during long-running workloads
  6. Log errors: Record error events for trend analysis and predictive maintenance
  7. Correlate with workload: High error rates during specific operations may indicate sensitivity
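
A minimal pre-flight check combining practices 1, 3, and 4 might look like the sketch below. It relies only on the amdsmi calls documented above, reuses the illustrative thresholds from the Error Interpretation notes, and ecc_preflight is a hypothetical helper name.

import amdsmi

def ecc_preflight(gpu_handle):
    """Return (ok, baseline) before launching a workload (sketch only)."""
    try:
        # Practice 1: verify ECC is enabled on at least one block
        if amdsmi.amdsmi_get_gpu_ecc_enabled(gpu_handle) == 0:
            print("Pre-flight: ECC is not enabled on any block")
            return False, None
        # Practice 4: capture a baseline for later delta checks
        baseline = amdsmi.amdsmi_get_gpu_total_ecc_count(gpu_handle)
        # Practice 3: alert immediately on pre-existing uncorrectable errors
        if baseline['uncorrectable_count'] > 0:
            print("Pre-flight: pre-existing uncorrectable errors detected")
            return False, baseline
        return True, baseline
    except amdsmi.AmdSmiLibraryException as e:
        print(f"Pre-flight: ECC queries not supported ({e})")
        return False, None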

Performance Considerations

  • Error count queries are relatively lightweight operations
  • Can be called at 1-5 second intervals for active monitoring
  • CPER retrieval may be more expensive; call it less frequently
  • Consider buffering and batching error reports for high-frequency monitoring

Availability

  • ECC support varies by GPU model (typically data center/professional GPUs)
  • Consumer GPUs may not support ECC or RAS features
  • Not all blocks support ECC even on ECC-capable GPUs
  • Some functions may return AMDSMI_STATUS_NOT_SUPPORTED on unsupported hardware; the probe sketch below shows one way to detect this at runtime
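
Because support varies by GPU and block, one approach is to probe each query once at startup and only schedule the ones that succeed. The sketch below simply treats AmdSmiLibraryException as "not supported"; inspecting the exception's specific status code is possible but version-dependent, so it is not assumed here, and probe_error_monitoring_support is an illustrative name.

import amdsmi

def probe_error_monitoring_support(gpu_handle):
    """Report which error-monitoring queries work on this GPU (sketch only)."""
    probes = {
        "total_ecc_count": lambda: amdsmi.amdsmi_get_gpu_total_ecc_count(gpu_handle),
        "ecc_enabled_mask": lambda: amdsmi.amdsmi_get_gpu_ecc_enabled(gpu_handle),
        "ras_feature_info": lambda: amdsmi.amdsmi_get_gpu_ras_feature_info(gpu_handle),
    }
    supported = {}
    for name, call in probes.items():
        try:
            call()
            supported[name] = True
        except amdsmi.AmdSmiLibraryException:
            supported[name] = False  # unsupported on this hardware (or query failed)
    return supported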

Relationship to Other Functions

  • amdsmi_get_gpu_memory_reserved_pages(): Shows memory pages retired due to errors
  • amdsmi_get_gpu_bad_page_info(): Lists specific bad memory pages (see the cross-check sketch after this list)
  • amdsmi_get_gpu_metrics_info(): May include error-related metrics
  • amdsmi_get_violation_status(): Reports thermal or power violations that may correlate with errors
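
As an example of combining these calls, the sketch below cross-references UMC ECC counts with retired-page information. It assumes amdsmi_get_gpu_bad_page_info() returns a list of page records, which may differ across amdsmi versions; treat the exact return shape as an assumption to verify, and memory_error_snapshot is an illustrative name.

import amdsmi

def memory_error_snapshot(gpu_handle):
    """Correlate UMC ECC errors with retired memory pages (sketch only)."""
    snapshot = {}
    try:
        umc = amdsmi.amdsmi_get_gpu_ecc_count(gpu_handle, amdsmi.AmdSmiGpuBlock.UMC)
        snapshot['umc_uncorrectable'] = umc['uncorrectable_count']
    except amdsmi.AmdSmiLibraryException:
        snapshot['umc_uncorrectable'] = None
    try:
        # Assumed to return a list of bad/retired page records
        bad_pages = amdsmi.amdsmi_get_gpu_bad_page_info(gpu_handle)
        snapshot['bad_page_count'] = len(bad_pages)
    except amdsmi.AmdSmiLibraryException:
        snapshot['bad_page_count'] = None
    if snapshot.get('umc_uncorrectable') and not snapshot.get('bad_page_count'):
        print("UMC reports uncorrectable errors but no retired pages were listed")
    return snapshot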

Error Recovery

  • Correctable errors are automatically fixed by hardware (no action needed)
  • Uncorrectable errors typically require:
    1. Identifying affected memory region
    2. Retiring bad memory pages (if supported)
    3. Potential GPU replacement if errors persist
    4. Workload recomputation to recover from corrupted data

Enterprise and Data Center Use

These functions are particularly important for:

  • Data centers: Monitoring fleet health and predicting failures
  • HPC (High Performance Computing): Ensuring computational accuracy
  • AI/ML training: Detecting silent data corruption in long-running training jobs
  • Scientific computing: Validating result integrity
  • Cloud providers: SLA compliance and proactive hardware management