Hardware Topology

Functions for querying hardware topology information including NUMA affinity, inter-processor connectivity, PCIe bandwidth, XGMI links, and P2P (peer-to-peer) access capabilities. These functions help understand the physical and logical layout of processors and their interconnections in multi-GPU and heterogeneous computing systems.

Capabilities

Get NUMA Node Number

Get the NUMA (Non-Uniform Memory Access) node number for a processor.

def amdsmi_topo_get_numa_node_number(processor_handle: processor_handle) -> int:
    """
    Get the NUMA node number associated with a processor.

    NUMA nodes represent memory regions with different access latencies from different processors.
    This function returns the NUMA node closest to the specified processor, which is useful for
    optimizing memory allocation and data placement in NUMA systems.

    Parameters:
    - processor_handle (processor_handle): Handle for the processor to query

    Returns:
    - int: NUMA node number (typically 0-indexed)

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve NUMA node information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for processor in processors:
            numa_node = amdsmi.amdsmi_topo_get_numa_node_number(processor)
            print(f"Processor NUMA node: {numa_node}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

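A common use of the NUMA node number is to pin the host process to CPUs on the same node as the GPU. The sketch below assumes a Linux host, where sysfs lists a node's CPUs in `/sys/devices/system/node/node<N>/cpulist`; `parse_cpulist` and `cpus_for_numa_node` are hypothetical helper names, not part of amdsmi.

```python
from pathlib import Path
from typing import List


def parse_cpulist(text: str) -> List[int]:
    """Expand a Linux cpulist string such as "0-3,8-11" into CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty string or stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_for_numa_node(node: int) -> List[int]:
    """Read the CPUs of a NUMA node from sysfs; empty list if unavailable."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    try:
        return parse_cpulist(path.read_text())
    except OSError:
        return []
```

On Linux, a non-empty result could then be passed to `os.sched_setaffinity(0, cpus)` to keep the process close to the GPU's node.
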
Get Link Weight

Get the relative weight (distance) of the link between two processors.

def amdsmi_topo_get_link_weight(
    processor_handle_src: processor_handle,
    processor_handle_dst: processor_handle
) -> int:
    """
    Get the link weight (distance metric) between two processors.

    The link weight represents the relative cost or distance of communication between two
    processors. Lower weights indicate closer/faster connections. This metric helps determine
    optimal data placement and communication patterns in multi-processor systems.

    Parameters:
    - processor_handle_src (processor_handle): Handle for the source processor
    - processor_handle_dst (processor_handle): Handle for the destination processor

    Returns:
    - int: Link weight value (lower values indicate closer proximity)

    Raises:
    - AmdSmiParameterException: If either processor handle is not valid
    - AmdSmiLibraryException: If unable to retrieve link weight

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if len(processors) >= 2:
            weight = amdsmi.amdsmi_topo_get_link_weight(processors[0], processors[1])
            print(f"Link weight between GPU 0 and GPU 1: {weight}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

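Because lower weights indicate closer connections, one way to use this value is to pick the lowest-weight peer for a given processor. This is a minimal sketch: `nearest_peer` is a hypothetical helper, and `weight_fn` stands in for `amdsmi.amdsmi_topo_get_link_weight` so the logic can be exercised without a GPU.

```python
from typing import Callable, Optional, Sequence, Tuple


def nearest_peer(
    src_index: int,
    handles: Sequence[object],
    weight_fn: Callable[[object, object], int],
) -> Optional[Tuple[int, int]]:
    """Return (peer_index, weight) of the lowest-weight peer, or None.

    Pairs for which weight_fn raises are skipped, on the assumption that
    not every pair is guaranteed to report a weight.
    """
    best: Optional[Tuple[int, int]] = None
    for dst_index, dst in enumerate(handles):
        if dst_index == src_index:
            continue
        try:
            weight = weight_fn(handles[src_index], dst)
        except Exception:  # e.g. amdsmi.AmdSmiLibraryException
            continue
        if best is None or weight < best[1]:
            best = (dst_index, weight)
    return best
```

With real handles, the call would look like `nearest_peer(0, processors, amdsmi.amdsmi_topo_get_link_weight)`.
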
Get Min/Max Bandwidth Between Processors

Get the minimum and maximum bandwidth between two processors.

def amdsmi_get_minmax_bandwidth_between_processors(
    processor_handle_src: processor_handle,
    processor_handle_dst: processor_handle
) -> Dict[str, int]:
    """
    Get the minimum and maximum theoretical bandwidth between two processors.

    Returns the bandwidth capabilities of the link between two processors, which is useful
    for understanding data transfer performance characteristics and optimizing workload
    distribution.

    Parameters:
    - processor_handle_src (processor_handle): Handle for the source processor
    - processor_handle_dst (processor_handle): Handle for the destination processor

    Returns:
    - Dict[str, int]: Dictionary containing:
        - "min_bandwidth" (int): Minimum bandwidth in MB/s
        - "max_bandwidth" (int): Maximum bandwidth in MB/s

    Raises:
    - AmdSmiParameterException: If either processor handle is not valid
    - AmdSmiLibraryException: If unable to retrieve bandwidth information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if len(processors) >= 2:
            bw = amdsmi.amdsmi_get_minmax_bandwidth_between_processors(
                processors[0], processors[1]
            )
            print(f"Min bandwidth: {bw['min_bandwidth']} MB/s")
            print(f"Max bandwidth: {bw['max_bandwidth']} MB/s")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

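The maximum bandwidth gives a rough lower bound on transfer time for a buffer of known size. The sketch below assumes the value is in MB/s as documented above and that 1 MB means 1,000,000 bytes; `estimate_transfer_seconds` is a hypothetical helper, and real transfers will usually be slower than this theoretical figure.

```python
def estimate_transfer_seconds(num_bytes: int, bandwidth_mb_s: int) -> float:
    """Lower-bound transfer time, assuming 1 MB = 1,000,000 bytes."""
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (bandwidth_mb_s * 1_000_000)
```

For example, `estimate_transfer_seconds(buffer_size, bw['max_bandwidth'])` gives the best case and the same call with `bw['min_bandwidth']` gives a more pessimistic figure.
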
Get Link Metrics

Get detailed link metrics for all connections from a processor.

def amdsmi_get_link_metrics(processor_handle: processor_handle) -> Dict[str, Any]:
    """
    Get comprehensive link metrics for all connections from a processor.

    Returns detailed information about all links connected to the specified processor,
    including link types, bandwidth, and data transfer statistics. This is particularly
    useful for XGMI-connected GPUs.

    Parameters:
    - processor_handle (processor_handle): Handle for the processor to query

    Returns:
    - Dict[str, Any]: Dictionary containing:
        - "num_links" (int): Number of active links
        - "links" (List[Dict]): List of link information dictionaries, each containing:
            - "bdf" (str): BDF address of the connected device
            - "bit_rate" (int): Link bit rate
            - "max_bandwidth" (int): Maximum bandwidth in MB/s
            - "link_type" (int): Type of link (XGMI, PCIe, etc.)
            - "read" (int): Read bandwidth usage
            - "write" (int): Write bandwidth usage

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve link metrics

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for processor in processors:
            metrics = amdsmi.amdsmi_get_link_metrics(processor)
            print(f"Number of links: {metrics['num_links']}")
            for i, link in enumerate(metrics['links'][:metrics['num_links']]):
                print(f"  Link {i}:")
                print(f"    Connected to: {link['bdf']}")
                print(f"    Max bandwidth: {link['max_bandwidth']} MB/s")
                print(f"    Read: {link['read']} MB/s")
                print(f"    Write: {link['write']} MB/s")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get Link Type

Get the type of link between two processors.

def amdsmi_topo_get_link_type(
    processor_handle_src: processor_handle,
    processor_handle_dst: processor_handle
) -> Dict[str, int]:
    """
    Get the link type and hop count between two processors.

    Determines the type of interconnect (XGMI, PCIe, internal, etc.) between two processors
    and the number of hops required for communication.

    Parameters:
    - processor_handle_src (processor_handle): Handle for the source processor
    - processor_handle_dst (processor_handle): Handle for the destination processor

    Returns:
    - Dict[str, int]: Dictionary containing:
        - "hops" (int): Number of hops between processors
        - "type" (int): Link type enum value (see AmdSmiLinkType)

    Raises:
    - AmdSmiParameterException: If either processor handle is not valid
    - AmdSmiLibraryException: If unable to retrieve link type

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if len(processors) >= 2:
            link_info = amdsmi.amdsmi_topo_get_link_type(processors[0], processors[1])
            print(f"Link hops: {link_info['hops']}")
            print(f"Link type: {link_info['type']}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

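The `"type"` field is an integer, so printing it directly is not very readable. The sketch below converts it to an enum member name and falls back gracefully for values the enum does not define. `link_type_label` is a hypothetical helper, and `DemoLinkType` is a stand-in with made-up values used only to make the example runnable.

```python
from enum import IntEnum
from typing import Type


def link_type_label(value: int, enum_cls: Type[IntEnum]) -> str:
    """Return the enum member name for value, or a placeholder if unknown."""
    try:
        return enum_cls(value).name
    except ValueError:
        return f"UNRECOGNIZED({value})"


class DemoLinkType(IntEnum):
    # Hypothetical values for demonstration; not the real AmdSmiLinkType.
    DEMO_A = 0
    DEMO_B = 1


print(link_type_label(1, DemoLinkType))
print(link_type_label(99, DemoLinkType))
```

With the library installed, `link_type_label(link_info['type'], amdsmi.AmdSmiLinkType)` would be the corresponding call.
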
Get P2P Status

Get the peer-to-peer (P2P) status and capabilities between two processors.

def amdsmi_topo_get_p2p_status(
    processor_handle_src: processor_handle,
    processor_handle_dst: processor_handle
) -> Dict[str, Any]:
    """
    Get P2P (peer-to-peer) status and capabilities between two processors.

    Returns detailed information about P2P connectivity capabilities including coherency,
    atomics support, DMA capabilities, and bi-directional communication support.

    Parameters:
    - processor_handle_src (processor_handle): Handle for the source processor
    - processor_handle_dst (processor_handle): Handle for the destination processor

    Returns:
    - Dict[str, Any]: Dictionary containing:
        - "type" (int): P2P connection type
        - "cap" (Dict[str, bool]): Capability flags dictionary with:
            - "is_iolink_coherent" (bool): Whether the I/O link is cache coherent
            - "is_iolink_atomics_32bit" (bool): 32-bit atomic operations supported
            - "is_iolink_atomics_64bit" (bool): 64-bit atomic operations supported
            - "is_iolink_dma" (bool): DMA (Direct Memory Access) supported
            - "is_iolink_bi_directional" (bool): Bi-directional transfers supported

    Raises:
    - AmdSmiParameterException: If either processor handle is not valid
    - AmdSmiLibraryException: If unable to retrieve P2P status

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if len(processors) >= 2:
            p2p = amdsmi.amdsmi_topo_get_p2p_status(processors[0], processors[1])
            print("P2P capabilities between GPU 0 and GPU 1:")
            print(f"  Coherent: {p2p['cap']['is_iolink_coherent']}")
            print(f"  32-bit atomics: {p2p['cap']['is_iolink_atomics_32bit']}")
            print(f"  64-bit atomics: {p2p['cap']['is_iolink_atomics_64bit']}")
            print(f"  DMA: {p2p['cap']['is_iolink_dma']}")
            print(f"  Bi-directional: {p2p['cap']['is_iolink_bi_directional']}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

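A workload often needs a specific set of capabilities rather than the full report. The sketch below checks a returned dictionary against a list of required flags; `missing_p2p_caps` is a hypothetical helper that relies only on the `"cap"` keys documented above.

```python
from typing import Any, Dict, List


def missing_p2p_caps(p2p: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required capability flags that are absent or False."""
    caps = p2p.get("cap", {})
    return [name for name in required if not caps.get(name, False)]
```

An empty result means every required capability is reported as supported for that pair.
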
Check P2P Accessibility

Check if P2P access is possible between two processors.

def amdsmi_is_P2P_accessible(
    processor_handle_src: processor_handle,
    processor_handle_dst: processor_handle
) -> bool:
    """
    Check whether P2P (peer-to-peer) access is possible between two processors.

    Returns a simple boolean indicating whether direct P2P memory access is possible
    between the two specified processors. This is useful for quickly determining if
    direct GPU-to-GPU communication is available.

    Parameters:
    - processor_handle_src (processor_handle): Handle for the source processor
    - processor_handle_dst (processor_handle): Handle for the destination processor

    Returns:
    - bool: True if P2P access is possible, False otherwise

    Raises:
    - AmdSmiParameterException: If either processor handle is not valid
    - AmdSmiLibraryException: If unable to determine P2P accessibility

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if len(processors) >= 2:
            accessible = amdsmi.amdsmi_is_P2P_accessible(processors[0], processors[1])
            if accessible:
                print("P2P access is available between GPU 0 and GPU 1")
            else:
                print("P2P access is NOT available between GPU 0 and GPU 1")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

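Pairwise checks can be combined to find which processors form a P2P-connected group. The sketch below computes connected components, treating two processors as linked only when access is reported in both directions (an assumption made here for safety). `p2p_groups` is a hypothetical helper, and `accessible_fn` stands in for `amdsmi.amdsmi_is_P2P_accessible`.

```python
from typing import Callable, List, Sequence


def p2p_groups(
    handles: Sequence[object],
    accessible_fn: Callable[[object, object], bool],
) -> List[List[int]]:
    """Group processor indices into connected components of P2P access."""
    n = len(handles)
    seen = [False] * n
    groups: List[List[int]] = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        group: List[int] = []
        stack = [start]
        while stack:  # depth-first walk over mutually accessible peers
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if seen[j]:
                    continue
                if accessible_fn(handles[i], handles[j]) and accessible_fn(
                    handles[j], handles[i]
                ):
                    seen[j] = True
                    stack.append(j)
        groups.append(sorted(group))
    return groups
```

Members of a group are reachable through a chain of P2P links, which does not guarantee that every pair in the group is directly accessible.
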
Get XGMI Information

Get XGMI (AMD Infinity Fabric) link information for a processor.

def amdsmi_get_xgmi_info(processor_handle: processor_handle) -> Dict[str, int]:
    """
    Get XGMI (AMD Infinity Fabric) information for a processor.

    XGMI is AMD's high-speed interconnect technology that enables efficient GPU-to-GPU
    communication. This function returns XGMI-specific identifiers and configuration.

    Parameters:
    - processor_handle (processor_handle): Handle for the processor to query

    Returns:
    - Dict[str, int]: Dictionary containing:
        - "xgmi_lanes" (int): Number of XGMI lanes
        - "xgmi_hive_id" (int): XGMI hive identifier (GPUs in same hive can communicate)
        - "xgmi_node_id" (int): Unique node ID within the hive
        - "index" (int): Index of the device

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve XGMI information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for i, processor in enumerate(processors):
            xgmi = amdsmi.amdsmi_get_xgmi_info(processor)
            print(f"GPU {i} XGMI info:")
            print(f"  Lanes: {xgmi['xgmi_lanes']}")
            print(f"  Hive ID: {xgmi['xgmi_hive_id']}")
            print(f"  Node ID: {xgmi['xgmi_node_id']}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

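Since GPUs in the same hive can communicate over XGMI, grouping devices by `xgmi_hive_id` gives a quick view of the XGMI domains in a system. This is a minimal sketch; `group_by_hive` is a hypothetical helper that takes the dictionaries returned by `amdsmi_get_xgmi_info`, in processor order.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_hive(xgmi_infos: Sequence[Dict[str, int]]) -> Dict[int, List[int]]:
    """Map each hive ID to the list of GPU indices that report it."""
    hives: Dict[int, List[int]] = defaultdict(list)
    for gpu_index, info in enumerate(xgmi_infos):
        hives[info["xgmi_hive_id"]].append(gpu_index)
    return dict(hives)
```

With real handles, the input could be built as `[amdsmi.amdsmi_get_xgmi_info(p) for p in processors]`.
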
Get Link Topology Nearest

Get the nearest processors connected by a specific link type.

def amdsmi_get_link_topology_nearest(
    processor_handle: processor_handle,
    link_type: AmdSmiLinkType
) -> Dict[str, List[processor_handle]]:
    """
    Get the list of nearest processors connected by a specific link type.

    Returns a list of processor handles that are connected to the specified processor
    via the requested link type (XGMI, PCIe, etc.). This helps identify topology
    neighborhoods for optimization.

    Parameters:
    - processor_handle (processor_handle): Handle for the processor to query
    - link_type (AmdSmiLinkType): Type of link to search for (XGMI, PCIe, etc.)

    Returns:
    - Dict[str, List[processor_handle]]: Dictionary containing:
        - "processor_list" (List[processor_handle]): List of connected processor handles

    Raises:
    - AmdSmiParameterException: If processor_handle or link_type is not valid
    - AmdSmiLibraryException: If unable to retrieve topology information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if processors:
            # Find all GPUs connected via XGMI
            nearest = amdsmi.amdsmi_get_link_topology_nearest(
                processors[0],
                amdsmi.AmdSmiLinkType.AMDSMI_LINK_TYPE_XGMI
            )
            print(f"Found {len(nearest['processor_list'])} XGMI-connected GPUs")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU BDF ID

Get the BDF (Bus:Device.Function) ID as an integer for a GPU.

def amdsmi_get_gpu_bdf_id(processor_handle: processor_handle) -> int:
    """
    Get the BDF (Bus:Device.Function) identifier as a 64-bit integer.

    Returns the PCI BDF address encoded as an integer value. This is an alternative
    to amdsmi_get_gpu_device_bdf() which returns a formatted string.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - int: BDF ID encoded as a 64-bit integer

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve BDF ID

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for processor in processors:
            bdf_id = amdsmi.amdsmi_get_gpu_bdf_id(processor)
            print(f"GPU BDF ID: 0x{bdf_id:016x}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

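The integer form can be unpacked back into its PCI fields. The sketch below assumes the bit layout domain in bits 63-32, partition ID in bits 31-28, bus in bits 15-8, device in bits 7-3 and function in bits 2-0; this layout is an assumption that should be checked against the AMD SMI header for the installed version. `decode_bdf_id` and `format_bdf` are hypothetical helpers.

```python
from typing import Dict


def decode_bdf_id(bdf_id: int) -> Dict[str, int]:
    """Split a 64-bit BDF ID into fields (bit layout is an assumption)."""
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,
        "partition_id": (bdf_id >> 28) & 0xF,
        "bus": (bdf_id >> 8) & 0xFF,
        "device": (bdf_id >> 3) & 0x1F,
        "function": bdf_id & 0x7,
    }


def format_bdf(bdf_id: int) -> str:
    """Render the decoded fields in the usual dddd:bb:dd.f notation."""
    f = decode_bdf_id(bdf_id)
    return f"{f['domain']:04x}:{f['bus']:02x}:{f['device']:02x}.{f['function']:x}"
```

Comparing `format_bdf(bdf_id)` with the string from `amdsmi_get_gpu_device_bdf()` for the same handle is a simple way to confirm the layout on a given system.
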
Get GPU PCI Bandwidth

Get the PCIe bandwidth configuration for a GPU.

def amdsmi_get_gpu_pci_bandwidth(processor_handle: processor_handle) -> Dict[str, Any]:
    """
    Get PCIe bandwidth configuration including transfer rates and lanes.

    Returns information about the GPU's PCIe connection, including supported and current
    transfer rates (PCIe generations) and lane configurations.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - Dict[str, Any]: Dictionary containing:
        - "transfer_rate" (Dict): Transfer rate information with:
            - "num_supported" (int): Number of supported transfer rates
            - "current" (int): Current transfer rate index
            - "frequency" (List[int]): List of supported frequencies
        - "lanes" (List[int]): PCIe lane configurations

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve bandwidth information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for processor in processors:
            bw = amdsmi.amdsmi_get_gpu_pci_bandwidth(processor)
            print("PCIe bandwidth:")
            print(f"  Current rate index: {bw['transfer_rate']['current']}")
            print(f"  Supported rates: {bw['transfer_rate']['num_supported']}")
            print(f"  Frequencies: {bw['transfer_rate']['frequency']}")
            print(f"  Lane configs: {bw['lanes']}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

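Because `"current"` is described as an index, the active transfer rate has to be looked up in the `"frequency"` list. The sketch below does that lookup defensively, assuming the index refers to a position within the first `num_supported` entries; `current_pcie_frequency` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def current_pcie_frequency(bw: Dict[str, Any]) -> Optional[int]:
    """Return the frequency at the current index, or None if out of range."""
    rate = bw["transfer_rate"]
    index = rate["current"]
    # Only the first num_supported entries are assumed to be meaningful.
    freqs = rate["frequency"][: rate["num_supported"]]
    if 0 <= index < len(freqs):
        return freqs[index]
    return None
```

A `None` result signals that the reported index does not match the list, which is safer than raising `IndexError` in monitoring code.
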
Get GPU PCI Throughput

Get current PCIe throughput statistics for a GPU.

def amdsmi_get_gpu_pci_throughput(processor_handle: processor_handle) -> Dict[str, int]:
    """
    Get current PCIe throughput statistics including sent and received data.

    Returns real-time PCIe data transfer statistics, useful for monitoring actual
    PCIe bus utilization and identifying potential bottlenecks.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - Dict[str, int]: Dictionary containing:
        - "sent" (int): Number of bytes sent over PCIe during a one-second sampling window
        - "received" (int): Number of bytes received over PCIe during the same window
        - "max_pkt_sz" (int): Maximum packet size in bytes

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve throughput information

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        if processors:
            # Each call reports traffic for a one-second window, so the
            # values are already per-second rates and need no differencing.
            throughput = amdsmi.amdsmi_get_gpu_pci_throughput(processors[0])

            sent_rate = throughput['sent'] / 1e6
            recv_rate = throughput['received'] / 1e6
            print("PCIe throughput (MB/s):")
            print(f"  Sent: {sent_rate:.2f}")
            print(f"  Received: {recv_rate:.2f}")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU PCI Replay Counter

Get the PCIe replay counter for a GPU.

def amdsmi_get_gpu_pci_replay_counter(processor_handle: processor_handle) -> int:
    """
    Get the PCIe replay counter value for a GPU.

    The PCIe replay counter tracks the number of packet retransmissions on the PCIe bus.
    A high or increasing replay count may indicate signal integrity issues, electrical
    problems, or other PCIe link quality concerns.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - int: Number of PCIe packet replays

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve replay counter

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for i, processor in enumerate(processors):
            replay_count = amdsmi.amdsmi_get_gpu_pci_replay_counter(processor)
            print(f"GPU {i} PCIe replay count: {replay_count}")
            if replay_count > 0:
                print(f"  Warning: GPU {i} has PCIe replay events")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

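Since an increasing replay count is more telling than a single nonzero reading, it can help to look at the change between successive samples. The sketch below computes those deltas from a list of readings; `replay_deltas` is a hypothetical helper, and treating a drop in the counter as a reset is an assumption.

```python
from typing import List, Sequence


def replay_deltas(samples: Sequence[int]) -> List[int]:
    """Return the increase in replay count between consecutive samples."""
    deltas: List[int] = []
    for previous, current in zip(samples, samples[1:]):
        # A drop is treated as a counter reset (assumption): count from zero.
        deltas.append(current - previous if current >= previous else current)
    return deltas
```

Samples would come from calling `amdsmi_get_gpu_pci_replay_counter` at a fixed interval; a run of nonzero deltas suggests the link is still degrading rather than reflecting an old event.
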
Get GPU Topology NUMA Affinity

Get the NUMA affinity for a GPU (alternative to amdsmi_topo_get_numa_node_number).

def amdsmi_get_gpu_topo_numa_affinity(processor_handle: processor_handle) -> int:
    """
    Get the NUMA affinity (node number) for a GPU.

    This is an alternative function to amdsmi_topo_get_numa_node_number() that specifically
    targets GPU devices. Returns the NUMA node that the GPU is closest to, which is critical
    for optimizing memory allocation in NUMA systems.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - int: NUMA node number (-1 if not applicable or unknown)

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve NUMA affinity

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for i, processor in enumerate(processors):
            numa_node = amdsmi.amdsmi_get_gpu_topo_numa_affinity(processor)
            if numa_node >= 0:
                print(f"GPU {i} is on NUMA node {numa_node}")
            else:
                print(f"GPU {i} has no NUMA affinity")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU XGMI Error Status

Get the XGMI error status for a GPU.

def amdsmi_gpu_xgmi_error_status(processor_handle: processor_handle) -> str:
    """
    Get the XGMI error status for a GPU.

    Checks for XGMI link errors on the specified GPU. XGMI errors can indicate
    hardware issues, signal integrity problems, or other connectivity concerns
    in multi-GPU XGMI configurations.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - str: XGMI status string from AmdSmiXgmiStatus enum:
        - "NO_ERRORS": No XGMI errors detected
        - "ERROR": Single error detected
        - "MULTIPLE_ERRORS": Multiple errors detected

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to retrieve XGMI error status

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for i, processor in enumerate(processors):
            status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
            print(f"GPU {i} XGMI status: {status}")
            if status != "NO_ERRORS":
                print(f"  Warning: GPU {i} has XGMI errors!")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Reset GPU XGMI Error

Reset (clear) XGMI error counters for a GPU.

def amdsmi_reset_gpu_xgmi_error(processor_handle: processor_handle) -> None:
    """
    Reset (clear) XGMI error counters for a GPU.

    Clears the XGMI error status for the specified GPU. This is useful after
    acknowledging and addressing XGMI errors, allowing fresh monitoring of
    link health.

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU processor

    Returns:
    - None

    Raises:
    - AmdSmiParameterException: If processor_handle is not valid
    - AmdSmiLibraryException: If unable to reset XGMI errors

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        processors = amdsmi.amdsmi_get_processor_handles()
        for processor in processors:
            # Check for errors
            status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
            if status != "NO_ERRORS":
                print(f"Detected XGMI errors: {status}")
                # Clear the errors
                amdsmi.amdsmi_reset_gpu_xgmi_error(processor)
                print("XGMI errors cleared")
    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Enumerations

AmdSmiLinkType

Link type enumeration for topology queries.

class AmdSmiLinkType(IntEnum):
    """
    Enumeration of link types between processors.

    Used to specify or identify the type of interconnect between processors
    in topology queries.
    """
    AMDSMI_LINK_TYPE_INTERNAL = ...      # Internal/on-chip connection
    AMDSMI_LINK_TYPE_XGMI = ...          # XGMI (AMD Infinity Fabric) connection
    AMDSMI_LINK_TYPE_PCIE = ...          # PCIe connection
    AMDSMI_LINK_TYPE_NOT_APPLICABLE = ...  # No connection or not applicable
    AMDSMI_LINK_TYPE_UNKNOWN = ...       # Unknown link type

AmdSmiXgmiStatus

XGMI error status enumeration.

class AmdSmiXgmiStatus(IntEnum):
    """
    Enumeration of XGMI error status values.

    Indicates the error state of XGMI links on a GPU.
    """
    NO_ERRORS = ...         # No XGMI errors detected
    ERROR = ...             # Single XGMI error detected
    MULTIPLE_ERRORS = ...   # Multiple XGMI errors detected

Usage Examples

Multi-GPU Topology Discovery

Discover and analyze the topology of a multi-GPU system:

import amdsmi

amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    print(f"Found {len(processors)} GPUs\n")

    # Print topology information for each GPU
    for i, processor in enumerate(processors):
        print(f"GPU {i}:")

        # Get BDF and NUMA info
        bdf = amdsmi.amdsmi_get_gpu_device_bdf(processor)
        numa_node = amdsmi.amdsmi_get_gpu_topo_numa_affinity(processor)
        print(f"  BDF: {bdf}")
        print(f"  NUMA Node: {numa_node}")

        # Get XGMI info
        xgmi = amdsmi.amdsmi_get_xgmi_info(processor)
        print(f"  XGMI Hive ID: {xgmi['xgmi_hive_id']}")
        print(f"  XGMI Node ID: {xgmi['xgmi_node_id']}")
        print(f"  XGMI Lanes: {xgmi['xgmi_lanes']}")

        # Check XGMI status
        xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        print(f"  XGMI Status: {xgmi_status}")
        print()

    # Analyze GPU-to-GPU connectivity
    if len(processors) >= 2:
        print("GPU-to-GPU Connectivity:")
        for i in range(len(processors)):
            for j in range(i + 1, len(processors)):
                print(f"\nGPU {i} <-> GPU {j}:")

                # Check P2P accessibility
                accessible = amdsmi.amdsmi_is_P2P_accessible(
                    processors[i], processors[j]
                )
                print(f"  P2P Accessible: {accessible}")

                # Get link type
                link_info = amdsmi.amdsmi_topo_get_link_type(
                    processors[i], processors[j]
                )
                print(f"  Link Type: {link_info['type']}")
                print(f"  Hops: {link_info['hops']}")

                # Get link weight
                weight = amdsmi.amdsmi_topo_get_link_weight(
                    processors[i], processors[j]
                )
                print(f"  Link Weight: {weight}")

                # Get bandwidth
                bw = amdsmi.amdsmi_get_minmax_bandwidth_between_processors(
                    processors[i], processors[j]
                )
                print(f"  Min Bandwidth: {bw['min_bandwidth']} MB/s")
                print(f"  Max Bandwidth: {bw['max_bandwidth']} MB/s")

finally:
    amdsmi.amdsmi_shut_down()

Monitor PCIe Performance

Monitor PCIe throughput and link quality:

import amdsmi

amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()

    for i, processor in enumerate(processors):
        print(f"GPU {i} PCIe Information:")

        # Get bandwidth configuration
        bw = amdsmi.amdsmi_get_gpu_pci_bandwidth(processor)
        print(f"  Current PCIe rate index: {bw['transfer_rate']['current']}")
        print(f"  Supported rates: {bw['transfer_rate']['num_supported']}")

        # Get BDF ID
        bdf_id = amdsmi.amdsmi_get_gpu_bdf_id(processor)
        print(f"  BDF ID: 0x{bdf_id:016x}")

        # Get replay counter
        replay_count = amdsmi.amdsmi_get_gpu_pci_replay_counter(processor)
        print(f"  PCIe Replay Count: {replay_count}")
        if replay_count > 0:
            print("  WARNING: PCIe link quality issues detected!")

        # Read throughput; each call reports traffic for a one-second window,
        # so the values are already per-second rates
        throughput = amdsmi.amdsmi_get_gpu_pci_throughput(processor)

        sent_rate = throughput['sent'] / 1e6
        recv_rate = throughput['received'] / 1e6
        print("  PCIe Throughput (MB/s):")
        print(f"    Sent: {sent_rate:.2f}")
        print(f"    Received: {recv_rate:.2f}")
        print()

finally:
    amdsmi.amdsmi_shut_down()

XGMI Link Monitoring

Monitor XGMI link metrics and health:

import amdsmi

amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()

    for i, processor in enumerate(processors):
        print(f"GPU {i} XGMI Link Metrics:")

        # Get link metrics
        metrics = amdsmi.amdsmi_get_link_metrics(processor)
        print(f"  Number of links: {metrics['num_links']}")

        for link_idx in range(metrics['num_links']):
            link = metrics['links'][link_idx]
            print(f"\n  Link {link_idx}:")
            print(f"    Connected to BDF: {link['bdf']}")
            print(f"    Max Bandwidth: {link['max_bandwidth']} MB/s")
            print(f"    Current Read: {link['read']} MB/s")
            print(f"    Current Write: {link['write']} MB/s")
            print(f"    Bit Rate: {link['bit_rate']}")

        # Check XGMI error status
        status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        print(f"\n  XGMI Error Status: {status}")

        if status != "NO_ERRORS":
            print(f"  WARNING: XGMI errors detected on GPU {i}!")
            # Optionally clear errors
            # amdsmi.amdsmi_reset_gpu_xgmi_error(processor)

        print()

finally:
    amdsmi.amdsmi_shut_down()

P2P Capability Analysis

Analyze P2P capabilities in detail:

import amdsmi

amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()

    if len(processors) < 2:
        print("Need at least 2 GPUs for P2P analysis")
    else:
        print("P2P Capability Matrix:\n")

        # Create matrix header
        print("     ", end="")
        for i in range(len(processors)):
            print(f"GPU{i:2d} ", end="")
        print()

        # Print P2P accessibility matrix
        for i in range(len(processors)):
            print(f"GPU{i:2d}", end="")
            for j in range(len(processors)):
                if i == j:
                    print("  -  ", end="")
                else:
                    accessible = amdsmi.amdsmi_is_P2P_accessible(
                        processors[i], processors[j]
                    )
                    print(f"  {'Y' if accessible else 'N'}  ", end="")
            print()

        # Detailed P2P capabilities for first pair
        if len(processors) >= 2:
            print("\nDetailed P2P capabilities: GPU 0 <-> GPU 1")
            p2p = amdsmi.amdsmi_topo_get_p2p_status(processors[0], processors[1])

            print(f"  Cache Coherent: {p2p['cap']['is_iolink_coherent']}")
            print(f"  32-bit Atomics: {p2p['cap']['is_iolink_atomics_32bit']}")
            print(f"  64-bit Atomics: {p2p['cap']['is_iolink_atomics_64bit']}")
            print(f"  DMA Support: {p2p['cap']['is_iolink_dma']}")
            print(f"  Bi-directional: {p2p['cap']['is_iolink_bi_directional']}")

finally:
    amdsmi.amdsmi_shut_down()

Find XGMI Neighbors

Find all GPUs connected via XGMI to a specific GPU:

import amdsmi

amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()

    if processors:
        gpu_0 = processors[0]
        print("Finding XGMI neighbors of GPU 0...")

        # Get nearest XGMI-connected processors
        nearest = amdsmi.amdsmi_get_link_topology_nearest(
            gpu_0,
            amdsmi.AmdSmiLinkType.AMDSMI_LINK_TYPE_XGMI
        )

        xgmi_neighbors = nearest['processor_list']
        print(f"Found {len(xgmi_neighbors)} XGMI-connected GPUs")

        # Get details about each neighbor
        for neighbor in xgmi_neighbors:
            bdf = amdsmi.amdsmi_get_gpu_device_bdf(neighbor)
            xgmi_info = amdsmi.amdsmi_get_xgmi_info(neighbor)
            print(f"  BDF: {bdf}")
            print(f"    Node ID: {xgmi_info['xgmi_node_id']}")
            print(f"    Lanes: {xgmi_info['xgmi_lanes']}")

finally:
    amdsmi.amdsmi_shut_down()

Notes

  • NUMA affinity is critical for performance optimization in multi-socket systems
  • XGMI provides higher bandwidth and lower latency than PCIe for GPU-to-GPU communication
  • Link weights help algorithms choose optimal data placement and routing
  • P2P capabilities vary by system configuration and GPU models
  • PCIe replay counters indicate link quality; high or rising values suggest signal integrity or other hardware issues
  • XGMI hive ID groups GPUs that can communicate via XGMI
  • Not all topology features are available on all GPU generations or configurations
  • Some functions may require elevated privileges depending on the system configuration
  • Topology information is particularly important for multi-GPU machine learning and HPC workloads