Functions for querying hardware topology information including NUMA affinity, inter-processor connectivity, PCIe bandwidth, XGMI links, and P2P (peer-to-peer) access capabilities. These functions help understand the physical and logical layout of processors and their interconnections in multi-GPU and heterogeneous computing systems.
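The examples in this section all bracket their queries with `amdsmi_init()` and `amdsmi_shut_down()`. As a minimal sketch of how that pattern could be factored out, the hypothetical helper `smi_session` below wraps it in a context manager. The helper name and the choice to pass the module in as a parameter are assumptions made for illustration; they are not part of the library.

```python
from contextlib import contextmanager


@contextmanager
def smi_session(smi):
    """Initialize `smi`, yield its processor handles, and always shut down.

    `smi` is expected to be the imported `amdsmi` module. Passing it in as a
    parameter is a choice made for this sketch, so the helper can be tried
    with a stand-in object on machines without AMD hardware.
    """
    smi.amdsmi_init()
    try:
        yield smi.amdsmi_get_processor_handles()
    finally:
        # Runs even if the body of the `with` block raises.
        smi.amdsmi_shut_down()
```

With the real library this would be used as `with smi_session(amdsmi) as processors: ...`, which mirrors the `try`/`finally` structure of the examples in this section.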
Get the NUMA (Non-Uniform Memory Access) node number for a processor.
def amdsmi_topo_get_numa_node_number(processor_handle: processor_handle) -> int:
"""
Get the NUMA node number associated with a processor.
NUMA nodes represent memory regions with different access latencies from different processors.
This function returns the NUMA node closest to the specified processor, which is useful for
optimizing memory allocation and data placement in NUMA systems.
Parameters:
- processor_handle (processor_handle): Handle for the processor to query
Returns:
- int: NUMA node number (typically 0-indexed)
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve NUMA node information
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for processor in processors:
        numa_node = amdsmi.amdsmi_topo_get_numa_node_number(processor)
        print(f"Processor NUMA node: {numa_node}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the relative weight (distance) of the link between two processors.
def amdsmi_topo_get_link_weight(
processor_handle_src: processor_handle,
processor_handle_dst: processor_handle
) -> int:
"""
Get the link weight (distance metric) between two processors.
The link weight represents the relative cost or distance of communication between two
processors. Lower weights indicate closer/faster connections. This metric helps determine
optimal data placement and communication patterns in multi-processor systems.
Parameters:
- processor_handle_src (processor_handle): Handle for the source processor
- processor_handle_dst (processor_handle): Handle for the destination processor
Returns:
- int: Link weight value (lower values indicate closer proximity)
Raises:
- AmdSmiParameterException: If either processor handle is not valid
- AmdSmiLibraryException: If unable to retrieve link weight
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) >= 2:
        weight = amdsmi.amdsmi_topo_get_link_weight(processors[0], processors[1])
        print(f"Link weight between GPU 0 and GPU 1: {weight}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the minimum and maximum bandwidth between two processors.
def amdsmi_get_minmax_bandwidth_between_processors(
processor_handle_src: processor_handle,
processor_handle_dst: processor_handle
) -> Dict[str, int]:
"""
Get the minimum and maximum theoretical bandwidth between two processors.
Returns the bandwidth capabilities of the link between two processors, which is useful
for understanding data transfer performance characteristics and optimizing workload
distribution.
Parameters:
- processor_handle_src (processor_handle): Handle for the source processor
- processor_handle_dst (processor_handle): Handle for the destination processor
Returns:
- Dict[str, int]: Dictionary containing:
- "min_bandwidth" (int): Minimum bandwidth in MB/s
- "max_bandwidth" (int): Maximum bandwidth in MB/s
Raises:
- AmdSmiParameterException: If either processor handle is not valid
- AmdSmiLibraryException: If unable to retrieve bandwidth information
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) >= 2:
        bw = amdsmi.amdsmi_get_minmax_bandwidth_between_processors(
            processors[0], processors[1]
        )
        print(f"Min bandwidth: {bw['min_bandwidth']} MB/s")
        print(f"Max bandwidth: {bw['max_bandwidth']} MB/s")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get detailed link metrics for all connections from a processor.
def amdsmi_get_link_metrics(processor_handle: processor_handle) -> Dict[str, Any]:
"""
Get comprehensive link metrics for all connections from a processor.
Returns detailed information about all links connected to the specified processor,
including link types, bandwidth, and data transfer statistics. This is particularly
useful for XGMI-connected GPUs.
Parameters:
- processor_handle (processor_handle): Handle for the processor to query
Returns:
- Dict[str, Any]: Dictionary containing:
- "num_links" (int): Number of active links
- "links" (List[Dict]): List of link information dictionaries, each containing:
- "bdf" (str): BDF address of the connected device
- "bit_rate" (int): Link bit rate
- "max_bandwidth" (int): Maximum bandwidth in MB/s
- "link_type" (int): Type of link (XGMI, PCIe, etc.)
- "read" (int): Read bandwidth usage
- "write" (int): Write bandwidth usage
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve link metrics
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for processor in processors:
        metrics = amdsmi.amdsmi_get_link_metrics(processor)
        print(f"Number of links: {metrics['num_links']}")
        for i, link in enumerate(metrics['links'][:metrics['num_links']]):
            print(f" Link {i}:")
            print(f" Connected to: {link['bdf']}")
            print(f" Max bandwidth: {link['max_bandwidth']} MB/s")
            # Units for "read" and "write" are not stated above, so none are printed.
            print(f" Read: {link['read']}")
            print(f" Write: {link['write']}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the type of link between two processors.
def amdsmi_topo_get_link_type(
processor_handle_src: processor_handle,
processor_handle_dst: processor_handle
) -> Dict[str, int]:
"""
Get the link type and hop count between two processors.
Determines the type of interconnect (XGMI, PCIe, internal, etc.) between two processors
and the number of hops required for communication.
Parameters:
- processor_handle_src (processor_handle): Handle for the source processor
- processor_handle_dst (processor_handle): Handle for the destination processor
Returns:
- Dict[str, int]: Dictionary containing:
- "hops" (int): Number of hops between processors
- "type" (int): Link type enum value (see AmdSmiLinkType)
Raises:
- AmdSmiParameterException: If either processor handle is not valid
- AmdSmiLibraryException: If unable to retrieve link type
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) >= 2:
        link_info = amdsmi.amdsmi_topo_get_link_type(processors[0], processors[1])
        print(f"Link hops: {link_info['hops']}")
        print(f"Link type: {link_info['type']}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the peer-to-peer (P2P) status and capabilities between two processors.
def amdsmi_topo_get_p2p_status(
processor_handle_src: processor_handle,
processor_handle_dst: processor_handle
) -> Dict[str, Any]:
"""
Get P2P (peer-to-peer) status and capabilities between two processors.
Returns detailed information about P2P connectivity capabilities including coherency,
atomics support, DMA capabilities, and bi-directional communication support.
Parameters:
- processor_handle_src (processor_handle): Handle for the source processor
- processor_handle_dst (processor_handle): Handle for the destination processor
Returns:
- Dict[str, Any]: Dictionary containing:
- "type" (int): P2P connection type
- "cap" (Dict[str, bool]): Capability flags dictionary with:
- "is_iolink_coherent" (bool): Whether the I/O link is cache coherent
- "is_iolink_atomics_32bit" (bool): 32-bit atomic operations supported
- "is_iolink_atomics_64bit" (bool): 64-bit atomic operations supported
- "is_iolink_dma" (bool): DMA (Direct Memory Access) supported
- "is_iolink_bi_directional" (bool): Bi-directional transfers supported
Raises:
- AmdSmiParameterException: If either processor handle is not valid
- AmdSmiLibraryException: If unable to retrieve P2P status
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) >= 2:
        p2p = amdsmi.amdsmi_topo_get_p2p_status(processors[0], processors[1])
        print("P2P capabilities between GPU 0 and GPU 1:")
        print(f" Coherent: {p2p['cap']['is_iolink_coherent']}")
        print(f" 32-bit atomics: {p2p['cap']['is_iolink_atomics_32bit']}")
        print(f" 64-bit atomics: {p2p['cap']['is_iolink_atomics_64bit']}")
        print(f" DMA: {p2p['cap']['is_iolink_dma']}")
        print(f" Bi-directional: {p2p['cap']['is_iolink_bi_directional']}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Check if P2P access is possible between two processors.
def amdsmi_is_P2P_accessible(
processor_handle_src: processor_handle,
processor_handle_dst: processor_handle
) -> bool:
"""
Check whether P2P (peer-to-peer) access is possible between two processors.
Returns a simple boolean indicating whether direct P2P memory access is possible
between the two specified processors. This is useful for quickly determining if
direct GPU-to-GPU communication is available.
Parameters:
- processor_handle_src (processor_handle): Handle for the source processor
- processor_handle_dst (processor_handle): Handle for the destination processor
Returns:
- bool: True if P2P access is available, False otherwise
Raises:
- AmdSmiParameterException: If either processor handle is not valid
- AmdSmiLibraryException: If unable to determine P2P accessibility
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) >= 2:
        accessible = amdsmi.amdsmi_is_P2P_accessible(processors[0], processors[1])
        if accessible:
            print("P2P access is available between GPU 0 and GPU 1")
        else:
            print("P2P access is NOT available between GPU 0 and GPU 1")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get XGMI (AMD Infinity Fabric) link information for a processor.
def amdsmi_get_xgmi_info(processor_handle: processor_handle) -> Dict[str, int]:
"""
Get XGMI (AMD Infinity Fabric) information for a processor.
XGMI is AMD's high-speed interconnect technology that enables efficient GPU-to-GPU
communication. This function returns XGMI-specific identifiers and configuration.
Parameters:
- processor_handle (processor_handle): Handle for the processor to query
Returns:
- Dict[str, int]: Dictionary containing:
- "xgmi_lanes" (int): Number of XGMI lanes
- "xgmi_hive_id" (int): XGMI hive identifier (GPUs in the same hive can communicate)
- "xgmi_node_id" (int): Unique node ID within the hive
- "index" (int): Index of the device
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve XGMI information
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        xgmi = amdsmi.amdsmi_get_xgmi_info(processor)
        print(f"GPU {i} XGMI info:")
        print(f" Lanes: {xgmi['xgmi_lanes']}")
        print(f" Hive ID: {xgmi['xgmi_hive_id']}")
        print(f" Node ID: {xgmi['xgmi_node_id']}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get nearest processors of a specific link type.
def amdsmi_get_link_topology_nearest(
processor_handle: processor_handle,
link_type: AmdSmiLinkType
) -> Dict[str, List[processor_handle]]:
"""
Get the list of nearest processors connected by a specific link type.
Returns a list of processor handles that are connected to the specified processor
via the requested link type (XGMI, PCIe, etc.). This helps identify topology
neighborhoods for optimization.
Parameters:
- processor_handle (processor_handle): Handle for the processor to query
- link_type (AmdSmiLinkType): Type of link to search for (XGMI, PCIe, etc.)
Returns:
- Dict[str, List[processor_handle]]: Dictionary containing:
- "processor_list" (List[processor_handle]): List of connected processor handles
Raises:
- AmdSmiParameterException: If processor_handle or link_type is not valid
- AmdSmiLibraryException: If unable to retrieve topology information
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if processors:
        # Find all GPUs connected via XGMI
        nearest = amdsmi.amdsmi_get_link_topology_nearest(
            processors[0],
            amdsmi.AmdSmiLinkType.AMDSMI_LINK_TYPE_XGMI
        )
        print(f"Found {len(nearest['processor_list'])} XGMI-connected GPUs")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the BDF (Bus:Device.Function) ID as an integer for a GPU.
def amdsmi_get_gpu_bdf_id(processor_handle: processor_handle) -> int:
"""
Get the BDF (Bus:Device.Function) identifier as a 64-bit integer.
Returns the PCI BDF address encoded as an integer value. This is an alternative
to amdsmi_get_gpu_device_bdf() which returns a formatted string.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- int: BDF ID encoded as a 64-bit integer
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve BDF ID
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for processor in processors:
        bdf_id = amdsmi.amdsmi_get_gpu_bdf_id(processor)
        print(f"GPU BDF ID: 0x{bdf_id:016x}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the PCIe bandwidth configuration for a GPU.
def amdsmi_get_gpu_pci_bandwidth(processor_handle: processor_handle) -> Dict[str, Any]:
"""
Get PCIe bandwidth configuration including transfer rates and lanes.
Returns information about the GPU's PCIe connection, including supported and current
transfer rates (PCIe generations) and lane configurations.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- Dict[str, Any]: Dictionary containing:
- "transfer_rate" (Dict): Transfer rate information with:
- "num_supported" (int): Number of supported transfer rates
- "current" (int): Current transfer rate index
- "frequency" (List[int]): List of supported frequencies
- "lanes" (List[int]): PCIe lane configurations
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve bandwidth information
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for processor in processors:
        bw = amdsmi.amdsmi_get_gpu_pci_bandwidth(processor)
        print("PCIe bandwidth:")
        print(f" Current rate index: {bw['transfer_rate']['current']}")
        print(f" Supported rates: {bw['transfer_rate']['num_supported']}")
        print(f" Frequencies: {bw['transfer_rate']['frequency']}")
        print(f" Lane configs: {bw['lanes']}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get current PCIe throughput statistics for a GPU.
def amdsmi_get_gpu_pci_throughput(processor_handle: processor_handle) -> Dict[str, int]:
"""
Get current PCIe throughput statistics including sent and received data.
Returns real-time PCIe data transfer statistics, useful for monitoring actual
PCIe bus utilization and identifying potential bottlenecks.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- Dict[str, int]: Dictionary containing:
- "sent" (int): Number of bytes sent over PCIe
- "received" (int): Number of bytes received over PCIe
- "max_pkt_sz" (int): Maximum packet size in bytes
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve throughput information
Example:
```python
import amdsmi
import time
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if processors:
        # Measure throughput over time by sampling twice, one second apart.
        # This assumes "sent" and "received" are cumulative byte counts; if the
        # library already reports per-second values, use a single sample directly.
        throughput1 = amdsmi.amdsmi_get_gpu_pci_throughput(processors[0])
        time.sleep(1)
        throughput2 = amdsmi.amdsmi_get_gpu_pci_throughput(processors[0])
        sent_rate = (throughput2['sent'] - throughput1['sent']) / 1e6
        recv_rate = (throughput2['received'] - throughput1['received']) / 1e6
        print("PCIe throughput (MB/s):")
        print(f" Sent: {sent_rate:.2f}")
        print(f" Received: {recv_rate:.2f}")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the PCIe replay counter for a GPU.
def amdsmi_get_gpu_pci_replay_counter(processor_handle: processor_handle) -> int:
"""
Get the PCIe replay counter value for a GPU.
The PCIe replay counter tracks the number of packet retransmissions on the PCIe bus.
A high or increasing replay count may indicate signal integrity issues, electrical
problems, or other PCIe link quality concerns.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- int: Number of PCIe packet replays
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve replay counter
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        replay_count = amdsmi.amdsmi_get_gpu_pci_replay_counter(processor)
        print(f"GPU {i} PCIe replay count: {replay_count}")
        if replay_count > 0:
            print(f" Warning: GPU {i} has PCIe replay events")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the NUMA affinity for a GPU (alternative to amdsmi_topo_get_numa_node_number).
def amdsmi_get_gpu_topo_numa_affinity(processor_handle: processor_handle) -> int:
"""
Get the NUMA affinity (node number) for a GPU.
This is an alternative function to amdsmi_topo_get_numa_node_number() that specifically
targets GPU devices. Returns the NUMA node that the GPU is closest to, which is critical
for optimizing memory allocation in NUMA systems.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- int: NUMA node number (-1 if not applicable or unknown)
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve NUMA affinity
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        numa_node = amdsmi.amdsmi_get_gpu_topo_numa_affinity(processor)
        if numa_node >= 0:
            print(f"GPU {i} is on NUMA node {numa_node}")
        else:
            print(f"GPU {i} has no NUMA affinity")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Get the XGMI error status for a GPU.
def amdsmi_gpu_xgmi_error_status(processor_handle: processor_handle) -> str:
"""
Get the XGMI error status for a GPU.
Checks for XGMI link errors on the specified GPU. XGMI errors can indicate
hardware issues, signal integrity problems, or other connectivity concerns
in multi-GPU XGMI configurations.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- str: Name of the matching AmdSmiXgmiStatus enum member, one of:
- "NO_ERRORS": No XGMI errors detected
- "ERROR": Single error detected
- "MULTIPLE_ERRORS": Multiple errors detected
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to retrieve XGMI error status
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        print(f"GPU {i} XGMI status: {status}")
        if status != "NO_ERRORS":
            print(f" Warning: GPU {i} has XGMI errors!")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Reset (clear) XGMI error counters for a GPU.
def amdsmi_reset_gpu_xgmi_error(processor_handle: processor_handle) -> None:
"""
Reset (clear) XGMI error counters for a GPU.
Clears the XGMI error status for the specified GPU. This is useful after
acknowledging and addressing XGMI errors, allowing fresh monitoring of
link health.
Parameters:
- processor_handle (processor_handle): Handle for the GPU processor
Returns:
- None
Raises:
- AmdSmiParameterException: If processor_handle is not valid
- AmdSmiLibraryException: If unable to reset XGMI errors
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for processor in processors:
        # Check for errors
        status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        if status != "NO_ERRORS":
            print(f"Detected XGMI errors: {status}")
            # Clear the errors
            amdsmi.amdsmi_reset_gpu_xgmi_error(processor)
            print("XGMI errors cleared")
finally:
    amdsmi.amdsmi_shut_down()
```
"""

Link type enumeration for topology queries.
class AmdSmiLinkType(IntEnum):
    """
    Enumeration of link types between processors.
    Used to specify or identify the type of interconnect between processors
    in topology queries.
    """
    AMDSMI_LINK_TYPE_INTERNAL = ...  # Internal/on-chip connection
    AMDSMI_LINK_TYPE_XGMI = ...  # XGMI (AMD Infinity Fabric) connection
    AMDSMI_LINK_TYPE_PCIE = ...  # PCIe connection
    AMDSMI_LINK_TYPE_NOT_APPLICABLE = ...  # No connection or not applicable
    AMDSMI_LINK_TYPE_UNKNOWN = ...  # Unknown link type

XGMI error status enumeration.
class AmdSmiXgmiStatus(IntEnum):
    """
    Enumeration of XGMI error status values.
    Indicates the error state of XGMI links on a GPU.
    """
    NO_ERRORS = ...  # No XGMI errors detected
    ERROR = ...  # Single XGMI error detected
    MULTIPLE_ERRORS = ...  # Multiple XGMI errors detected

Discover and analyze the topology of a multi-GPU system:

```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    print(f"Found {len(processors)} GPUs\n")
    # Print topology information for each GPU
    for i, processor in enumerate(processors):
        print(f"GPU {i}:")
        # Get BDF and NUMA info
        bdf = amdsmi.amdsmi_get_gpu_device_bdf(processor)
        numa_node = amdsmi.amdsmi_get_gpu_topo_numa_affinity(processor)
        print(f" BDF: {bdf}")
        print(f" NUMA Node: {numa_node}")
        # Get XGMI info
        xgmi = amdsmi.amdsmi_get_xgmi_info(processor)
        print(f" XGMI Hive ID: {xgmi['xgmi_hive_id']}")
        print(f" XGMI Node ID: {xgmi['xgmi_node_id']}")
        print(f" XGMI Lanes: {xgmi['xgmi_lanes']}")
        # Check XGMI status
        xgmi_status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        print(f" XGMI Status: {xgmi_status}")
        print()
    # Analyze GPU-to-GPU connectivity
    if len(processors) >= 2:
        print("GPU-to-GPU Connectivity:")
        for i in range(len(processors)):
            for j in range(i + 1, len(processors)):
                print(f"\nGPU {i} <-> GPU {j}:")
                # Check P2P accessibility
                accessible = amdsmi.amdsmi_is_P2P_accessible(
                    processors[i], processors[j]
                )
                print(f" P2P Accessible: {accessible}")
                # Get link type
                link_info = amdsmi.amdsmi_topo_get_link_type(
                    processors[i], processors[j]
                )
                print(f" Link Type: {link_info['type']}")
                print(f" Hops: {link_info['hops']}")
                # Get link weight
                weight = amdsmi.amdsmi_topo_get_link_weight(
                    processors[i], processors[j]
                )
                print(f" Link Weight: {weight}")
                # Get bandwidth
                bw = amdsmi.amdsmi_get_minmax_bandwidth_between_processors(
                    processors[i], processors[j]
                )
                print(f" Min Bandwidth: {bw['min_bandwidth']} MB/s")
                print(f" Max Bandwidth: {bw['max_bandwidth']} MB/s")
finally:
    amdsmi.amdsmi_shut_down()
```

Monitor PCIe throughput and link quality:
```python
import amdsmi
import time
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        print(f"GPU {i} PCIe Information:")
        # Get bandwidth configuration
        bw = amdsmi.amdsmi_get_gpu_pci_bandwidth(processor)
        print(f" Current PCIe rate index: {bw['transfer_rate']['current']}")
        print(f" Supported rates: {bw['transfer_rate']['num_supported']}")
        # Get BDF ID
        bdf_id = amdsmi.amdsmi_get_gpu_bdf_id(processor)
        print(f" BDF ID: 0x{bdf_id:016x}")
        # Get replay counter
        replay_count = amdsmi.amdsmi_get_gpu_pci_replay_counter(processor)
        print(f" PCIe Replay Count: {replay_count}")
        if replay_count > 0:
            print(" WARNING: PCIe link quality issues detected!")
        # Measure throughput over 1 second.
        # This assumes "sent" and "received" are cumulative byte counts; if the
        # library already reports per-second values, use a single sample directly.
        throughput1 = amdsmi.amdsmi_get_gpu_pci_throughput(processor)
        time.sleep(1.0)
        throughput2 = amdsmi.amdsmi_get_gpu_pci_throughput(processor)
        sent_rate = (throughput2['sent'] - throughput1['sent']) / 1e6
        recv_rate = (throughput2['received'] - throughput1['received']) / 1e6
        print(" PCIe Throughput (MB/s):")
        print(f" Sent: {sent_rate:.2f}")
        print(f" Received: {recv_rate:.2f}")
        print()
finally:
    amdsmi.amdsmi_shut_down()
```

Monitor XGMI link metrics and health:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    for i, processor in enumerate(processors):
        print(f"GPU {i} XGMI Link Metrics:")
        # Get link metrics
        metrics = amdsmi.amdsmi_get_link_metrics(processor)
        print(f" Number of links: {metrics['num_links']}")
        for link_idx in range(metrics['num_links']):
            link = metrics['links'][link_idx]
            print(f"\n Link {link_idx}:")
            print(f" Connected to BDF: {link['bdf']}")
            print(f" Max Bandwidth: {link['max_bandwidth']} MB/s")
            # Units for "read" and "write" are not stated in the reference above.
            print(f" Current Read: {link['read']}")
            print(f" Current Write: {link['write']}")
            print(f" Bit Rate: {link['bit_rate']}")
        # Check XGMI error status
        status = amdsmi.amdsmi_gpu_xgmi_error_status(processor)
        print(f"\n XGMI Error Status: {status}")
        if status != "NO_ERRORS":
            print(f" WARNING: XGMI errors detected on GPU {i}!")
            # Optionally clear errors
            # amdsmi.amdsmi_reset_gpu_xgmi_error(processor)
        print()
finally:
    amdsmi.amdsmi_shut_down()
```

Analyze P2P capabilities in detail:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if len(processors) < 2:
        print("Need at least 2 GPUs for P2P analysis")
    else:
        print("P2P Capability Matrix:\n")
        # Create matrix header (padding chosen so columns line up with the rows)
        print("      ", end="")
        for i in range(len(processors)):
            print(f"GPU{i:2d} ", end="")
        print()
        # Print P2P accessibility matrix
        for i in range(len(processors)):
            print(f"GPU{i:2d}", end="")
            for j in range(len(processors)):
                if i == j:
                    print("   -  ", end="")
                else:
                    accessible = amdsmi.amdsmi_is_P2P_accessible(
                        processors[i], processors[j]
                    )
                    print(f"   {'Y' if accessible else 'N'}  ", end="")
            print()
        # Detailed P2P capabilities for first pair
        print("\nDetailed P2P capabilities: GPU 0 <-> GPU 1")
        p2p = amdsmi.amdsmi_topo_get_p2p_status(processors[0], processors[1])
        print(f" Cache Coherent: {p2p['cap']['is_iolink_coherent']}")
        print(f" 32-bit Atomics: {p2p['cap']['is_iolink_atomics_32bit']}")
        print(f" 64-bit Atomics: {p2p['cap']['is_iolink_atomics_64bit']}")
        print(f" DMA Support: {p2p['cap']['is_iolink_dma']}")
        print(f" Bi-directional: {p2p['cap']['is_iolink_bi_directional']}")
finally:
    amdsmi.amdsmi_shut_down()
```

Find all GPUs connected via XGMI to a specific GPU:
```python
import amdsmi
amdsmi.amdsmi_init()
try:
    processors = amdsmi.amdsmi_get_processor_handles()
    if processors:
        gpu_0 = processors[0]
        print("Finding XGMI neighbors of GPU 0...")
        # Get nearest XGMI-connected processors
        nearest = amdsmi.amdsmi_get_link_topology_nearest(
            gpu_0,
            amdsmi.AmdSmiLinkType.AMDSMI_LINK_TYPE_XGMI
        )
        xgmi_neighbors = nearest['processor_list']
        print(f"Found {len(xgmi_neighbors)} XGMI-connected GPUs")
        # Get details about each neighbor
        for neighbor in xgmi_neighbors:
            bdf = amdsmi.amdsmi_get_gpu_device_bdf(neighbor)
            xgmi_info = amdsmi.amdsmi_get_xgmi_info(neighbor)
            print(f" BDF: {bdf}")
            print(f" Node ID: {xgmi_info['xgmi_node_id']}")
            print(f" Lanes: {xgmi_info['xgmi_lanes']}")
finally:
    amdsmi.amdsmi_shut_down()
```
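
XGMI neighborhoods can also be derived from the `xgmi_hive_id` values returned by `amdsmi_get_xgmi_info()`, since GPUs in the same hive can communicate. The following is a minimal sketch of that grouping step. `group_by_hive` is a hypothetical helper, not a library function, and the sample dictionaries are invented values that only follow the key names documented for `amdsmi_get_xgmi_info()`.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_hive(xgmi_infos: List[Dict[str, int]]) -> Dict[int, List[int]]:
    """Map each xgmi_hive_id to the list positions of the GPUs reporting it.

    `xgmi_infos` is assumed to hold one amdsmi_get_xgmi_info() result per GPU,
    in the same order as the processor handle list.
    """
    hives: Dict[int, List[int]] = defaultdict(list)
    for gpu_index, info in enumerate(xgmi_infos):
        hives[info["xgmi_hive_id"]].append(gpu_index)
    return dict(hives)


# Invented sample data for illustration only (not real hardware output).
sample_infos = [
    {"xgmi_lanes": 16, "xgmi_hive_id": 0xA1, "xgmi_node_id": 1, "index": 0},
    {"xgmi_lanes": 16, "xgmi_hive_id": 0xA1, "xgmi_node_id": 2, "index": 1},
    {"xgmi_lanes": 16, "xgmi_hive_id": 0xB2, "xgmi_node_id": 1, "index": 2},
]

hives = group_by_hive(sample_infos)
for hive_id, members in hives.items():
    print(f"Hive 0x{hive_id:x}: GPUs {members}")
```

On a real system the input list would be built with `[amdsmi.amdsmi_get_xgmi_info(p) for p in processors]`. Whether a particular hive ID value signals "no hive" is not covered here and should be checked against the library documentation.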