GPU Performance Counters

The hardware performance counter system monitors low-level GPU metrics, including traffic statistics for XGMI (AMD's high-speed GPU-to-GPU interconnect). Performance counters provide fine-grained visibility into GPU subsystem activity for profiling and optimization.

Capabilities

Counter Group Support

Check if a GPU supports a specific performance counter group before attempting to use it.

def amdsmi_gpu_counter_group_supported(
    processor_handle: processor_handle,
    event_group: AmdSmiEventGroup
) -> None:
    """
    Check if a performance counter group is supported by the GPU.

    Verifies that the specified event group is available on the target GPU hardware.
    This should be called before attempting to create counters from a specific group.

    Parameters:
    - processor_handle: Handle for the target GPU device
    - event_group (AmdSmiEventGroup): The counter group to check. Valid values:
      - AmdSmiEventGroup.XGMI: XGMI link activity counters
      - AmdSmiEventGroup.XGMI_DATA_OUT: XGMI data transmission counters

    Returns:
    - None: Function returns successfully if the group is supported

    Raises:
    - AmdSmiParameterException: If processor_handle or event_group is invalid
    - AmdSmiLibraryException: If the counter group is not supported or on query failure

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiEventGroup

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Check if XGMI counters are supported
    try:
        amdsmi.amdsmi_gpu_counter_group_supported(device, AmdSmiEventGroup.XGMI)
        print("XGMI counters are supported")
    except amdsmi.AmdSmiLibraryException as e:
        print(f"XGMI counters not supported: {e}")

    amdsmi.amdsmi_shut_down()
    ```
    """
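
Because the check signals an unsupported group by raising rather than by returning a value, a small boolean wrapper can make call sites easier to read. The following is a minimal sketch, not part of the amdsmi API: `counter_group_is_supported` is a hypothetical helper, and `smi` is assumed to be the imported `amdsmi` module, passed in as an argument so the helper can also be exercised with a stand-in object on a machine without AMD hardware.

```python
def counter_group_is_supported(smi, processor_handle, event_group) -> bool:
    """Hypothetical helper: turn the raise-on-unsupported check into a bool.

    smi is assumed to be the imported amdsmi module (or a test stand-in
    exposing the same names).
    """
    try:
        smi.amdsmi_gpu_counter_group_supported(processor_handle, event_group)
    except smi.AmdSmiLibraryException:
        # Documented above: raised when the group is unsupported
        # or when the query itself fails.
        return False
    # AmdSmiParameterException is deliberately not caught: it points to a
    # bad argument in the caller rather than to missing hardware support.
    return True
```

With the real library this would be called as `counter_group_is_supported(amdsmi, device, AmdSmiEventGroup.XGMI)`.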

Counter Availability

Query the number of available counters in a specific group.

def amdsmi_get_gpu_available_counters(
    processor_handle: processor_handle,
    event_group: AmdSmiEventGroup
) -> int:
    """
    Get the number of available performance counters in a group.

    Returns the count of hardware counters that can be simultaneously created
    from the specified event group. This helps determine resource limits before
    creating multiple counters.

    Parameters:
    - processor_handle: Handle for the target GPU device
    - event_group (AmdSmiEventGroup): The counter group to query. Valid values:
      - AmdSmiEventGroup.XGMI: XGMI link activity counters
      - AmdSmiEventGroup.XGMI_DATA_OUT: XGMI data transmission counters

    Returns:
    - int: Number of available counters in the specified group

    Raises:
    - AmdSmiParameterException: If processor_handle or event_group is invalid
    - AmdSmiLibraryException: On query failure

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiEventGroup

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Check how many XGMI counters are available
    xgmi_count = amdsmi.amdsmi_get_gpu_available_counters(
        device,
        AmdSmiEventGroup.XGMI
    )
    print(f"Available XGMI counters: {xgmi_count}")

    xgmi_data_count = amdsmi.amdsmi_get_gpu_available_counters(
        device,
        AmdSmiEventGroup.XGMI_DATA_OUT
    )
    print(f"Available XGMI data out counters: {xgmi_data_count}")

    amdsmi.amdsmi_shut_down()
    ```
    """
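
The returned count can be used to trim a list of requested events before any counter is created. The sketch below assumes, as described above, that the count is the number of counters that can exist simultaneously in the group. `split_by_capacity` is a hypothetical helper, and `smi` stands for the imported `amdsmi` module passed in as a parameter.

```python
def split_by_capacity(smi, processor_handle, event_group, event_types):
    """Hypothetical helper: split requested events into (fits, deferred).

    Assumes every entry in event_types belongs to event_group.
    """
    available = smi.amdsmi_get_gpu_available_counters(processor_handle, event_group)
    wanted = list(event_types)
    limit = max(available, 0)  # guard against an unexpected negative count
    # The first `limit` events can be created now; the rest must wait
    # until earlier counters have been destroyed.
    return wanted[:limit], wanted[limit:]
```

The deferred events can be measured in a second pass once the first batch of counters has been destroyed.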

Counter Creation

Create a performance counter for a specific event type.

def amdsmi_gpu_create_counter(
    processor_handle: processor_handle,
    event_type: AmdSmiEventType
) -> amdsmi_event_handle_t:
    """
    Create a performance counter for monitoring a specific event.

    Allocates a hardware counter resource to track the specified event type.
    The returned event handle is used for all subsequent counter operations.
    Counters must be destroyed with amdsmi_gpu_destroy_counter when no longer needed.

    Parameters:
    - processor_handle: Handle for the target GPU device
    - event_type (AmdSmiEventType): The specific event to monitor. Valid values include:
      - XGMI link 0 events:
        - AmdSmiEventType.XGMI_0_NOP_TX: NOP transactions transmitted
        - AmdSmiEventType.XGMI_0_REQUEST_TX: Request packets transmitted
        - AmdSmiEventType.XGMI_0_RESPONSE_TX: Response packets transmitted
        - AmdSmiEventType.XGMI_0_BEATS_TX: Data beats transmitted
      - XGMI link 1 events:
        - AmdSmiEventType.XGMI_1_NOP_TX: NOP transactions transmitted
        - AmdSmiEventType.XGMI_1_REQUEST_TX: Request packets transmitted
        - AmdSmiEventType.XGMI_1_RESPONSE_TX: Response packets transmitted
        - AmdSmiEventType.XGMI_1_BEATS_TX: Data beats transmitted
      - XGMI data output events (links 0-5):
        - AmdSmiEventType.XGMI_DATA_OUT_0 through XGMI_DATA_OUT_5

    Returns:
    - amdsmi_event_handle_t: Handle to the created counter, used for control and read operations

    Raises:
    - AmdSmiParameterException: If processor_handle or event_type is invalid
    - AmdSmiLibraryException: If counter creation fails (e.g., no available counters)

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiEventType

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Create a counter for XGMI link 0 request transmissions
    counter = amdsmi.amdsmi_gpu_create_counter(
        device,
        AmdSmiEventType.XGMI_0_REQUEST_TX
    )
    print(f"Created counter: {counter}")

    # Remember to destroy the counter when done
    amdsmi.amdsmi_gpu_destroy_counter(counter)
    amdsmi.amdsmi_shut_down()
    ```
    """
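
Since creation can fail partway through a batch (for example when the group runs out of counters), it can help to release the handles already obtained before re-raising. A minimal sketch under that assumption follows. `create_counters_or_rollback` is a hypothetical name, and `smi` is the imported `amdsmi` module passed in as a parameter.

```python
def create_counters_or_rollback(smi, processor_handle, event_types):
    """Hypothetical helper: create one counter per event type, all or nothing."""
    handles = []
    try:
        for event_type in event_types:
            handles.append(
                smi.amdsmi_gpu_create_counter(processor_handle, event_type)
            )
    except Exception:
        # Release whatever was created so no hardware counter is leaked,
        # then let the original error propagate to the caller.
        for handle in reversed(handles):
            smi.amdsmi_gpu_destroy_counter(handle)
        raise
    return handles
```

On success the caller owns every returned handle and is still responsible for destroying each one.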

Counter Control

Start and stop performance counter data collection.

def amdsmi_gpu_control_counter(
    event_handle: amdsmi_event_handle_t,
    counter_command: AmdSmiCounterCommand
) -> None:
    """
    Control performance counter operation (start/stop).

    Sends a control command to a performance counter to begin or end data collection.
    Counters must be started before reading values and should be stopped when
    collection is complete.

    Parameters:
    - event_handle (amdsmi_event_handle_t): Handle returned by amdsmi_gpu_create_counter
    - counter_command (AmdSmiCounterCommand): Command to execute. Valid values:
      - AmdSmiCounterCommand.CMD_START: Begin counter data collection
      - AmdSmiCounterCommand.CMD_STOP: Stop counter data collection

    Returns:
    - None: Function completes successfully

    Raises:
    - AmdSmiParameterException: If event_handle or counter_command is invalid
    - AmdSmiLibraryException: If the control operation fails

    Example:
    ```python
    import amdsmi
    import time
    from amdsmi import AmdSmiEventType, AmdSmiCounterCommand

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Create and start a counter
    counter = amdsmi.amdsmi_gpu_create_counter(
        device,
        AmdSmiEventType.XGMI_0_BEATS_TX
    )

    # Start collecting data
    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_START)
    print("Counter started")

    # Collect data for 5 seconds
    time.sleep(5)

    # Stop the counter
    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_STOP)
    print("Counter stopped")

    # Clean up
    amdsmi.amdsmi_gpu_destroy_counter(counter)
    amdsmi.amdsmi_shut_down()
    ```
    """

Counter Reading

Read the current value and timing information from a performance counter.

def amdsmi_gpu_read_counter(
    event_handle: amdsmi_event_handle_t
) -> Dict[str, Any]:
    """
    Read the current value from a performance counter.

    Retrieves the accumulated counter value along with timing information about
    how long the counter has been enabled and actively counting. This allows
    calculation of average rates over time.

    Parameters:
    - event_handle (amdsmi_event_handle_t): Handle returned by amdsmi_gpu_create_counter

    Returns:
    - dict: Dictionary containing counter information:
      - value (int): Accumulated counter value (event count)
      - time_enabled (int): Total time counter has been enabled (nanoseconds)
      - time_running (int): Total time counter has been actively counting (nanoseconds)

    Raises:
    - AmdSmiParameterException: If event_handle is invalid
    - AmdSmiLibraryException: On read failure

    Example:
    ```python
    import amdsmi
    import time
    from amdsmi import AmdSmiEventType, AmdSmiCounterCommand

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    counter = amdsmi.amdsmi_gpu_create_counter(
        device,
        AmdSmiEventType.XGMI_0_BEATS_TX
    )

    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_START)
    time.sleep(2)

    # Read the counter
    data = amdsmi.amdsmi_gpu_read_counter(counter)
    print(f"Counter value: {data['value']}")
    print(f"Time enabled: {data['time_enabled'] / 1e9:.3f} seconds")
    print(f"Time running: {data['time_running'] / 1e9:.3f} seconds")

    # Calculate average rate
    if data['time_running'] > 0:
        rate = data['value'] / (data['time_running'] / 1e9)
        print(f"Average rate: {rate:.2f} events/second")

    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_STOP)
    amdsmi.amdsmi_gpu_destroy_counter(counter)
    amdsmi.amdsmi_shut_down()
    ```
    """
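
The two timing fields can also be compared with each other. The sketch below converts a reading to seconds and reports what fraction of the enabled time the counter was actually counting. `summarize_reading` is a hypothetical helper that relies only on the three dictionary keys documented above.

```python
def summarize_reading(data):
    """Hypothetical helper: derive seconds, rate, and running fraction."""
    enabled_s = data["time_enabled"] / 1e9  # nanoseconds -> seconds
    running_s = data["time_running"] / 1e9
    return {
        "events": data["value"],
        "enabled_s": enabled_s,
        "running_s": running_s,
        # Average rate over the time the counter was actually counting.
        "events_per_s": data["value"] / running_s if running_s > 0 else 0.0,
        # 1.0 means the counter counted for the whole enabled window.
        "running_fraction": running_s / enabled_s if enabled_s > 0 else 0.0,
    }
```

A running fraction noticeably below 1.0 indicates that the counter was enabled for longer than it was counting, which is worth knowing before comparing raw values across counters.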

Counter Destruction

Release performance counter resources.

def amdsmi_gpu_destroy_counter(event_handle: amdsmi_event_handle_t) -> None:
    """
    Destroy a performance counter and release its resources.

    Frees the hardware counter resource allocated by amdsmi_gpu_create_counter.
    The counter should be stopped before destruction. After calling this function,
    the event_handle becomes invalid and should not be used.

    Parameters:
    - event_handle (amdsmi_event_handle_t): Handle returned by amdsmi_gpu_create_counter

    Returns:
    - None: Function completes successfully

    Raises:
    - AmdSmiParameterException: If event_handle is invalid
    - AmdSmiLibraryException: If destruction fails

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiEventType, AmdSmiCounterCommand

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Create and use a counter
    counter = amdsmi.amdsmi_gpu_create_counter(
        device,
        AmdSmiEventType.XGMI_0_REQUEST_TX
    )
    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_START)
    # ... perform operations ...
    amdsmi.amdsmi_gpu_control_counter(counter, AmdSmiCounterCommand.CMD_STOP)

    # Clean up the counter
    amdsmi.amdsmi_gpu_destroy_counter(counter)
    print("Counter destroyed and resources released")

    amdsmi.amdsmi_shut_down()
    ```
    """

Enumerations

AmdSmiEventGroup

Event group classifications for performance counters.

class AmdSmiEventGroup(IntEnum):
    """
    Performance counter event group types.

    Event groups categorize related performance counters and determine
    the pool of available counter resources.

    Values:
    - XGMI: XGMI link activity counters (transactions, requests, responses)
    - XGMI_DATA_OUT: XGMI data transmission counters (outbound data volume)
    - GRP_INVALID: Invalid/unsupported group

    Example:
    ```python
    from amdsmi import AmdSmiEventGroup

    # Use XGMI group for link protocol counters
    group = AmdSmiEventGroup.XGMI

    # Use XGMI_DATA_OUT for data volume counters
    data_group = AmdSmiEventGroup.XGMI_DATA_OUT
    ```
    """
    XGMI = amdsmi_wrapper.AMDSMI_EVNT_GRP_XGMI
    XGMI_DATA_OUT = amdsmi_wrapper.AMDSMI_EVNT_GRP_XGMI_DATA_OUT
    GRP_INVALID = amdsmi_wrapper.AMDSMI_EVNT_GRP_INVALID

AmdSmiEventType

Specific event types that can be monitored.

class AmdSmiEventType(IntEnum):
    """
    Specific performance counter event types.

    Defines individual hardware events that can be monitored via performance
    counters. Events are organized by XGMI link and transaction type.

    XGMI Link 0 Events:
    - XGMI_0_NOP_TX: NOP (No Operation) transactions transmitted on link 0
    - XGMI_0_REQUEST_TX: Request packets transmitted on link 0
    - XGMI_0_RESPONSE_TX: Response packets transmitted on link 0
    - XGMI_0_BEATS_TX: Data beats (transfer units) transmitted on link 0

    XGMI Link 1 Events:
    - XGMI_1_NOP_TX: NOP transactions transmitted on link 1
    - XGMI_1_REQUEST_TX: Request packets transmitted on link 1
    - XGMI_1_RESPONSE_TX: Response packets transmitted on link 1
    - XGMI_1_BEATS_TX: Data beats transmitted on link 1

    XGMI Data Output Events:
    - XGMI_DATA_OUT_0 through XGMI_DATA_OUT_5: Data transmitted on links 0-5

    Example:
    ```python
    from amdsmi import AmdSmiEventType

    # Monitor requests on link 0
    event = AmdSmiEventType.XGMI_0_REQUEST_TX

    # Monitor data volume on link 1
    data_event = AmdSmiEventType.XGMI_DATA_OUT_1
    ```
    """
    XGMI_0_NOP_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_0_NOP_TX
    XGMI_0_REQUEST_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_0_REQUEST_TX
    XGMI_0_RESPONSE_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_0_RESPONSE_TX
    XGMI_0_BEATS_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_0_BEATS_TX
    XGMI_1_NOP_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_1_NOP_TX
    XGMI_1_REQUEST_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_1_REQUEST_TX
    XGMI_1_RESPONSE_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_1_RESPONSE_TX
    XGMI_1_BEATS_TX = amdsmi_wrapper.AMDSMI_EVNT_XGMI_1_BEATS_TX
    XGMI_DATA_OUT_0 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_0
    XGMI_DATA_OUT_1 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_1
    XGMI_DATA_OUT_2 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_2
    XGMI_DATA_OUT_3 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_3
    XGMI_DATA_OUT_4 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_4
    XGMI_DATA_OUT_5 = amdsmi_wrapper.AMDSMI_EVNT_XGMI_DATA_OUT_5

AmdSmiCounterCommand

Commands for controlling counter operation.

class AmdSmiCounterCommand(IntEnum):
    """
    Performance counter control commands.

    Commands used to start and stop performance counter data collection.

    Values:
    - CMD_START: Begin counter data collection
    - CMD_STOP: Stop counter data collection

    Example:
    ```python
    from amdsmi import AmdSmiCounterCommand

    # Start a counter
    start_cmd = AmdSmiCounterCommand.CMD_START

    # Stop a counter
    stop_cmd = AmdSmiCounterCommand.CMD_STOP
    ```
    """
    CMD_START = amdsmi_wrapper.AMDSMI_CNTR_CMD_START
    CMD_STOP = amdsmi_wrapper.AMDSMI_CNTR_CMD_STOP

Complete Example: Counter Lifecycle

This example demonstrates the full lifecycle of performance counters, from checking support through creation, usage, and cleanup.

import amdsmi
import time
from amdsmi import (
    AmdSmiEventGroup,
    AmdSmiEventType,
    AmdSmiCounterCommand
)

def monitor_xgmi_traffic(device, duration=5):
    """Monitor XGMI traffic on link 0 for specified duration."""

    # Check if XGMI counters are supported
    try:
        amdsmi.amdsmi_gpu_counter_group_supported(
            device,
            AmdSmiEventGroup.XGMI
        )
        print("XGMI performance counters are supported")
    except amdsmi.AmdSmiLibraryException as e:
        print(f"XGMI counters not supported: {e}")
        return

    # Check available counters
    available = amdsmi.amdsmi_get_gpu_available_counters(
        device,
        AmdSmiEventGroup.XGMI
    )
    print(f"Available XGMI counters: {available}")

    # Create counters for different event types
    counters = {}
    event_types = [
        ("NOP_TX", AmdSmiEventType.XGMI_0_NOP_TX),
        ("REQUEST_TX", AmdSmiEventType.XGMI_0_REQUEST_TX),
        ("RESPONSE_TX", AmdSmiEventType.XGMI_0_RESPONSE_TX),
        ("BEATS_TX", AmdSmiEventType.XGMI_0_BEATS_TX),
    ]

    print("\nCreating counters...")
    for name, event_type in event_types:
        try:
            handle = amdsmi.amdsmi_gpu_create_counter(device, event_type)
            counters[name] = handle
            print(f"  Created {name} counter")
        except Exception as e:
            print(f"  Failed to create {name} counter: {e}")

    if not counters:
        print("No counters created, exiting")
        return

    # Start all counters
    print("\nStarting counters...")
    for name, handle in counters.items():
        try:
            amdsmi.amdsmi_gpu_control_counter(
                handle,
                AmdSmiCounterCommand.CMD_START
            )
            print(f"  Started {name} counter")
        except Exception as e:
            print(f"  Failed to start {name} counter: {e}")

    # Monitor for specified duration
    print(f"\nCollecting data for {duration} seconds...")
    time.sleep(duration)

    # Stop all counters
    print("\nStopping counters...")
    for name, handle in counters.items():
        try:
            amdsmi.amdsmi_gpu_control_counter(
                handle,
                AmdSmiCounterCommand.CMD_STOP
            )
        except Exception as e:
            print(f"  Failed to stop {name} counter: {e}")

    # Read and display results
    print("\nCounter Results:")
    print("-" * 70)
    for name, handle in counters.items():
        try:
            data = amdsmi.amdsmi_gpu_read_counter(handle)
            time_sec = data['time_running'] / 1e9
            rate = data['value'] / time_sec if time_sec > 0 else 0

            print(f"{name:15} Value: {data['value']:15,d}")
            print(f"                Time enabled: {data['time_enabled'] / 1e9:10.3f} sec")
            print(f"                Time running: {time_sec:10.3f} sec")
            print(f"                Average rate: {rate:15,.2f} events/sec")
            print()
        except Exception as e:
            print(f"  Failed to read {name} counter: {e}")

    # Clean up all counters
    print("Cleaning up counters...")
    for name, handle in counters.items():
        try:
            amdsmi.amdsmi_gpu_destroy_counter(handle)
            print(f"  Destroyed {name} counter")
        except Exception as e:
            print(f"  Failed to destroy {name} counter: {e}")


def main():
    """Main function to demonstrate performance counter usage."""
    try:
        # Initialize library
        amdsmi.amdsmi_init()
        print("AMD SMI library initialized\n")

        # Get GPU devices
        devices = amdsmi.amdsmi_get_processor_handles()
        if not devices:
            print("No AMD GPU devices found")
            return

        print(f"Found {len(devices)} GPU device(s)\n")

        # Monitor the first device
        device = devices[0]
        bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
        print(f"Monitoring device: {bdf}")
        print("=" * 70)

        monitor_xgmi_traffic(device, duration=5)

    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Cleanup
        try:
            amdsmi.amdsmi_shut_down()
            print("\nAMD SMI library shut down")
        except Exception:
            pass


if __name__ == "__main__":
    main()

Usage Notes

Counter Resource Management

  1. Check Support First: Always verify counter group support before creating counters
  2. Limited Resources: Hardware has a finite number of performance counters available
  3. Check Availability: Use amdsmi_get_gpu_available_counters() to determine limits
  4. Clean Up: Always destroy counters with amdsmi_gpu_destroy_counter() when done

Counter Operation

  1. Start Before Reading: Counters must be started before they accumulate data
  2. Stop When Done: Stop counters to finalize data collection
  3. Read Anytime: Counters can be read while running for incremental measurements
  4. Timing Information: Use time_running for accurate rate calculations
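
Reading while a counter is running makes interval measurements possible: take two readings and divide the difference in `value` by the difference in `time_running`. The sketch assumes that reading does not reset the counter, so that `value` keeps accumulating between reads. `interval_rate` is a hypothetical helper that operates on two dictionaries returned by amdsmi_gpu_read_counter().

```python
def interval_rate(previous, current):
    """Hypothetical helper: events per second between two readings.

    Both arguments are dictionaries with the 'value' and 'time_running'
    keys documented for amdsmi_gpu_read_counter().
    """
    delta_events = current["value"] - previous["value"]
    delta_ns = current["time_running"] - previous["time_running"]
    if delta_ns <= 0:
        # The counter did not run between the two reads (or the reads
        # were passed in the wrong order), so no rate can be derived.
        return 0.0
    return delta_events / (delta_ns / 1e9)
```

Taking one reading just before a workload and one just after it yields the rate attributable to that workload rather than the average since the counter was started.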

XGMI Counters

The performance counters documented here monitor XGMI, AMD's high-speed GPU-to-GPU interconnect:

  • Link Activity: Track protocol-level traffic (NOPs, requests, responses)
  • Data Volume: Monitor actual data transmission (beats, bytes)
  • Multi-Link: Separate counters for different physical links (0-5)
  • Multi-GPU: Essential for profiling multi-GPU workloads and communication patterns
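
Because the per-link data output events follow a regular naming pattern, they can be looked up by link index instead of being listed by hand. A minimal sketch, assuming the enum passed in exposes the `XGMI_DATA_OUT_0` to `XGMI_DATA_OUT_5` members documented above. `data_out_events_by_link` is a hypothetical helper.

```python
def data_out_events_by_link(event_type_enum, links=range(6)):
    """Hypothetical helper: map link index -> XGMI_DATA_OUT_<index> member."""
    return {
        link: getattr(event_type_enum, f"XGMI_DATA_OUT_{link}")
        for link in links
    }
```

With the real library this would be called as `data_out_events_by_link(AmdSmiEventType)`. Passing a smaller `links` iterable restricts the mapping to the links of interest.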

Best Practices

  1. Error Handling: Always wrap counter operations in try/except blocks
  2. Context Managers: Consider implementing context managers for automatic cleanup
  3. Baseline Measurements: Take baseline readings before workload execution
  4. Sampling Strategy: Sample long enough for rates to stabilize; very short windows produce noisy averages
  5. Resource Cleanup: Use try/finally or context managers to ensure cleanup
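
The context manager suggestion above could be sketched as follows. This is one possible shape under stated assumptions, not something the library provides: `running_counter` is a hypothetical name, and `smi` is assumed to be the imported `amdsmi` module, passed in as a parameter so the sketch can also be tried with a stand-in object.

```python
from contextlib import contextmanager


@contextmanager
def running_counter(smi, processor_handle, event_type):
    """Hypothetical helper: create and start a counter, always clean it up."""
    handle = smi.amdsmi_gpu_create_counter(processor_handle, event_type)
    try:
        smi.amdsmi_gpu_control_counter(
            handle, smi.AmdSmiCounterCommand.CMD_START
        )
        try:
            # The caller reads the counter inside the with-block.
            yield handle
        finally:
            # Stop only if the start above succeeded.
            smi.amdsmi_gpu_control_counter(
                handle, smi.AmdSmiCounterCommand.CMD_STOP
            )
    finally:
        # Destroy even if starting, reading, or stopping raised.
        smi.amdsmi_gpu_destroy_counter(handle)
```

A call site would then look like `with running_counter(amdsmi, device, AmdSmiEventType.XGMI_0_BEATS_TX) as counter:` followed by calls to amdsmi_gpu_read_counter(counter) in the body. The nested try/finally keeps the stop and destroy steps independent, so a failure while stopping does not leak the hardware counter.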

Limitations

  1. XGMI Only: Current counter support is limited to XGMI events
  2. Hardware Dependent: Counter availability varies by GPU model
  3. Multi-GPU Systems: XGMI counters are only relevant in multi-GPU configurations
  4. Profiling Overhead: Counter operations may add a small performance overhead

Related Functions

  • XGMI Information: amdsmi_get_xgmi_info() - Get XGMI configuration details
  • Link Metrics: amdsmi_get_link_metrics() - Get bandwidth and other link metrics
  • Topology: amdsmi_topo_get_link_type() - Identify link types between processors
  • Utilization: amdsmi_get_utilization_count() - Higher-level utilization metrics