GPU Temperature and Thermal Monitoring

Functions for monitoring GPU temperature and voltage metrics. These functions provide access to various thermal sensors and voltage rails on AMD GPUs, including edge temperatures, hotspot temperatures, HBM memory temperatures, and voltage readings for different power domains.

Capabilities

Get Temperature Metric

Query temperature readings from various GPU thermal sensors.

def amdsmi_get_temp_metric(
    processor_handle: processor_handle,
    sensor_type: AmdSmiTemperatureType,
    metric: AmdSmiTemperatureMetric
) -> int:
    """
    Get temperature metric for a specific sensor on the GPU.

    Retrieves temperature readings from various thermal sensors on the GPU, including
    GPU edge, hotspot (junction), VRAM, HBM memory, PLX switch, and board components.
    Temperature values are returned in millidegrees Celsius (m°C).

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query
    - sensor_type (AmdSmiTemperatureType): The temperature sensor to read from.
      Common sensor types:
        - EDGE: GPU edge temperature (die edge)
        - HOTSPOT: GPU hotspot/junction temperature (highest die temperature)
        - JUNCTION: GPU junction temperature (similar to hotspot)
        - VRAM: VRAM temperature
        - HBM_0, HBM_1, HBM_2, HBM_3: Individual HBM memory stack temperatures
        - PLX: PLX switch temperature
        - MEM: Memory temperature
        - GPUBOARD_*: Various board component temperatures
        - BASEBOARD_*: Baseboard system temperatures
    - metric (AmdSmiTemperatureMetric): The temperature metric to retrieve.
      Available metrics:
        - CURRENT: Current temperature reading
        - MAX: Maximum temperature threshold
        - MIN: Minimum temperature threshold
        - MAX_HYST: Maximum threshold hysteresis
        - MIN_HYST: Minimum threshold hysteresis
        - CRITICAL: Critical temperature threshold
        - CRITICAL_HYST: Critical threshold hysteresis
        - EMERGENCY: Emergency temperature threshold
        - EMERGENCY_HYST: Emergency threshold hysteresis
        - CRIT_MIN: Critical minimum temperature threshold
        - CRIT_MIN_HYST: Critical minimum threshold hysteresis
        - OFFSET: Temperature offset value
        - LOWEST: Lowest recorded temperature
        - HIGHEST: Highest recorded temperature

    Returns:
    - int: Temperature value in millidegrees Celsius (m°C).
      Divide by 1000 to get degrees Celsius.

    Raises:
    - AmdSmiParameterException: If any parameter is invalid
    - AmdSmiLibraryException: If unable to retrieve temperature or sensor not supported

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get current edge temperature
        edge_temp = amdsmi.amdsmi_get_temp_metric(
            gpu,
            amdsmi.AmdSmiTemperatureType.EDGE,
            amdsmi.AmdSmiTemperatureMetric.CURRENT
        )
        print(f"GPU Edge Temperature: {edge_temp / 1000.0}°C")

        # Get hotspot temperature
        hotspot_temp = amdsmi.amdsmi_get_temp_metric(
            gpu,
            amdsmi.AmdSmiTemperatureType.HOTSPOT,
            amdsmi.AmdSmiTemperatureMetric.CURRENT
        )
        print(f"GPU Hotspot Temperature: {hotspot_temp / 1000.0}°C")

        # Get critical temperature threshold
        crit_temp = amdsmi.amdsmi_get_temp_metric(
            gpu,
            amdsmi.AmdSmiTemperatureType.EDGE,
            amdsmi.AmdSmiTemperatureMetric.CRITICAL
        )
        print(f"Critical Temperature: {crit_temp / 1000.0}°C")

        # Get VRAM temperature
        vram_temp = amdsmi.amdsmi_get_temp_metric(
            gpu,
            amdsmi.AmdSmiTemperatureType.VRAM,
            amdsmi.AmdSmiTemperatureMetric.CURRENT
        )
        print(f"VRAM Temperature: {vram_temp / 1000.0}°C")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Get GPU Voltage Metric

Query voltage readings from GPU power rails.

def amdsmi_get_gpu_volt_metric(
    processor_handle: processor_handle,
    sensor_type: AmdSmiVoltageType,
    metric: AmdSmiVoltageMetric
) -> int:
    """
    Get voltage metric for a specific voltage rail on the GPU.

    Retrieves voltage readings from GPU power rails. Voltage values are returned
    in millivolts (mV).

    Parameters:
    - processor_handle (processor_handle): Handle for the GPU device to query
    - sensor_type (AmdSmiVoltageType): The voltage rail to read from.
      Available voltage types:
        - VDDGFX: GPU graphics voltage (core voltage)
        - VDDBOARD: Board voltage
        - INVALID: Invalid voltage type (used for error handling)
    - metric (AmdSmiVoltageMetric): The voltage metric to retrieve.
      Available metrics:
        - CURRENT: Current voltage reading
        - MAX: Maximum voltage threshold
        - MIN: Minimum voltage threshold
        - MIN_CRIT: Critical minimum voltage threshold
        - MAX_CRIT: Critical maximum voltage threshold
        - AVERAGE: Average voltage
        - LOWEST: Lowest recorded voltage
        - HIGHEST: Highest recorded voltage

    Returns:
    - int: Voltage value in millivolts (mV).
      Divide by 1000 to get volts (V).

    Raises:
    - AmdSmiParameterException: If any parameter is invalid
    - AmdSmiLibraryException: If unable to retrieve voltage or sensor not supported

    Example:
    ```python
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        devices = amdsmi.amdsmi_get_processor_handles()
        gpu = devices[0]

        # Get current GPU core voltage
        vddgfx = amdsmi.amdsmi_get_gpu_volt_metric(
            gpu,
            amdsmi.AmdSmiVoltageType.VDDGFX,
            amdsmi.AmdSmiVoltageMetric.CURRENT
        )
        print(f"GPU Core Voltage (VDDGFX): {vddgfx / 1000.0}V ({vddgfx}mV)")

        # Get average voltage
        avg_voltage = amdsmi.amdsmi_get_gpu_volt_metric(
            gpu,
            amdsmi.AmdSmiVoltageType.VDDGFX,
            amdsmi.AmdSmiVoltageMetric.AVERAGE
        )
        print(f"Average Voltage: {avg_voltage / 1000.0}V")

        # Get board voltage
        board_voltage = amdsmi.amdsmi_get_gpu_volt_metric(
            gpu,
            amdsmi.AmdSmiVoltageType.VDDBOARD,
            amdsmi.AmdSmiVoltageMetric.CURRENT
        )
        print(f"Board Voltage: {board_voltage / 1000.0}V")

    finally:
        amdsmi.amdsmi_shut_down()
    ```
    """

Enumerations

Temperature Sensor Types

class AmdSmiTemperatureType(IntEnum):
    """
    Temperature sensor types available on AMD GPUs.

    These enums identify different thermal sensors on the GPU and board components.
    Not all sensors are available on all GPU models.
    """
    # Primary GPU sensors
    EDGE = ...                    # GPU edge temperature (die edge)
    HOTSPOT = ...                 # GPU hotspot/junction temperature
    JUNCTION = ...                # GPU junction temperature
    VRAM = ...                    # VRAM temperature

    # HBM memory stack sensors (for GPUs with HBM memory)
    HBM_0 = ...                   # HBM stack 0 temperature
    HBM_1 = ...                   # HBM stack 1 temperature
    HBM_2 = ...                   # HBM stack 2 temperature
    HBM_3 = ...                   # HBM stack 3 temperature

    # PCIe and interconnect
    PLX = ...                     # PLX switch temperature

    # GPU board node sensors
    GPUBOARD_NODE_RETIMER_X = ...                # Retimer X temperature
    GPUBOARD_NODE_OAM_X_IBC = ...                # OAM X IBC temperature
    GPUBOARD_NODE_OAM_X_IBC_2 = ...              # OAM X IBC 2 temperature
    GPUBOARD_NODE_OAM_X_VDD18_VR = ...           # OAM X 1.8V VR temperature
    GPUBOARD_NODE_OAM_X_04_HBM_B_VR = ...        # OAM X 0.4V HBM B VR temperature
    GPUBOARD_NODE_OAM_X_04_HBM_D_VR = ...        # OAM X 0.4V HBM D VR temperature
    GPUBOARD_NODE_LAST = ...                     # Last GPU board node sensor

    # GPU board voltage regulator sensors
    GPUBOARD_VDDCR_VDD0 = ...                    # VDDCR VDD0 VR temperature
    GPUBOARD_VDDCR_VDD1 = ...                    # VDDCR VDD1 VR temperature
    GPUBOARD_VDDCR_VDD2 = ...                    # VDDCR VDD2 VR temperature
    GPUBOARD_VDDCR_VDD3 = ...                    # VDDCR VDD3 VR temperature
    GPUBOARD_VDDCR_SOC_A = ...                   # VDDCR SOC A VR temperature
    GPUBOARD_VDDCR_SOC_C = ...                   # VDDCR SOC C VR temperature
    GPUBOARD_VDDCR_SOCIO_A = ...                 # VDDCR SOCIO A VR temperature
    GPUBOARD_VDDCR_SOCIO_C = ...                 # VDDCR SOCIO C VR temperature
    GPUBOARD_VDD_085_HBM = ...                   # VDD 0.85V HBM VR temperature
    GPUBOARD_VDDCR_11_HBM_B = ...                # VDDCR 1.1V HBM B VR temperature
    GPUBOARD_VDDCR_11_HBM_D = ...                # VDDCR 1.1V HBM D VR temperature
    GPUBOARD_VDD_USR = ...                       # VDD USR VR temperature
    GPUBOARD_VDDIO_11_E32 = ...                  # VDDIO 1.1V E32 VR temperature
    GPUBOARD_VR_LAST = ...                       # Last GPU board VR sensor

    # Baseboard system sensors
    BASEBOARD_UBB_FPGA = ...                     # UBB FPGA temperature
    BASEBOARD_UBB_FRONT = ...                    # UBB front temperature
    BASEBOARD_UBB_BACK = ...                     # UBB back temperature
    BASEBOARD_UBB_OAM7 = ...                     # UBB OAM7 temperature
    BASEBOARD_UBB_IBC = ...                      # UBB IBC temperature
    BASEBOARD_UBB_UFPGA = ...                    # UBB UFPGA temperature
    BASEBOARD_UBB_OAM1 = ...                     # UBB OAM1 temperature
    BASEBOARD_OAM_0_1_HSC = ...                  # OAM 0-1 HSC temperature
    BASEBOARD_OAM_2_3_HSC = ...                  # OAM 2-3 HSC temperature
    BASEBOARD_OAM_4_5_HSC = ...                  # OAM 4-5 HSC temperature
    BASEBOARD_OAM_6_7_HSC = ...                  # OAM 6-7 HSC temperature
    BASEBOARD_UBB_FPGA_0V72_VR = ...             # UBB FPGA 0.72V VR temperature
    BASEBOARD_UBB_FPGA_3V3_VR = ...              # UBB FPGA 3.3V VR temperature
    BASEBOARD_RETIMER_0_1_2_3_1V2_VR = ...       # Retimer 0-1-2-3 1.2V VR temperature
    BASEBOARD_RETIMER_4_5_6_7_1V2_VR = ...       # Retimer 4-5-6-7 1.2V VR temperature
    BASEBOARD_RETIMER_0_1_0V9_VR = ...           # Retimer 0-1 0.9V VR temperature
    BASEBOARD_RETIMER_4_5_0V9_VR = ...           # Retimer 4-5 0.9V VR temperature
    BASEBOARD_RETIMER_2_3_0V9_VR = ...           # Retimer 2-3 0.9V VR temperature
    BASEBOARD_RETIMER_6_7_0V9_VR = ...           # Retimer 6-7 0.9V VR temperature
    BASEBOARD_OAM_0_1_2_3_3V3_VR = ...           # OAM 0-1-2-3 3.3V VR temperature
    BASEBOARD_OAM_4_5_6_7_3V3_VR = ...           # OAM 4-5-6-7 3.3V VR temperature
    BASEBOARD_IBC_HSC = ...                      # IBC HSC temperature
    BASEBOARD_IBC = ...                          # IBC temperature
    BASEBOARD_LAST = ...                         # Last baseboard sensor
    BASEBOARD__MAX = ...                         # Maximum temperature type value

Temperature Metric Types

class AmdSmiTemperatureMetric(IntEnum):
    """
    Temperature metric types for querying different temperature values.

    These metrics specify what type of temperature value to retrieve for a
    given sensor.
    """
    CURRENT = ...                 # Current temperature reading
    MAX = ...                     # Maximum temperature threshold
    MIN = ...                     # Minimum temperature threshold
    MAX_HYST = ...                # Maximum threshold hysteresis
    MIN_HYST = ...                # Minimum threshold hysteresis
    CRITICAL = ...                # Critical temperature threshold
    CRITICAL_HYST = ...           # Critical threshold hysteresis
    EMERGENCY = ...               # Emergency temperature threshold
    EMERGENCY_HYST = ...          # Emergency threshold hysteresis
    CRIT_MIN = ...                # Critical minimum temperature threshold
    CRIT_MIN_HYST = ...           # Critical minimum threshold hysteresis
    OFFSET = ...                  # Temperature offset value
    LOWEST = ...                  # Lowest recorded temperature
    HIGHEST = ...                 # Highest recorded temperature

Voltage Rail Types

class AmdSmiVoltageType(IntEnum):
    """
    Voltage rail types available on AMD GPUs.

    These enums identify different voltage rails that can be monitored.
    """
    VDDGFX = ...                  # GPU graphics core voltage
    VDDBOARD = ...                # Board voltage
    INVALID = ...                 # Invalid voltage type

Voltage Metric Types

class AmdSmiVoltageMetric(IntEnum):
    """
    Voltage metric types for querying different voltage values.

    These metrics specify what type of voltage value to retrieve for a
    given voltage rail.
    """
    CURRENT = ...                 # Current voltage reading
    MAX = ...                     # Maximum voltage threshold
    MIN = ...                     # Minimum voltage threshold
    MIN_CRIT = ...                # Critical minimum voltage threshold
    MAX_CRIT = ...                # Critical maximum voltage threshold
    AVERAGE = ...                 # Average voltage
    LOWEST = ...                  # Lowest recorded voltage
    HIGHEST = ...                 # Highest recorded voltage

Usage Patterns

Basic Temperature Monitoring

Monitor essential GPU temperatures:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, gpu in enumerate(devices):
        print(f"\nGPU {i} Temperatures:")

        # Edge temperature (most common)
        edge_temp = amdsmi.amdsmi_get_temp_metric(
            gpu,
            amdsmi.AmdSmiTemperatureType.EDGE,
            amdsmi.AmdSmiTemperatureMetric.CURRENT
        )
        print(f"  Edge: {edge_temp / 1000.0:.1f}°C")

        # Hotspot temperature (highest die temperature)
        try:
            hotspot_temp = amdsmi.amdsmi_get_temp_metric(
                gpu,
                amdsmi.AmdSmiTemperatureType.HOTSPOT,
                amdsmi.AmdSmiTemperatureMetric.CURRENT
            )
            print(f"  Hotspot: {hotspot_temp / 1000.0:.1f}°C")
        except amdsmi.AmdSmiLibraryException:
            print("  Hotspot: Not available")

        # Memory temperature
        try:
            mem_temp = amdsmi.amdsmi_get_temp_metric(
                gpu,
                amdsmi.AmdSmiTemperatureType.VRAM,
                amdsmi.AmdSmiTemperatureMetric.CURRENT
            )
            print(f"  VRAM: {mem_temp / 1000.0:.1f}°C")
        except amdsmi.AmdSmiLibraryException:
            print("  VRAM: Not available")

finally:
    amdsmi.amdsmi_shut_down()

Temperature Threshold Monitoring

Check temperatures against critical thresholds:

import amdsmi

def check_thermal_status(gpu_handle):
    """Check if GPU is approaching thermal limits."""
    # Get current and critical temperatures
    current = amdsmi.amdsmi_get_temp_metric(
        gpu_handle,
        amdsmi.AmdSmiTemperatureType.EDGE,
        amdsmi.AmdSmiTemperatureMetric.CURRENT
    )

    try:
        critical = amdsmi.amdsmi_get_temp_metric(
            gpu_handle,
            amdsmi.AmdSmiTemperatureType.EDGE,
            amdsmi.AmdSmiTemperatureMetric.CRITICAL
        )

        temp_c = current / 1000.0
        crit_c = critical / 1000.0
        margin = crit_c - temp_c

        print(f"Temperature: {temp_c:.1f}°C")
        print(f"Critical: {crit_c:.1f}°C")
        print(f"Margin: {margin:.1f}°C")

        if margin < 10:
            print("WARNING: Approaching critical temperature!")
            return False
        return True

    except amdsmi.AmdSmiLibraryException:
        print("Critical threshold not available")
        return True

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    for i, gpu in enumerate(devices):
        print(f"\nGPU {i} Thermal Status:")
        check_thermal_status(gpu)
finally:
    amdsmi.amdsmi_shut_down()

HBM Memory Temperature Monitoring

Monitor individual HBM memory stack temperatures:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    print("HBM Memory Temperatures:")

    # Query each HBM stack
    hbm_sensors = [
        amdsmi.AmdSmiTemperatureType.HBM_0,
        amdsmi.AmdSmiTemperatureType.HBM_1,
        amdsmi.AmdSmiTemperatureType.HBM_2,
        amdsmi.AmdSmiTemperatureType.HBM_3,
    ]

    for i, sensor in enumerate(hbm_sensors):
        try:
            temp = amdsmi.amdsmi_get_temp_metric(
                gpu,
                sensor,
                amdsmi.AmdSmiTemperatureMetric.CURRENT
            )
            print(f"  HBM Stack {i}: {temp / 1000.0:.1f}°C")
        except amdsmi.AmdSmiLibraryException:
            # HBM stack not present or not monitored
            pass

finally:
    amdsmi.amdsmi_shut_down()

Voltage Monitoring

Monitor GPU voltage rails:

import amdsmi

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, gpu in enumerate(devices):
        print(f"\nGPU {i} Voltages:")

        # GPU core voltage
        try:
            vddgfx = amdsmi.amdsmi_get_gpu_volt_metric(
                gpu,
                amdsmi.AmdSmiVoltageType.VDDGFX,
                amdsmi.AmdSmiVoltageMetric.CURRENT
            )
            print(f"  Core (VDDGFX): {vddgfx / 1000.0:.3f}V ({vddgfx}mV)")

            # Get average voltage
            avg_volt = amdsmi.amdsmi_get_gpu_volt_metric(
                gpu,
                amdsmi.AmdSmiVoltageType.VDDGFX,
                amdsmi.AmdSmiVoltageMetric.AVERAGE
            )
            print(f"  Average: {avg_volt / 1000.0:.3f}V")

        except amdsmi.AmdSmiLibraryException:
            print("  Core voltage: Not available")

        # Board voltage
        try:
            vddboard = amdsmi.amdsmi_get_gpu_volt_metric(
                gpu,
                amdsmi.AmdSmiVoltageType.VDDBOARD,
                amdsmi.AmdSmiVoltageMetric.CURRENT
            )
            print(f"  Board: {vddboard / 1000.0:.3f}V ({vddboard}mV)")
        except amdsmi.AmdSmiLibraryException:
            print("  Board voltage: Not available")

finally:
    amdsmi.amdsmi_shut_down()

Combined Temperature and Voltage Monitoring

Create a comprehensive thermal and power monitoring dashboard:

import amdsmi
import time

def print_thermal_status(gpu_handle, gpu_id):
    """Print comprehensive thermal and voltage status."""
    print(f"\n{'='*60}")
    print(f"GPU {gpu_id} - Thermal and Voltage Status")
    print(f"{'='*60}")

    # Temperature readings
    print("\nTemperatures:")
    temp_sensors = {
        "Edge": amdsmi.AmdSmiTemperatureType.EDGE,
        "Hotspot": amdsmi.AmdSmiTemperatureType.HOTSPOT,
        "VRAM": amdsmi.AmdSmiTemperatureType.VRAM,
    }

    for name, sensor in temp_sensors.items():
        try:
            temp = amdsmi.amdsmi_get_temp_metric(
                gpu_handle,
                sensor,
                amdsmi.AmdSmiTemperatureMetric.CURRENT
            )
            print(f"  {name:12s}: {temp / 1000.0:6.1f}°C")
        except amdsmi.AmdSmiLibraryException:
            print(f"  {name:12s}: N/A")

    # Voltage readings
    print("\nVoltages:")
    volt_sensors = {
        "Core (VDDGFX)": amdsmi.AmdSmiVoltageType.VDDGFX,
        "Board": amdsmi.AmdSmiVoltageType.VDDBOARD,
    }

    for name, sensor in volt_sensors.items():
        try:
            volt = amdsmi.amdsmi_get_gpu_volt_metric(
                gpu_handle,
                sensor,
                amdsmi.AmdSmiVoltageMetric.CURRENT
            )
            print(f"  {name:15s}: {volt / 1000.0:6.3f}V ({volt:5d}mV)")
        except amdsmi.AmdSmiLibraryException:
            print(f"  {name:15s}: N/A")

amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()

    # Monitor continuously
    while True:
        for i, gpu in enumerate(devices):
            print_thermal_status(gpu, i)

        print(f"\n{'='*60}")
        print("Press Ctrl+C to exit...")
        time.sleep(5)

except KeyboardInterrupt:
    print("\nMonitoring stopped.")
finally:
    amdsmi.amdsmi_shut_down()

Temperature History Tracking

Track temperature statistics over time:

import amdsmi
import time
from collections import defaultdict

class TemperatureTracker:
    """Track GPU temperature statistics."""

    def __init__(self, gpu_handle):
        self.gpu_handle = gpu_handle
        self.readings = defaultdict(list)

    def record(self):
        """Record current temperature readings."""
        sensors = {
            "edge": amdsmi.AmdSmiTemperatureType.EDGE,
            "hotspot": amdsmi.AmdSmiTemperatureType.HOTSPOT,
            "vram": amdsmi.AmdSmiTemperatureType.VRAM,
        }

        for name, sensor in sensors.items():
            try:
                temp = amdsmi.amdsmi_get_temp_metric(
                    self.gpu_handle,
                    sensor,
                    amdsmi.AmdSmiTemperatureMetric.CURRENT
                )
                self.readings[name].append(temp / 1000.0)
            except amdsmi.AmdSmiLibraryException:
                # Sensor not available on this GPU; skip it for this sample
                pass

    def get_statistics(self):
        """Get temperature statistics."""
        stats = {}
        for name, temps in self.readings.items():
            if temps:
                stats[name] = {
                    "current": temps[-1],
                    "min": min(temps),
                    "max": max(temps),
                    "avg": sum(temps) / len(temps),
                    "samples": len(temps),
                }
        return stats

# Example usage
amdsmi.amdsmi_init()
try:
    devices = amdsmi.amdsmi_get_processor_handles()
    gpu = devices[0]

    tracker = TemperatureTracker(gpu)

    # Collect samples for 60 seconds
    print("Collecting temperature data for 60 seconds...")
    for _ in range(60):
        tracker.record()
        time.sleep(1)

    # Print statistics
    print("\nTemperature Statistics:")
    # Temperature columns are 9 wide to match the "°C"-suffixed values below
    print(f"{'Sensor':<12} {'Current':>9} {'Min':>9} {'Max':>9} {'Avg':>9} {'Samples':>8}")
    print("-" * 61)

    stats = tracker.get_statistics()
    for sensor, data in stats.items():
        print(f"{sensor:<12} "
              f"{data['current']:>7.1f}°C "
              f"{data['min']:>7.1f}°C "
              f"{data['max']:>7.1f}°C "
              f"{data['avg']:>7.1f}°C "
              f"{data['samples']:>8d}")

finally:
    amdsmi.amdsmi_shut_down()
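
The tracker above keeps every sample it records. For long-running monitoring, a bounded window may be preferable. The following is a minimal sketch of that idea using `collections.deque`; the `RollingStats` class is hypothetical and not part of amdsmi, and it takes values that have already been converted to degrees Celsius.

```python
from collections import deque


class RollingStats:
    """Min/max/average over the most recent `window` samples (hypothetical helper)."""

    def __init__(self, window: int):
        # A deque with maxlen discards the oldest sample once it is full.
        self._samples = deque(maxlen=window)

    def add(self, value: float) -> None:
        self._samples.append(value)

    def summary(self) -> dict:
        if not self._samples:
            return {}
        return {
            "current": self._samples[-1],
            "min": min(self._samples),
            "max": max(self._samples),
            "avg": sum(self._samples) / len(self._samples),
            "samples": len(self._samples),
        }


# Four readings into a window of three: the first reading is dropped.
stats = RollingStats(window=3)
for temp_c in [50.0, 52.0, 54.0, 56.0]:
    stats.add(temp_c)
print(stats.summary())
```

Memory use stays constant regardless of how long the monitor runs, at the cost of losing statistics older than the window.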

Notes

Temperature Values

  • All temperature values are returned in millidegrees Celsius (m°C)
  • Divide the returned value by 1000 to convert to degrees Celsius
  • Example: A return value of 55000 represents 55.0°C

Voltage Values

  • All voltage values are returned in millivolts (mV)
  • Divide the returned value by 1000 to convert to volts (V)
  • Example: A return value of 1150 represents 1.150V
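
The two unit conversions described above can be kept in one place with small helpers. This is a minimal sketch; the function names are hypothetical and are not provided by amdsmi.

```python
def millidegrees_to_celsius(raw: int) -> float:
    """Convert a raw reading in millidegrees Celsius to degrees Celsius."""
    return raw / 1000.0


def millivolts_to_volts(raw: int) -> float:
    """Convert a raw reading in millivolts to volts."""
    return raw / 1000.0


def format_temperature(raw: int) -> str:
    """Format a raw millidegree reading with one decimal place."""
    return f"{millidegrees_to_celsius(raw):.1f}°C"


def format_voltage(raw: int) -> str:
    """Format a raw millivolt reading with three decimal places."""
    return f"{millivolts_to_volts(raw):.3f}V"


print(format_temperature(55000))  # 55.0°C
print(format_voltage(1150))       # 1.150V
```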

Sensor Availability

  • Not all sensors are available on all GPU models
  • Attempting to read an unavailable sensor will raise AmdSmiLibraryException
  • Use try-except blocks to handle sensors that may not be present
  • Common sensors (EDGE, HOTSPOT, VRAM) are available on most modern AMD GPUs
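
One way to apply the try-except advice is to probe a list of sensors once and keep only those that respond. The sketch below assumes a hypothetical helper, `probe_sensors`, that takes the reader and the "unavailable" exception types as parameters, so it runs here against a stand-in enum and reader instead of real hardware.

```python
from enum import IntEnum
from typing import Callable, Dict, Iterable, Tuple, Type


def probe_sensors(
    read: Callable[[IntEnum], int],
    sensors: Iterable[IntEnum],
    unavailable: Tuple[Type[BaseException], ...],
) -> Dict[str, int]:
    """Return {sensor name: raw reading} for every sensor that can be read.

    Sensors whose read raises one of `unavailable` are skipped.
    """
    available = {}
    for sensor in sensors:
        try:
            available[sensor.name] = read(sensor)
        except unavailable:
            continue
    return available


# Stand-in enum and reader (assumptions for illustration only).
class FakeSensor(IntEnum):
    EDGE = 0
    HOTSPOT = 1
    VRAM = 2


def fake_read(sensor: FakeSensor) -> int:
    if sensor is FakeSensor.VRAM:
        raise RuntimeError("sensor not supported")
    return 50000 + 5000 * int(sensor)


found = probe_sensors(fake_read, FakeSensor, (RuntimeError,))
print(found)
```

With real hardware, `read` would presumably be a small wrapper around `amdsmi_get_temp_metric` for one GPU handle and the `CURRENT` metric, and `unavailable` would be `(amdsmi.AmdSmiLibraryException,)`.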

Temperature Sensor Types

  • EDGE: Die edge temperature, typically the coolest part of the GPU die
  • HOTSPOT: Highest temperature on the GPU die (junction temperature)
  • JUNCTION: Similar to hotspot, represents the peak die temperature
  • VRAM: Video memory temperature
  • HBM_0/1/2/3: Individual HBM memory stack temperatures (HBM GPUs only)
  • PLX: PCIe switch temperature (if present)
  • Board sensors: Various board components like voltage regulators, retimers, etc.

Voltage Rail Types

  • VDDGFX: GPU core voltage, the primary voltage rail for the graphics engine
  • VDDBOARD: Board-level voltage supply
  • Voltage readings provide insight into power delivery and can help diagnose power-related issues

Metric Types

  • CURRENT: Real-time instantaneous reading (most commonly used)
  • MAX/MIN: Hardware-defined threshold values
  • CRITICAL/EMERGENCY: Thermal limit thresholds that may trigger throttling or shutdown
  • LOWEST/HIGHEST: Recorded minimum and maximum values since driver load
  • AVERAGE: Time-averaged value (when available)
  • HYST: Hysteresis values for threshold crossings
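
To illustrate what a hysteresis value is for, the sketch below shows an alarm that trips at a threshold and only clears once the reading falls below the threshold minus the hysteresis. The `HysteresisAlarm` class and the exact clearing rule are assumptions for illustration, not documented amdsmi or driver behavior.

```python
class HysteresisAlarm:
    """Alarm that trips at `threshold` and clears below `threshold - hysteresis`."""

    def __init__(self, threshold: float, hysteresis: float):
        if hysteresis < 0:
            raise ValueError("hysteresis must be non-negative")
        self.threshold = threshold
        self.clear_below = threshold - hysteresis
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one reading and return whether the alarm is active."""
        if not self.active and value >= self.threshold:
            self.active = True
        elif self.active and value < self.clear_below:
            self.active = False
        return self.active


# Readings in degrees Celsius hovering around a 90°C threshold.
alarm = HysteresisAlarm(threshold=90.0, hysteresis=5.0)
readings = [88.0, 90.5, 89.0, 86.0, 84.5, 89.9]
states = [alarm.update(r) for r in readings]
print(states)  # [False, True, True, True, False, False]
```

Without the hysteresis band, the readings 89.0 and 86.0 would have cleared the alarm immediately after it tripped, which leads to flapping when a temperature sits near its limit.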

Best Practices

  • Query the most common sensors (EDGE, HOTSPOT, VRAM) for typical monitoring
  • Monitor HOTSPOT temperature for thermal throttling risk
  • Check CRITICAL thresholds to understand thermal headroom
  • Use try-except for optional sensors to ensure code robustness
  • Be aware that sensor availability varies by GPU model and driver version
  • Consider polling intervals: 1-5 seconds is typical for monitoring applications
  • Temperature readings have some latency; very rapid polling may show repeated values
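
For the polling-interval point, the following sketch samples at a fixed interval while compensating for the time each read takes. `sample_at_interval` and its parameters are hypothetical; `read` stands in for any zero-argument callable that returns a raw reading, and the clock and sleep functions are injectable so the demonstration runs instantly.

```python
import time
from typing import Callable, List


def sample_at_interval(
    read: Callable[[], int],
    interval_s: float,
    count: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Collect `count` readings spaced `interval_s` seconds apart.

    Deadlines are computed from the start time, so time spent inside
    `read` does not accumulate as drift.
    """
    samples = []
    start = clock()
    for i in range(count):
        samples.append(read())
        if i + 1 < count:
            remaining = start + (i + 1) * interval_s - clock()
            if remaining > 0:
                sleep(remaining)
    return samples


# Stand-in reader and clock (assumptions for illustration only).
fake_now = [0.0]
fake_values = iter([61000, 61500, 62000])


def fake_read() -> int:
    fake_now[0] += 0.25  # pretend each read takes 250 ms
    return next(fake_values)


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


samples = sample_at_interval(
    fake_read, 2.0, 3, clock=lambda: fake_now[0], sleep=fake_sleep
)
print(samples)
```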

Performance Considerations

  • Each sensor query requires a hardware read operation
  • For frequent monitoring, consider batching sensor reads together
  • Cache threshold values (CRITICAL, MAX) as they typically don't change
  • Avoid very short polling intervals (under 100 ms) to reduce overhead
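
The advice to cache threshold values can be sketched as a small memoizing wrapper. `ThresholdCache` is a hypothetical helper, not part of amdsmi; it is demonstrated here with a stand-in reader that counts how often it is called.

```python
from typing import Callable, Dict, Hashable, Tuple


class ThresholdCache:
    """Remember threshold readings so each one is fetched only once."""

    def __init__(self, read: Callable[..., int]):
        self._read = read
        self._cache: Dict[Tuple[Hashable, ...], int] = {}

    def get(self, *key: Hashable) -> int:
        """Return the cached value for `key`, reading it on first use."""
        if key not in self._cache:
            self._cache[key] = self._read(*key)
        return self._cache[key]

    def clear(self) -> None:
        """Drop cached values if the thresholds may have changed."""
        self._cache.clear()


# Stand-in reader that records each call (assumption for illustration only).
calls = []


def fake_read(sensor: str, metric: str) -> int:
    calls.append((sensor, metric))
    return 100000


cache = ThresholdCache(fake_read)
first = cache.get("EDGE", "CRITICAL")
second = cache.get("EDGE", "CRITICAL")
print(first, second, len(calls))  # 100000 100000 1
```

In real use, one cache per GPU with the handle bound inside `read` keeps the cache keys down to the hashable `(sensor_type, metric)` enum pair. Only threshold metrics such as `CRITICAL` or `MAX` should go through the cache; `CURRENT` readings need a fresh query each time.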

Relationship to Other Monitoring Functions

  • amdsmi_get_gpu_metrics_info(): Provides comprehensive metrics including temperatures in a single call
  • amdsmi_get_power_info(): Power information complements thermal monitoring
  • Fan speed functions: Used together with temperature for cooling management
  • amdsmi_get_gpu_activity(): GPU utilization affects thermal output