GPU Memory Management

Functions for querying GPU memory information, including total capacity, current usage, and reserved/retired memory pages. These functions provide visibility into different memory types (VRAM, visible VRAM, GTT) and memory health status.

Capabilities

Query Total Memory

Get the total memory capacity for a specific memory type.

def amdsmi_get_gpu_memory_total(
    processor_handle: processor_handle,
    mem_type: AmdSmiMemoryType
) -> int:
    """
    Get total GPU memory capacity for a specific memory type.

    Returns the total amount of memory available in the specified memory pool.
    This represents the hardware capacity, not the amount currently available.

    Parameters:
    - processor_handle: Handle for the target GPU device
    - mem_type (AmdSmiMemoryType): Type of memory to query:
      - VRAM: Video RAM (total GPU memory)
      - VIS_VRAM: CPU-visible VRAM (BAR memory that CPU can directly access)
      - GTT: Graphics Translation Table (system memory accessible by GPU)

    Returns:
    - int: Total memory capacity in bytes

    Raises:
    - AmdSmiParameterException: If processor_handle or mem_type is invalid
    - AmdSmiLibraryException: On query failure

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiMemoryType

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Query total VRAM
    vram_total = amdsmi.amdsmi_get_gpu_memory_total(
        device, AmdSmiMemoryType.VRAM
    )
    print(f"Total VRAM: {vram_total / (1024**3):.2f} GB")

    # Query visible VRAM (CPU-accessible)
    vis_vram_total = amdsmi.amdsmi_get_gpu_memory_total(
        device, AmdSmiMemoryType.VIS_VRAM
    )
    print(f"Visible VRAM: {vis_vram_total / (1024**3):.2f} GB")

    # Query GTT memory
    gtt_total = amdsmi.amdsmi_get_gpu_memory_total(
        device, AmdSmiMemoryType.GTT
    )
    print(f"GTT Memory: {gtt_total / (1024**3):.2f} GB")

    amdsmi.amdsmi_shut_down()
    ```
    """

Query Memory Usage

Get the current memory usage for a specific memory type.

def amdsmi_get_gpu_memory_usage(
    processor_handle: processor_handle,
    mem_type: AmdSmiMemoryType
) -> int:
    """
    Get current GPU memory usage for a specific memory type.

    Returns the amount of memory currently in use from the specified memory pool.
    This can be used with amdsmi_get_gpu_memory_total() to calculate utilization.

    Parameters:
    - processor_handle: Handle for the target GPU device
    - mem_type (AmdSmiMemoryType): Type of memory to query:
      - VRAM: Video RAM usage
      - VIS_VRAM: CPU-visible VRAM usage
      - GTT: Graphics Translation Table usage

    Returns:
    - int: Used memory in bytes

    Raises:
    - AmdSmiParameterException: If processor_handle or mem_type is invalid
    - AmdSmiLibraryException: On query failure

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiMemoryType

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Query VRAM usage
    vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
        device, AmdSmiMemoryType.VRAM
    )
    vram_total = amdsmi.amdsmi_get_gpu_memory_total(
        device, AmdSmiMemoryType.VRAM
    )

    # Calculate utilization
    usage_gb = vram_used / (1024**3)
    total_gb = vram_total / (1024**3)
    usage_percent = (vram_used / vram_total) * 100

    print(f"VRAM Usage: {usage_gb:.2f} / {total_gb:.2f} GB ({usage_percent:.1f}%)")

    amdsmi.amdsmi_shut_down()
    ```
    """

Query Reserved Memory Pages

Get information about memory pages that have been reserved (retired) due to memory errors.

def amdsmi_get_gpu_memory_reserved_pages(
    processor_handle: processor_handle
) -> List[Dict[str, Any]]:
    """
    Get list of reserved (retired) memory pages.

    Returns information about memory pages that have been reserved due to uncorrectable
    errors. These pages are removed from the available memory pool to prevent data
    corruption. This is an important health metric for GPU memory.

    Parameters:
    - processor_handle: Handle for the target GPU device

    Returns:
    - List[dict]: List of reserved page records, each containing:
      - value (int): Record index
      - page_address (int): Physical address of the reserved page
      - page_size (int): Size of the page in bytes
      - status (AmdSmiMemoryPageStatus): Page status:
        - RESERVED: Page has been reserved and removed from use
        - PENDING: Page is pending retirement
        - UNRESERVABLE: Page cannot be reserved

    An empty list indicates no reserved pages (healthy memory).

    Raises:
    - AmdSmiParameterException: If processor_handle is invalid
    - AmdSmiLibraryException: On query failure

    Example:
    ```python
    import amdsmi
    from amdsmi import AmdSmiMemoryPageStatus

    amdsmi.amdsmi_init()
    device = amdsmi.amdsmi_get_processor_handles()[0]

    reserved_pages = amdsmi.amdsmi_get_gpu_memory_reserved_pages(device)

    if not reserved_pages:
        print("No reserved pages - memory is healthy")
    else:
        print(f"Found {len(reserved_pages)} reserved pages:")
        for page in reserved_pages:
            addr = page['page_address']
            size = page['page_size']
            status = page['status']

            status_str = {
                AmdSmiMemoryPageStatus.RESERVED: "RESERVED",
                AmdSmiMemoryPageStatus.PENDING: "PENDING",
                AmdSmiMemoryPageStatus.UNRESERVABLE: "UNRESERVABLE"
            }.get(status, f"UNKNOWN({status})")

            print(f"  [{page['value']}] Address: 0x{addr:016x}, "
                  f"Size: {size} bytes, Status: {status_str}")

    amdsmi.amdsmi_shut_down()
    ```
    """

Usage Examples

Basic Memory Monitoring

Monitor memory usage across different memory types:

import amdsmi
from amdsmi import AmdSmiMemoryType

amdsmi.amdsmi_init()

try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, device in enumerate(devices):
        print(f"\n=== GPU {i} Memory Status ===")

        # VRAM (main GPU memory)
        vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
            device, AmdSmiMemoryType.VRAM
        )
        vram_total = amdsmi.amdsmi_get_gpu_memory_total(
            device, AmdSmiMemoryType.VRAM
        )

        vram_used_gb = vram_used / (1024**3)
        vram_total_gb = vram_total / (1024**3)
        vram_percent = (vram_used / vram_total) * 100

        print(f"VRAM: {vram_used_gb:.2f} / {vram_total_gb:.2f} GB "
              f"({vram_percent:.1f}%)")

        # Visible VRAM (CPU-accessible)
        try:
            vis_vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
                device, AmdSmiMemoryType.VIS_VRAM
            )
            vis_vram_total = amdsmi.amdsmi_get_gpu_memory_total(
                device, AmdSmiMemoryType.VIS_VRAM
            )

            vis_vram_used_gb = vis_vram_used / (1024**3)
            vis_vram_total_gb = vis_vram_total / (1024**3)
            vis_vram_percent = (vis_vram_used / vis_vram_total) * 100

            print(f"Visible VRAM: {vis_vram_used_gb:.2f} / {vis_vram_total_gb:.2f} GB "
                  f"({vis_vram_percent:.1f}%)")
        except amdsmi.AmdSmiLibraryException:
            print("Visible VRAM: Not available")

        # GTT (system memory)
        try:
            gtt_used = amdsmi.amdsmi_get_gpu_memory_usage(
                device, AmdSmiMemoryType.GTT
            )
            gtt_total = amdsmi.amdsmi_get_gpu_memory_total(
                device, AmdSmiMemoryType.GTT
            )

            gtt_used_gb = gtt_used / (1024**3)
            gtt_total_gb = gtt_total / (1024**3)
            gtt_percent = (gtt_used / gtt_total) * 100

            print(f"GTT: {gtt_used_gb:.2f} / {gtt_total_gb:.2f} GB "
                  f"({gtt_percent:.1f}%)")
        except amdsmi.AmdSmiLibraryException:
            print("GTT: Not available")

finally:
    amdsmi.amdsmi_shut_down()

Memory Health Check

Check for reserved pages indicating memory errors:

import amdsmi
from amdsmi import AmdSmiMemoryPageStatus

def check_memory_health(device):
    """Check GPU memory health status."""

    reserved_pages = amdsmi.amdsmi_get_gpu_memory_reserved_pages(device)

    if not reserved_pages:
        return {
            'status': 'HEALTHY',
            'reserved_count': 0,
            'pending_count': 0,
            'total_size': 0
        }

    # Count by status
    status_counts = {
        AmdSmiMemoryPageStatus.RESERVED: 0,
        AmdSmiMemoryPageStatus.PENDING: 0,
        AmdSmiMemoryPageStatus.UNRESERVABLE: 0
    }

    total_size = 0

    for page in reserved_pages:
        status = page['status']
        status_counts[status] = status_counts.get(status, 0) + 1
        total_size += page['page_size']

    # Determine overall health
    reserved_count = status_counts[AmdSmiMemoryPageStatus.RESERVED]
    pending_count = status_counts[AmdSmiMemoryPageStatus.PENDING]

    if reserved_count + pending_count > 10:
        health_status = 'DEGRADED'
    elif reserved_count + pending_count > 0:
        health_status = 'WARNING'
    else:
        health_status = 'HEALTHY'

    return {
        'status': health_status,
        'reserved_count': reserved_count,
        'pending_count': pending_count,
        'unreservable_count': status_counts[AmdSmiMemoryPageStatus.UNRESERVABLE],
        'total_size': total_size,
        'details': reserved_pages
    }

amdsmi.amdsmi_init()

try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, device in enumerate(devices):
        print(f"\n=== GPU {i} Memory Health ===")

        health = check_memory_health(device)

        print(f"Status: {health['status']}")
        print(f"Reserved pages: {health['reserved_count']}")
        print(f"Pending pages: {health['pending_count']}")

        if health['reserved_count'] > 0:
            size_mb = health['total_size'] / (1024**2)
            print(f"Total reserved: {size_mb:.2f} MB")
            print("\nDetailed page list:")
            for page in health['details']:
                print(f"  0x{page['page_address']:016x} - "
                      f"{page['page_size']} bytes - "
                      f"Status: {page['status']}")

finally:
    amdsmi.amdsmi_shut_down()

Memory Pressure Monitoring

Monitor memory pressure and alert on high usage:

import amdsmi
from amdsmi import AmdSmiMemoryType
import time

def get_memory_pressure(device):
    """Calculate memory pressure level."""

    vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
        device, AmdSmiMemoryType.VRAM
    )
    vram_total = amdsmi.amdsmi_get_gpu_memory_total(
        device, AmdSmiMemoryType.VRAM
    )

    usage_percent = (vram_used / vram_total) * 100

    if usage_percent > 95:
        return 'CRITICAL', usage_percent
    elif usage_percent > 85:
        return 'HIGH', usage_percent
    elif usage_percent > 70:
        return 'MODERATE', usage_percent
    else:
        return 'LOW', usage_percent

amdsmi.amdsmi_init()

try:
    device = amdsmi.amdsmi_get_processor_handles()[0]

    print("Monitoring memory pressure (Ctrl+C to stop)...")

    while True:
        pressure, percent = get_memory_pressure(device)

        # Get actual values
        vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
            device, AmdSmiMemoryType.VRAM
        )
        vram_total = amdsmi.amdsmi_get_gpu_memory_total(
            device, AmdSmiMemoryType.VRAM
        )

        used_gb = vram_used / (1024**3)
        total_gb = vram_total / (1024**3)
        available_gb = (vram_total - vram_used) / (1024**3)

        timestamp = time.strftime("%H:%M:%S")

        # Display with color coding
        if pressure == 'CRITICAL':
            print(f"[{timestamp}] !!! CRITICAL !!! "
                  f"{used_gb:.2f} / {total_gb:.2f} GB ({percent:.1f}%) "
                  f"- Only {available_gb:.2f} GB available")
        elif pressure == 'HIGH':
            print(f"[{timestamp}] ** HIGH ** "
                  f"{used_gb:.2f} / {total_gb:.2f} GB ({percent:.1f}%)")
        elif pressure == 'MODERATE':
            print(f"[{timestamp}] * MODERATE * "
                  f"{used_gb:.2f} / {total_gb:.2f} GB ({percent:.1f}%)")
        else:
            print(f"[{timestamp}] OK - "
                  f"{used_gb:.2f} / {total_gb:.2f} GB ({percent:.1f}%)")

        time.sleep(2)

except KeyboardInterrupt:
    print("\nMonitoring stopped")
finally:
    amdsmi.amdsmi_shut_down()

Comprehensive Memory Report

Generate a detailed memory report:

import amdsmi
from amdsmi import AmdSmiMemoryType, AmdSmiMemoryPageStatus

def generate_memory_report(device):
    """Generate comprehensive memory report for a GPU."""

    report = {
        'vram': {},
        'vis_vram': {},
        'gtt': {},
        'health': {}
    }

    # VRAM
    try:
        vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
            device, AmdSmiMemoryType.VRAM
        )
        vram_total = amdsmi.amdsmi_get_gpu_memory_total(
            device, AmdSmiMemoryType.VRAM
        )

        report['vram'] = {
            'used_bytes': vram_used,
            'total_bytes': vram_total,
            'available_bytes': vram_total - vram_used,
            'used_gb': vram_used / (1024**3),
            'total_gb': vram_total / (1024**3),
            'available_gb': (vram_total - vram_used) / (1024**3),
            'usage_percent': (vram_used / vram_total) * 100
        }
    except Exception as e:
        report['vram']['error'] = str(e)

    # Visible VRAM
    try:
        vis_vram_used = amdsmi.amdsmi_get_gpu_memory_usage(
            device, AmdSmiMemoryType.VIS_VRAM
        )
        vis_vram_total = amdsmi.amdsmi_get_gpu_memory_total(
            device, AmdSmiMemoryType.VIS_VRAM
        )

        report['vis_vram'] = {
            'used_bytes': vis_vram_used,
            'total_bytes': vis_vram_total,
            'available_bytes': vis_vram_total - vis_vram_used,
            'used_gb': vis_vram_used / (1024**3),
            'total_gb': vis_vram_total / (1024**3),
            'available_gb': (vis_vram_total - vis_vram_used) / (1024**3),
            'usage_percent': (vis_vram_used / vis_vram_total) * 100
        }
    except Exception as e:
        report['vis_vram']['error'] = str(e)

    # GTT
    try:
        gtt_used = amdsmi.amdsmi_get_gpu_memory_usage(
            device, AmdSmiMemoryType.GTT
        )
        gtt_total = amdsmi.amdsmi_get_gpu_memory_total(
            device, AmdSmiMemoryType.GTT
        )

        report['gtt'] = {
            'used_bytes': gtt_used,
            'total_bytes': gtt_total,
            'available_bytes': gtt_total - gtt_used,
            'used_gb': gtt_used / (1024**3),
            'total_gb': gtt_total / (1024**3),
            'available_gb': (gtt_total - gtt_used) / (1024**3),
            'usage_percent': (gtt_used / gtt_total) * 100
        }
    except Exception as e:
        report['gtt']['error'] = str(e)

    # Health check
    try:
        reserved_pages = amdsmi.amdsmi_get_gpu_memory_reserved_pages(device)
        report['health']['reserved_pages'] = len(reserved_pages)
        report['health']['has_errors'] = len(reserved_pages) > 0
        report['health']['pages'] = reserved_pages
    except Exception as e:
        report['health']['error'] = str(e)

    return report

def print_memory_report(device_num, report):
    """Pretty print memory report."""

    print(f"\n{'='*70}")
    print(f"GPU {device_num} - Memory Report")
    print(f"{'='*70}")

    # VRAM
    print("\n[VRAM - Video RAM]")
    if 'error' in report['vram']:
        print(f"  Error: {report['vram']['error']}")
    else:
        vram = report['vram']
        print(f"  Total:     {vram['total_gb']:8.2f} GB")
        print(f"  Used:      {vram['used_gb']:8.2f} GB ({vram['usage_percent']:.1f}%)")
        print(f"  Available: {vram['available_gb']:8.2f} GB")

    # Visible VRAM
    print("\n[Visible VRAM - CPU-accessible]")
    if 'error' in report['vis_vram']:
        print(f"  Not available")
    else:
        vis = report['vis_vram']
        print(f"  Total:     {vis['total_gb']:8.2f} GB")
        print(f"  Used:      {vis['used_gb']:8.2f} GB ({vis['usage_percent']:.1f}%)")
        print(f"  Available: {vis['available_gb']:8.2f} GB")

    # GTT
    print("\n[GTT - Graphics Translation Table]")
    if 'error' in report['gtt']:
        print(f"  Not available")
    else:
        gtt = report['gtt']
        print(f"  Total:     {gtt['total_gb']:8.2f} GB")
        print(f"  Used:      {gtt['used_gb']:8.2f} GB ({gtt['usage_percent']:.1f}%)")
        print(f"  Available: {gtt['available_gb']:8.2f} GB")

    # Health
    print("\n[Memory Health]")
    if 'error' in report['health']:
        print(f"  Error: {report['health']['error']}")
    else:
        health = report['health']
        if health['has_errors']:
            print(f"  Status: WARNING - {health['reserved_pages']} reserved pages found")
            print("  Reserved pages indicate memory errors")
        else:
            print("  Status: HEALTHY - No reserved pages")

    print(f"\n{'='*70}")

amdsmi.amdsmi_init()

try:
    devices = amdsmi.amdsmi_get_processor_handles()

    for i, device in enumerate(devices):
        report = generate_memory_report(device)
        print_memory_report(i, report)

finally:
    amdsmi.amdsmi_shut_down()

Related Types

AmdSmiMemoryType

Memory type enumeration for querying different memory pools:

class AmdSmiMemoryType(IntEnum):
    """
    GPU memory type identifiers.

    Different memory pools accessible by the GPU and CPU.
    """
    VRAM = ...       # Video RAM - Main GPU memory (GDDR/HBM)
    VIS_VRAM = ...   # CPU-visible VRAM - Portion of VRAM accessible by CPU (BAR)
    GTT = ...        # Graphics Translation Table - System memory usable by GPU
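
Since all three members can be passed to the same query functions, a loop over them gives a one-pass capacity summary. This is a minimal sketch using only the functions documented above; pools a device does not expose are skipped via the exception the docstrings name:

import amdsmi
from amdsmi import AmdSmiMemoryType

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # Query every documented pool; skip any this device does not expose.
    for mem_type in (AmdSmiMemoryType.VRAM,
                     AmdSmiMemoryType.VIS_VRAM,
                     AmdSmiMemoryType.GTT):
        try:
            total = amdsmi.amdsmi_get_gpu_memory_total(device, mem_type)
            print(f"{mem_type.name}: {total / (1024**3):.2f} GB total")
        except amdsmi.AmdSmiLibraryException:
            print(f"{mem_type.name}: not available on this device")
finally:
    amdsmi.amdsmi_shut_down()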

AmdSmiMemoryPageStatus

Memory page status enumeration for reserved pages:

class AmdSmiMemoryPageStatus(IntEnum):
    """
    Status of reserved/retired memory pages.

    Indicates the state of pages that have experienced errors.
    """
    RESERVED = ...      # Page has been reserved (retired) due to errors
    PENDING = ...       # Page retirement is pending
    UNRESERVABLE = ...  # Page cannot be reserved
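
These status values arrive in the 'status' field of each record returned by amdsmi_get_gpu_memory_reserved_pages(). A compact way to summarize a device's retired pages is to tally that field directly (a sketch building on the documented return format):

from collections import Counter

import amdsmi

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    pages = amdsmi.amdsmi_get_gpu_memory_reserved_pages(device)

    # Tally pages per status; an empty result means healthy memory.
    by_status = Counter(page['status'] for page in pages)
    if not by_status:
        print("No reserved pages")
    for status, count in by_status.items():
        print(f"{status}: {count} page(s)")
finally:
    amdsmi.amdsmi_shut_down()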

AmdSmiVramType

VRAM technology type enumeration:

class AmdSmiVramType(IntEnum):
    """
    VRAM technology types.

    Identifies the type of memory technology used for GPU VRAM.
    """
    UNKNOWN = ...   # Unknown VRAM type
    HBM = ...       # High Bandwidth Memory (1st gen)
    HBM2 = ...      # High Bandwidth Memory 2
    HBM2E = ...     # High Bandwidth Memory 2E (Enhanced)
    HBM3 = ...      # High Bandwidth Memory 3
    DDR2 = ...      # DDR2 SDRAM
    DDR3 = ...      # DDR3 SDRAM
    DDR4 = ...      # DDR4 SDRAM
    GDDR1 = ...     # GDDR SDRAM
    GDDR2 = ...     # GDDR2 SDRAM
    GDDR3 = ...     # GDDR3 SDRAM
    GDDR4 = ...     # GDDR4 SDRAM
    GDDR5 = ...     # GDDR5 SDRAM
    GDDR6 = ...     # GDDR6 SDRAM
    GDDR7 = ...     # GDDR7 SDRAM
    MAX = ...       # Maximum value marker
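
This enumeration is not consumed by the three functions above; it typically comes back from a VRAM info query. The sketch below assumes amdsmi_get_gpu_vram_info() and its 'vram_type' key, which belong to the wider library rather than this section, so verify them against your library version:

import amdsmi
from amdsmi import AmdSmiVramType

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]

    # ASSUMPTION: amdsmi_get_gpu_vram_info() and its 'vram_type' key are
    # documented elsewhere in the library, not in this section.
    vram_info = amdsmi.amdsmi_get_gpu_vram_info(device)
    vram_type = AmdSmiVramType(vram_info['vram_type'])
    print(f"VRAM technology: {vram_type.name}")
finally:
    amdsmi.amdsmi_shut_down()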

Memory Types Explained

VRAM (Video RAM)

The main GPU memory, typically using GDDR (Graphics DDR) or HBM (High Bandwidth Memory) technology. This is the primary memory pool for GPU operations including:

  • Graphics framebuffers and textures
  • Compute kernel data
  • Command buffers
  • Shader programs

VRAM provides the highest bandwidth for GPU operations but is not directly accessible by the CPU.
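
Because VRAM is a fixed pool with the highest bandwidth, a common pattern is to check headroom before committing a large allocation. A minimal sketch built on the documented total and usage queries (the 2 GiB buffer size is illustrative):

import amdsmi
from amdsmi import AmdSmiMemoryType

def vram_headroom_bytes(device):
    """Return bytes of VRAM not currently allocated."""
    total = amdsmi.amdsmi_get_gpu_memory_total(device, AmdSmiMemoryType.VRAM)
    used = amdsmi.amdsmi_get_gpu_memory_usage(device, AmdSmiMemoryType.VRAM)
    return total - used

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    needed = 2 * 1024**3  # hypothetical 2 GiB buffer
    if vram_headroom_bytes(device) >= needed:
        print("Enough free VRAM for the buffer")
    else:
        print("Allocation may spill to GTT or fail")
finally:
    amdsmi.amdsmi_shut_down()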

Visible VRAM (VIS_VRAM)

A portion of VRAM that is CPU-visible through the PCIe Base Address Register (BAR). This memory can be:

  • Directly written by the CPU without DMA
  • Used for CPU-GPU data transfers
  • Limited in size (often smaller than total VRAM)

The size of visible VRAM depends on the PCIe BAR size, which can be configured in BIOS settings. Larger BAR sizes (Resizable BAR or Smart Access Memory) improve CPU-GPU transfer performance.
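
One practical use of these queries is a rough large-BAR check: when the CPU-visible window covers nearly all of VRAM, Resizable BAR (or Smart Access Memory) is likely active. This is a heuristic sketch, not an authoritative detection method:

import amdsmi
from amdsmi import AmdSmiMemoryType

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]

    vram = amdsmi.amdsmi_get_gpu_memory_total(device, AmdSmiMemoryType.VRAM)
    vis = amdsmi.amdsmi_get_gpu_memory_total(device, AmdSmiMemoryType.VIS_VRAM)

    ratio = vis / vram
    print(f"CPU-visible fraction of VRAM: {ratio:.0%}")
    # Heuristic: near-full visibility usually means a large BAR is configured.
    if ratio > 0.9:
        print("Resizable BAR appears to be enabled")
    else:
        print("Small BAR window (historically often 256 MiB)")
finally:
    amdsmi.amdsmi_shut_down()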

GTT (Graphics Translation Table)

System RAM that the GPU can access through the GTT aperture. This provides:

  • Overflow memory when VRAM is full
  • Staging area for CPU-GPU transfers
  • Lower bandwidth than VRAM but larger capacity

The GPU uses the GTT to map system memory pages, allowing access to data stored in system RAM. This is useful for large datasets that don't fit in VRAM.
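
Since GTT acts as overflow capacity, it can be useful to report the combined VRAM+GTT budget and watch how much of the overflow pool is already in use. A sketch using only the documented queries; sustained GTT usage alongside full VRAM often indicates spilling:

import amdsmi
from amdsmi import AmdSmiMemoryType

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]

    vram_total = amdsmi.amdsmi_get_gpu_memory_total(device, AmdSmiMemoryType.VRAM)
    gtt_total = amdsmi.amdsmi_get_gpu_memory_total(device, AmdSmiMemoryType.GTT)
    gtt_used = amdsmi.amdsmi_get_gpu_memory_usage(device, AmdSmiMemoryType.GTT)

    print(f"Combined VRAM+GTT capacity: {(vram_total + gtt_total) / (1024**3):.2f} GB")
    # Nonzero GTT usage with full VRAM suggests workloads spilling to system RAM.
    print(f"GTT in use: {gtt_used / (1024**3):.2f} GB")
finally:
    amdsmi.amdsmi_shut_down()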

Notes

  • All memory sizes are returned in bytes; divide by (1024**3) to convert to GB
  • Memory usage values report allocated memory, which may differ from the memory a workload is actively reading or writing
  • Reserved pages indicate uncorrectable memory errors (ECC or other error detection)
  • Multiple reserved pages may indicate failing memory hardware
  • Visible VRAM size depends on PCIe BAR configuration
  • GTT memory provides access to system RAM but with higher latency than VRAM
  • Memory type availability depends on GPU architecture and driver support
  • For memory health monitoring, check reserved pages regularly
  • High numbers of reserved pages (>10) suggest memory degradation
  • Reserved pages are permanently removed from the available memory pool
  • Use amdsmi_get_gpu_vram_usage() from GPU Monitoring for simpler VRAM queries
  • Memory pressure above 90% may impact GPU performance
  • Consider both VRAM and visible VRAM usage for comprehensive monitoring