Real-time event notification system for monitoring GPU events, faults, and state changes. The event handling API enables applications to receive asynchronous notifications about critical GPU events including thermal throttling, GPU resets, memory faults, process lifecycle events, and queue operations.
The AmdSmiEventReader class provides a context-manager interface for subscribing to and reading GPU events.
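The typical flow, condensed into a minimal sketch (assuming the amdsmi package is installed and at least one supported AMD GPU is visible; the event type chosen here is illustrative):
```python
import amdsmi

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
    # The context manager initializes event notification and releases it on exit
    with amdsmi.AmdSmiEventReader(device, event_types) as reader:
        # timestamp=0 returns all pending events; each dict carries the
        # originating handle, the event type name, and a message
        for event in reader.read(timestamp=0, num_elem=10):
            print(event["event"], event["message"])
finally:
    amdsmi.amdsmi_shut_down()
```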
class AmdSmiEventReader:
"""
Event reader for monitoring GPU events and notifications.
Provides a context-manager interface for subscribing to specific GPU event types
and reading event notifications asynchronously. The reader initializes event
notification for a GPU device and sets up a notification mask for the requested
event types.
Constructor:
```python
def __init__(
self,
processor_handle: processor_handle,
event_types: List[AmdSmiEvtNotificationType]
):
"""
Initialize event reader for a GPU device with specified event types.
Sets up event notification infrastructure for the specified GPU device and
configures the event notification mask to receive only the requested event
types. The event mask is constructed as a bitmask where each event type
corresponds to a specific bit position.
Parameters:
- processor_handle: Handle for the target GPU device
- event_types (List[AmdSmiEvtNotificationType]): List of event types to monitor
Raises:
- AmdSmiParameterException: If processor_handle is not valid or event_types
is not iterable or contains invalid event types
- AmdSmiLibraryException: On initialization failure
Note:
- Each event type (except NONE) sets a bit in the notification mask
- The mask is calculated as: mask |= (1 << (event_type - 1))
- Event notification is initialized immediately upon construction
- Use as a context manager with 'with' statement for automatic cleanup
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
device = devices[0]
# Create event reader with multiple event types
event_types = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Use context manager for automatic cleanup
with amdsmi.AmdSmiEventReader(device, event_types) as reader:
# Reader is initialized and ready to receive events
events = reader.read(timestamp=0)
# Event reader automatically cleaned up on exit
amdsmi.amdsmi_shut_down()
```
"""
```
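The mask construction described in the Note above can be illustrated with a small standalone sketch. The enum values below are stand-ins, not the library's real values; the formula only requires that NONE is 0 and the remaining members are consecutive integers starting at 1, as the `(event_type - 1)` shift implies:
```python
from enum import IntEnum

# Hypothetical stand-in for AmdSmiEvtNotificationType; the real values
# are defined by the amdsmi library
class EvtType(IntEnum):
    NONE = 0
    VMFAULT = 1
    THERMAL_THROTTLE = 2
    GPU_PRE_RESET = 3

def build_mask(event_types):
    """Build the notification bitmask the way the constructor describes."""
    mask = 0
    for event_type in event_types:
        if event_type != EvtType.NONE:     # NONE sets no bit
            mask |= 1 << (event_type - 1)  # event type N maps to bit N-1
    return mask

# VMFAULT sets bit 0 and GPU_PRE_RESET sets bit 2, giving 0b101
assert build_mask([EvtType.VMFAULT, EvtType.GPU_PRE_RESET]) == 0b101
```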
Methods:
- read(timestamp: int, num_elem: int = 10) -> List[Dict[str, Any]]
- stop() -> None
- __enter__() -> AmdSmiEventReader
- __exit__(exc_type, exc_value, traceback) -> None
"""Read pending GPU events from the event queue.
def read(self, timestamp: int, num_elem: int = 10) -> List[Dict[str, Any]]:
"""
Read GPU event notifications from the event queue.
Retrieves up to num_elem event notifications that have occurred since the
specified timestamp. Events are returned as a list of dictionaries containing
the processor handle, event type name, and event message.
Parameters:
- timestamp (int): Timestamp marking the start of the event collection window.
Events with timestamps greater than this value are returned. Use 0 to get
all pending events, or use the current time to get only new events.
- num_elem (int, optional): Maximum number of events to retrieve. Defaults to 10.
This determines the size of the internal event buffer.
Returns:
- List[dict]: List of event dictionaries, each containing:
- processor_handle (processor_handle): GPU device handle that generated the event
- event (str): Event type name (e.g., "THERMAL_THROTTLE", "GPU_PRE_RESET")
- message (str): Human-readable event message with additional details
Raises:
- AmdSmiLibraryException: On event retrieval failure
Note:
- Only events matching the initialized event mask are returned
- Events with type "NONE" are automatically filtered out
- The actual number of events returned may be less than num_elem
- Event timestamps are managed internally by the driver
- Messages are decoded from UTF-8 and may contain event-specific information
- Events are consumed from the queue upon reading
Example:
```python
import amdsmi
import time
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
event_types = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.VMFAULT
]
with amdsmi.AmdSmiEventReader(devices[0], event_types) as reader:
# Get all pending events
events = reader.read(timestamp=0, num_elem=20)
for event in events:
print(f"Event: {event['event']}")
print(f" Message: {event['message']}")
print(f" Device: {event['processor_handle']}")
# Wait for new events
time.sleep(5)
# Get only events that occurred in the last 5 seconds
current_time = int(time.time())
new_events = reader.read(timestamp=current_time - 5, num_elem=10)
print(f"New events: {len(new_events)}")
amdsmi.amdsmi_shut_down()
```
"""Manually stop event monitoring and clean up resources.
def stop(self) -> None:
"""
Stop GPU event notification and release resources.
Stops the event notification system for the GPU device and releases associated
resources. This method is automatically called when using the event reader as
a context manager, but can also be called manually if needed.
Parameters:
- None
Returns:
- None
Raises:
- AmdSmiLibraryException: On cleanup failure
Note:
- Automatically called by __exit__ when using context manager
- After calling stop(), the event reader should not be used further
- Safe to call multiple times (subsequent calls have no effect)
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
event_types = [amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET]
reader = amdsmi.AmdSmiEventReader(devices[0], event_types)
# Read some events
events = reader.read(timestamp=0)
# Manually stop when done
reader.stop()
amdsmi.amdsmi_shut_down()
```
"""The AmdSmiEvtNotificationType enum defines all available GPU event types.
class AmdSmiEvtNotificationType(IntEnum):
"""
GPU event notification types.
Enumeration of all available GPU event notification types that can be monitored
through the AmdSmiEventReader. Events represent various GPU state changes,
faults, and lifecycle events.
Values:
- NONE: No event (used internally, filtered from results)
- VMFAULT: Virtual memory fault occurred
- THERMAL_THROTTLE: GPU entered thermal throttling state due to high temperature
- GPU_PRE_RESET: GPU is about to be reset (notification before reset)
- GPU_POST_RESET: GPU has been reset (notification after reset completion)
- MIGRATE_START: Memory migration operation started
- MIGRATE_END: Memory migration operation completed
- PAGE_FAULT_START: Page fault handling started
- PAGE_FAULT_END: Page fault handling completed
- QUEUE_EVICTION: Compute queue was evicted from GPU
- QUEUE_RESTORE: Compute queue was restored to GPU
- UNMAP_FROM_GPU: Memory unmapped from GPU address space
- PROCESS_START: Process started using GPU
- PROCESS_END: Process stopped using GPU
Usage:
Event types are passed as a list to AmdSmiEventReader to specify which events
to monitor. Multiple event types can be monitored simultaneously by including
them in the event_types list.
Example:
```python
import amdsmi
# Monitor thermal and reset events
critical_events = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Monitor memory-related events
memory_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END
]
# Monitor process lifecycle events
process_events = [
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
# Monitor queue management events
queue_events = [
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE
]
```
Note:
- Not all event types may be supported on all GPU hardware
- Event availability depends on GPU generation and driver version
- The NONE event type is used internally and should not be monitored
- Events are delivered asynchronously as they occur
- Multiple event readers can monitor different event types on the same GPU
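Example (multiple readers on one GPU):
A sketch of the last point above, creating two independent readers with different event masks on the same device:
```python
import amdsmi

amdsmi.amdsmi_init()
device = amdsmi.amdsmi_get_processor_handles()[0]
# Each reader gets its own notification mask for the same GPU
thermal_reader = amdsmi.AmdSmiEventReader(
    device, [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE])
fault_reader = amdsmi.AmdSmiEventReader(
    device, [amdsmi.AmdSmiEvtNotificationType.VMFAULT])
try:
    thermal_events = thermal_reader.read(timestamp=0)
    fault_events = fault_reader.read(timestamp=0)
finally:
    thermal_reader.stop()
    fault_reader.stop()
    amdsmi.amdsmi_shut_down()
```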
"""Monitor thermal throttling events on a GPU:
import amdsmi
import time
def monitor_thermal_events(duration=60):
"""Monitor GPU thermal throttling events for a specified duration."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
device = devices[0]
print(f"Monitoring thermal events for {duration} seconds...")
event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
with amdsmi.AmdSmiEventReader(device, event_types) as reader:
start_time = time.time()
event_count = 0
while time.time() - start_time < duration:
# Read events every second
events = reader.read(timestamp=0, num_elem=10)
for event in events:
event_count += 1
print(f"\n[{time.time() - start_time:.1f}s] Thermal Event #{event_count}")
print(f" Event Type: {event['event']}")
print(f" Message: {event['message']}")
time.sleep(1)
print(f"\nTotal thermal events: {event_count}")
amdsmi.amdsmi_shut_down()
# Run monitor
monitor_thermal_events(duration=30)

Monitor GPU reset events with pre- and post-reset notifications:
import amdsmi
import time
from datetime import datetime
def monitor_gpu_resets():
"""Monitor GPU reset events and track reset timing."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
reset_events = [
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
print("GPU Reset Monitor - Press Ctrl+C to stop\n")
readers = []
try:
# Create reader for each GPU
for idx, device in enumerate(devices):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
reader = amdsmi.AmdSmiEventReader(device, reset_events)
readers.append((idx, bdf, reader))
print(f"Monitoring GPU {idx} ({bdf})")
print("\nWaiting for reset events...\n")
reset_tracker = {} # Track reset timing per GPU
while True:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=5)
for event in events:
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
if event['event'] == 'GPU_PRE_RESET':
print(f"[{timestamp}] GPU {gpu_idx} ({bdf})")
print(f" PRE-RESET: GPU is about to reset")
print(f" Message: {event['message']}")
reset_tracker[gpu_idx] = time.time()
elif event['event'] == 'GPU_POST_RESET':
print(f"[{timestamp}] GPU {gpu_idx} ({bdf})")
print(f" POST-RESET: GPU reset completed")
print(f" Message: {event['message']}")
if gpu_idx in reset_tracker:
duration = time.time() - reset_tracker[gpu_idx]
print(f" Reset Duration: {duration:.2f} seconds")
del reset_tracker[gpu_idx]
print()
time.sleep(1)
except KeyboardInterrupt:
print("\nMonitoring stopped by user")
finally:
# Clean up readers
for _, _, reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run monitor
monitor_gpu_resets()

Monitor multiple event types simultaneously:
import amdsmi
import time
from collections import defaultdict
def comprehensive_event_monitor(duration=300):
"""Monitor all GPU events and generate statistics."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Monitor all important event types
all_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_START,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_END,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU,
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print(f"Comprehensive GPU Event Monitor")
print(f"Duration: {duration} seconds")
print(f"Monitoring {len(devices)} GPU(s)")
print(f"Event types: {len(all_events)}\n")
# Statistics tracking
event_stats = defaultdict(lambda: defaultdict(int))
event_details = []
readers = []
start_time = time.time()  # set before try so the finally block can always report duration
try:
# Create readers for all devices
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, all_events)
readers.append(reader)
start_time = time.time()
last_report = start_time
while time.time() - start_time < duration:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=20)
for event in events:
event_type = event['event']
event_stats[idx][event_type] += 1
event_details.append({
'gpu': idx,
'time': time.time() - start_time,
'type': event_type,
'message': event['message']
})
# Print event in real-time
print(f"[{time.time() - start_time:.1f}s] GPU {idx}: {event_type}")
print(f" {event['message']}")
# Print periodic summary
if time.time() - last_report >= 60:
print("\n--- Event Summary (Last 60s) ---")
total = sum(sum(counts.values()) for counts in event_stats.values())
print(f"Total events: {total}")
last_report = time.time()
print()
time.sleep(1)
except KeyboardInterrupt:
print("\nMonitoring interrupted by user")
finally:
# Clean up
for reader in readers:
reader.stop()
# Print final statistics
print("\n" + "=" * 80)
print("FINAL EVENT STATISTICS")
print("=" * 80)
for gpu_idx in sorted(event_stats.keys()):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(devices[gpu_idx])
print(f"\nGPU {gpu_idx} ({bdf}):")
if not event_stats[gpu_idx]:
print(" No events recorded")
continue
for event_type, count in sorted(event_stats[gpu_idx].items(),
key=lambda x: x[1],
reverse=True):
print(f" {event_type}: {count}")
total = sum(event_stats[gpu_idx].values())
print(f" Total: {total}")
# Overall statistics
overall_total = sum(sum(counts.values()) for counts in event_stats.values())
print(f"\nOverall Total Events: {overall_total}")
print(f"Monitoring Duration: {time.time() - start_time:.1f} seconds")
print(f"Events per minute: {overall_total / ((time.time() - start_time) / 60):.2f}")
amdsmi.amdsmi_shut_down()
return event_details, event_stats
# Run comprehensive monitor
details, stats = comprehensive_event_monitor(duration=120)

Specifically monitor memory-related events:
import amdsmi
import time
def monitor_memory_events():
"""Monitor memory faults and page events."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
memory_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU
]
print("Memory Event Monitor")
print("=" * 80)
fault_count = 0
page_fault_active = {}
try:
readers = []
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, memory_events)
readers.append(reader)
print("Monitoring memory events. Press Ctrl+C to stop.\n")
while True:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=10)
for event in events:
event_type = event['event']
if event_type == 'VMFAULT':
fault_count += 1
print(f"\n!!! VM FAULT #{fault_count} on GPU {idx} !!!")
print(f" Message: {event['message']}")
print(f" Time: {time.strftime('%H:%M:%S')}")
elif event_type == 'PAGE_FAULT_START':
page_fault_active[idx] = time.time()
print(f"\nPage Fault Started on GPU {idx}")
print(f" Message: {event['message']}")
elif event_type == 'PAGE_FAULT_END':
if idx in page_fault_active:
duration = time.time() - page_fault_active[idx]
print(f"\nPage Fault Resolved on GPU {idx}")
print(f" Duration: {duration:.4f} seconds")
print(f" Message: {event['message']}")
del page_fault_active[idx]
elif event_type == 'UNMAP_FROM_GPU':
print(f"\nMemory Unmapped from GPU {idx}")
print(f" Message: {event['message']}")
time.sleep(0.5) # Poll more frequently for memory events
except KeyboardInterrupt:
print("\n\nMonitoring stopped")
print(f"Total VM faults: {fault_count}")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run memory monitor
monitor_memory_events()

Track process start and stop events:
import amdsmi
import time
from datetime import datetime
def monitor_process_lifecycle():
"""Monitor GPU process start and stop events."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
process_events = [
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print("GPU Process Lifecycle Monitor")
print("=" * 80)
print()
active_processes = {} # Track process start times
try:
readers = []
for idx, device in enumerate(devices):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
reader = amdsmi.AmdSmiEventReader(device, process_events)
readers.append((idx, bdf, reader))
print(f"Monitoring GPU {idx} ({bdf})")
print("\nPress Ctrl+C to stop and view summary\n")
process_count = 0
while True:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=10)
for event in events:
timestamp = datetime.now().strftime("%H:%M:%S")
if event['event'] == 'PROCESS_START':
process_count += 1
process_key = f"GPU{gpu_idx}_{process_count}"
active_processes[process_key] = {
'gpu': gpu_idx,
'bdf': bdf,
'start_time': time.time(),
'message': event['message']
}
print(f"[{timestamp}] Process Started on GPU {gpu_idx}")
print(f" BDF: {bdf}")
print(f" Details: {event['message']}")
print(f" Active processes: {len(active_processes)}")
print()
elif event['event'] == 'PROCESS_END':
print(f"[{timestamp}] Process Ended on GPU {gpu_idx}")
print(f" BDF: {bdf}")
print(f" Details: {event['message']}")
# Try to match with started process
matching_key = None
for key, proc in active_processes.items():
if proc['gpu'] == gpu_idx:
matching_key = key
break
if matching_key:
proc = active_processes[matching_key]
duration = time.time() - proc['start_time']
print(f" Duration: {duration:.2f} seconds")
del active_processes[matching_key]
print(f" Active processes: {len(active_processes)}")
print()
time.sleep(1)
except KeyboardInterrupt:
print("\n" + "=" * 80)
print("PROCESS LIFECYCLE SUMMARY")
print("=" * 80)
if active_processes:
print(f"\nStill active processes: {len(active_processes)}")
for key, proc in active_processes.items():
duration = time.time() - proc['start_time']
print(f" GPU {proc['gpu']}: Running for {duration:.2f} seconds")
else:
print("\nNo active processes")
finally:
for _, _, reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run process monitor
monitor_process_lifecycle()

Log all events to a file for later analysis:
import amdsmi
import time
import json
from datetime import datetime
def event_logger(output_file="gpu_events.log", duration=300):
"""Log all GPU events to a file with timestamps."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
all_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_START,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_END,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU,
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print(f"GPU Event Logger")
print(f"Output file: {output_file}")
print(f"Duration: {duration} seconds")
print(f"Monitoring {len(devices)} GPU(s)\n")
readers = []
event_log = []
start_time = time.time()  # set before try so the finally block can always report duration
try:
# Create readers
for idx, device in enumerate(devices):
reader = amdsmi.AmdSmiEventReader(device, all_events)
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
readers.append((idx, bdf, reader))
start_time = time.time()
event_count = 0
print("Logging events... Press Ctrl+C to stop early\n")
while time.time() - start_time < duration:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=20)
for event in events:
event_count += 1
log_entry = {
'event_id': event_count,
'timestamp': datetime.now().isoformat(),
'elapsed_seconds': time.time() - start_time,
'gpu_index': gpu_idx,
'gpu_bdf': bdf,
'event_type': event['event'],
'message': event['message']
}
event_log.append(log_entry)
print(f"[{event_count}] {log_entry['timestamp']} - "
f"GPU {gpu_idx}: {event['event']}")
time.sleep(1)
except KeyboardInterrupt:
print("\nLogging stopped by user")
finally:
# Clean up readers
for _, _, reader in readers:
reader.stop()
# Write log to file
print(f"\nWriting {len(event_log)} events to {output_file}...")
with open(output_file, 'w') as f:
# Write metadata
metadata = {
'log_created': datetime.now().isoformat(),
'total_events': len(event_log),
'gpu_count': len(devices),
'duration_seconds': time.time() - start_time,
'events': event_log
}
json.dump(metadata, f, indent=2)
print(f"Log file written successfully")
print(f"Total events logged: {len(event_log)}")
amdsmi.amdsmi_shut_down()
return event_log
# Run logger
logged_events = event_logger(output_file="gpu_events.log", duration=120)

Monitor for critical events and trigger alerts:
import amdsmi
import time
from datetime import datetime
def critical_event_alerter(alert_callback=None):
"""Monitor critical GPU events and trigger alerts."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Define critical events
critical_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Severity levels
severity_map = {
'VMFAULT': 'CRITICAL',
'THERMAL_THROTTLE': 'WARNING',
'GPU_PRE_RESET': 'CRITICAL',
'GPU_POST_RESET': 'WARNING'
}
def default_alert(event_info):
"""Default alert handler - prints to console."""
print("\n" + "!" * 80)
print(f"ALERT: {event_info['severity']} - {event_info['event_type']}")
print("!" * 80)
print(f"Time: {event_info['timestamp']}")
print(f"GPU: {event_info['gpu_index']} ({event_info['gpu_bdf']})")
print(f"Message: {event_info['message']}")
print("!" * 80 + "\n")
if alert_callback is None:
alert_callback = default_alert
print("Critical Event Alert System")
print("=" * 80)
print(f"Monitoring {len(devices)} GPU(s)")
print(f"Critical event types: {len(critical_events)}")
print("\nPress Ctrl+C to stop\n")
readers = []
alert_count = 0
try:
# Create readers
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, critical_events)
readers.append(reader)
while True:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=10)
for event in events:
alert_count += 1
event_type = event['event']
bdf = amdsmi.amdsmi_get_gpu_device_bdf(devices[idx])
event_info = {
'alert_id': alert_count,
'timestamp': datetime.now().isoformat(),
'gpu_index': idx,
'gpu_bdf': bdf,
'event_type': event_type,
'severity': severity_map.get(event_type, 'UNKNOWN'),
'message': event['message']
}
# Trigger alert callback
alert_callback(event_info)
time.sleep(1)
except KeyboardInterrupt:
print(f"\nAlert system stopped")
print(f"Total alerts: {alert_count}")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Example with custom alert handler
def custom_alert_handler(event_info):
"""Custom alert handler - could send email, SMS, etc."""
if event_info['severity'] == 'CRITICAL':
print(f"\n*** CRITICAL ALERT #{event_info['alert_id']} ***")
print(f"GPU {event_info['gpu_index']}: {event_info['event_type']}")
print(f"Action: Notify system administrator")
# Add your notification logic here (email, SMS, webhook, etc.)
else:
print(f"\nWarning: {event_info['event_type']} on GPU {event_info['gpu_index']}")
# Run alerter with custom handler
critical_event_alerter(alert_callback=custom_alert_handler)

Combine event monitoring with health checks:
import amdsmi
import time
def health_aware_event_monitor():
"""Monitor events and check GPU health when events occur."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Monitor health-related events
health_events = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
readers = []
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, health_events)
readers.append(reader)
print("Event-Driven GPU Health Monitor\n")
try:
while True:
for idx, reader in enumerate(readers):
device = devices[idx]
events = reader.read(timestamp=0, num_elem=5)
if events:
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
print(f"\n--- GPU {idx} ({bdf}) Event Detected ---")
for event in events:
print(f"Event: {event['event']}")
# Perform health check
print("\nGPU Health Status:")
try:
# Check temperature
temp = amdsmi.amdsmi_get_temp_metric(
device,
amdsmi.AmdSmiTemperatureType.EDGE,
amdsmi.AmdSmiTemperatureMetric.CURRENT
)
print(f" Temperature: {temp/1000:.1f}°C")
except Exception:  # bare except would also swallow KeyboardInterrupt
print(" Temperature: N/A")
try:
# Check power
power = amdsmi.amdsmi_get_power_info(device)
print(f" Power: {power.get('average_socket_power', 0)/1000000:.1f}W")
except Exception:
print(" Power: N/A")
try:
# Check activity
activity = amdsmi.amdsmi_get_gpu_activity(device)
print(f" GFX Activity: {activity.get('gfx_activity', 0)}%")
except Exception:
print(" Activity: N/A")
print()
time.sleep(2)
except KeyboardInterrupt:
print("\nMonitoring stopped")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run health monitor
health_aware_event_monitor()

Proper use of the event reader as a context manager:
import amdsmi
import time
def safe_event_monitoring():
"""Demonstrate safe event monitoring with context managers."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
if not devices:
print("No GPU devices found")
amdsmi.amdsmi_shut_down()
return
device = devices[0]
# Define events to monitor
events_to_monitor = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.VMFAULT
]
print("Safe Event Monitoring with Context Manager\n")
# Using context manager ensures proper cleanup
with amdsmi.AmdSmiEventReader(device, events_to_monitor) as reader:
print("Event reader initialized and monitoring...")
for i in range(10): # Monitor for 10 iterations
events = reader.read(timestamp=0, num_elem=10)
if events:
print(f"\nIteration {i+1}: {len(events)} event(s) received")
for event in events:
print(f" {event['event']}: {event['message']}")
else:
print(f"Iteration {i+1}: No events")
time.sleep(1)
# Reader automatically stopped and cleaned up here
print("\nEvent reader automatically cleaned up")
amdsmi.amdsmi_shut_down()
# Run safe monitoring
safe_event_monitoring()

Notes:
- Use AmdSmiEventReader as a context manager (with statement) to ensure proper cleanup; see the sketch below
- The num_elem parameter controls the maximum number of events returned per read; adjust it based on the expected event frequency
- Call stop(), or use the context manager, to properly release event notification resources
- Larger num_elem values consume more memory
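A short polling loop that applies these practices; the one-second poll interval, 30-second duration, and num_elem of 5 are illustrative choices, not library requirements:
```python
import amdsmi
import time

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
    # The context manager guarantees notification resources are released
    # even if read() raises
    with amdsmi.AmdSmiEventReader(device, event_types) as reader:
        deadline = time.time() + 30
        while time.time() < deadline:
            # A small num_elem keeps the internal event buffer small when
            # the expected event rate is low
            for event in reader.read(timestamp=0, num_elem=5):
                print(f"{event['event']}: {event['message']}")
            time.sleep(1)
finally:
    amdsmi.amdsmi_shut_down()
```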