Real-time event notification system for monitoring GPU events, faults, and state changes. The event handling API enables applications to receive asynchronous notifications about critical GPU events including thermal throttling, GPU resets, memory faults, process lifecycle events, and queue operations.
The AmdSmiEventReader class provides a context-manager interface for subscribing to and reading GPU events.
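The typical flow, condensed into a minimal sketch (assuming the amdsmi package is installed and at least one supported AMD GPU is visible; the event type chosen here is illustrative):
```python
import amdsmi

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
    # The context manager initializes event notification and releases it on exit
    with amdsmi.AmdSmiEventReader(device, event_types) as reader:
        # timestamp=0 returns all pending events; each dict carries the
        # originating handle, the event type name, and a message
        for event in reader.read(timestamp=0, num_elem=10):
            print(event["event"], event["message"])
finally:
    amdsmi.amdsmi_shut_down()
```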
class AmdSmiEventReader:
"""
Event reader for monitoring GPU events and notifications.
Provides a context-manager interface for subscribing to specific GPU event types
and reading event notifications asynchronously. The reader initializes event
notification for a GPU device and sets up a notification mask for the requested
event types.
Constructor:
```python
def __init__(
self,
processor_handle: processor_handle,
event_types: List[AmdSmiEvtNotificationType]
):
"""
Initialize event reader for a GPU device with specified event types.
Sets up event notification infrastructure for the specified GPU device and
configures the event notification mask to receive only the requested event
types. The event mask is constructed as a bitmask where each event type
corresponds to a specific bit position.
Parameters:
- processor_handle: Handle for the target GPU device
- event_types (List[AmdSmiEvtNotificationType]): List of event types to monitor
Raises:
- AmdSmiParameterException: If processor_handle is not valid or event_types
is not iterable or contains invalid event types
- AmdSmiLibraryException: On initialization failure
Note:
- Each event type (except NONE) sets a bit in the notification mask
- The mask is calculated as: mask |= (1 << (event_type - 1))
- Event notification is initialized immediately upon construction
- Use as a context manager with 'with' statement for automatic cleanup
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
device = devices[0]
# Create event reader with multiple event types
event_types = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Use context manager for automatic cleanup
with amdsmi.AmdSmiEventReader(device, event_types) as reader:
# Reader is initialized and ready to receive events
events = reader.read(timestamp=0)
# Event reader automatically cleaned up on exit
amdsmi.amdsmi_shut_down()
```
"""
```
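The mask construction described in the Note above can be illustrated with a small standalone sketch. The enum values below are stand-ins, not the library's real values; the formula only requires that NONE is 0 and the remaining members are consecutive integers starting at 1, as the `(event_type - 1)` shift implies:
```python
from enum import IntEnum

# Hypothetical stand-in for AmdSmiEvtNotificationType; the real values
# are defined by the amdsmi library
class EvtType(IntEnum):
    NONE = 0
    VMFAULT = 1
    THERMAL_THROTTLE = 2
    GPU_PRE_RESET = 3

def build_mask(event_types):
    """Build the notification bitmask the way the constructor describes."""
    mask = 0
    for event_type in event_types:
        if event_type != EvtType.NONE:     # NONE sets no bit
            mask |= 1 << (event_type - 1)  # event type N maps to bit N-1
    return mask

# VMFAULT sets bit 0 and GPU_PRE_RESET sets bit 2, giving 0b101
assert build_mask([EvtType.VMFAULT, EvtType.GPU_PRE_RESET]) == 0b101
```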
Methods:
- read(timestamp: int, num_elem: int = 10) -> List[Dict[str, Any]]
- stop() -> None
- __enter__() -> AmdSmiEventReader
- __exit__(exc_type, exc_value, traceback) -> None
"""Read pending GPU events from the event queue.
def read(self, timestamp: int, num_elem: int = 10) -> List[Dict[str, Any]]:
"""
Read GPU event notifications from the event queue.
Retrieves up to num_elem event notifications that have occurred since the
specified timestamp. Events are returned as a list of dictionaries containing
the processor handle, event type name, and event message.
Parameters:
- timestamp (int): Timestamp marking the start of the event collection window.
Events with timestamps greater than this value are returned. Use 0 to get
all pending events, or use the current time to get only new events.
- num_elem (int, optional): Maximum number of events to retrieve. Defaults to 10.
This determines the size of the internal event buffer.
Returns:
- List[dict]: List of event dictionaries, each containing:
- processor_handle (processor_handle): GPU device handle that generated the event
- event (str): Event type name (e.g., "THERMAL_THROTTLE", "GPU_PRE_RESET")
- message (str): Human-readable event message with additional details
Raises:
- AmdSmiLibraryException: On event retrieval failure
Note:
- Only events matching the initialized event mask are returned
- Events with type "NONE" are automatically filtered out
- The actual number of events returned may be less than num_elem
- Event timestamps are managed internally by the driver
- Messages are decoded from UTF-8 and may contain event-specific information
- Events are consumed from the queue upon reading
Example:
```python
import amdsmi
import time
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
event_types = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.VMFAULT
]
with amdsmi.AmdSmiEventReader(devices[0], event_types) as reader:
# Get all pending events
events = reader.read(timestamp=0, num_elem=20)
for event in events:
print(f"Event: {event['event']}")
print(f" Message: {event['message']}")
print(f" Device: {event['processor_handle']}")
# Wait for new events
time.sleep(5)
# Get only events that occurred in the last 5 seconds
current_time = int(time.time())
new_events = reader.read(timestamp=current_time - 5, num_elem=10)
print(f"New events: {len(new_events)}")
amdsmi.amdsmi_shut_down()
```
"""Manually stop event monitoring and clean up resources.
def stop(self) -> None:
"""
Stop GPU event notification and release resources.
Stops the event notification system for the GPU device and releases associated
resources. This method is automatically called when using the event reader as
a context manager, but can also be called manually if needed.
Parameters:
- None
Returns:
- None
Raises:
- AmdSmiLibraryException: On cleanup failure
Note:
- Automatically called by __exit__ when using context manager
- After calling stop(), the event reader should not be used further
- Safe to call multiple times (subsequent calls have no effect)
Example:
```python
import amdsmi
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
event_types = [amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET]
reader = amdsmi.AmdSmiEventReader(devices[0], event_types)
# Read some events
events = reader.read(timestamp=0)
# Manually stop when done
reader.stop()
amdsmi.amdsmi_shut_down()
```
"""The AmdSmiEvtNotificationType enum defines all available GPU event types.
class AmdSmiEvtNotificationType(IntEnum):
"""
GPU event notification types.
Enumeration of all available GPU event notification types that can be monitored
through the AmdSmiEventReader. Events represent various GPU state changes,
faults, and lifecycle events.
Values:
- NONE: No event (used internally, filtered from results)
- VMFAULT: Virtual memory fault occurred
- THERMAL_THROTTLE: GPU entered thermal throttling state due to high temperature
- GPU_PRE_RESET: GPU is about to be reset (notification before reset)
- GPU_POST_RESET: GPU has been reset (notification after reset completion)
- MIGRATE_START: Memory migration operation started
- MIGRATE_END: Memory migration operation completed
- PAGE_FAULT_START: Page fault handling started
- PAGE_FAULT_END: Page fault handling completed
- QUEUE_EVICTION: Compute queue was evicted from GPU
- QUEUE_RESTORE: Compute queue was restored to GPU
- UNMAP_FROM_GPU: Memory unmapped from GPU address space
- PROCESS_START: Process started using GPU
- PROCESS_END: Process stopped using GPU
Usage:
Event types are passed as a list to AmdSmiEventReader to specify which events
to monitor. Multiple event types can be monitored simultaneously by including
them in the event_types list.
Example:
```python
import amdsmi
# Monitor thermal and reset events
critical_events = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Monitor memory-related events
memory_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END
]
# Monitor process lifecycle events
process_events = [
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
# Monitor queue management events
queue_events = [
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE
]
```
Note:
- Not all event types may be supported on all GPU hardware
- Event availability depends on GPU generation and driver version
- The NONE event type is used internally and should not be monitored
- Events are delivered asynchronously as they occur
- Multiple event readers can monitor different event types on the same GPU
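Example (multiple readers on one GPU):
A sketch of the last point above, creating two independent readers with different event masks on the same device:
```python
import amdsmi

amdsmi.amdsmi_init()
device = amdsmi.amdsmi_get_processor_handles()[0]
# Each reader gets its own notification mask for the same GPU
thermal_reader = amdsmi.AmdSmiEventReader(
    device, [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE])
fault_reader = amdsmi.AmdSmiEventReader(
    device, [amdsmi.AmdSmiEvtNotificationType.VMFAULT])
try:
    thermal_events = thermal_reader.read(timestamp=0)
    fault_events = fault_reader.read(timestamp=0)
finally:
    thermal_reader.stop()
    fault_reader.stop()
    amdsmi.amdsmi_shut_down()
```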
"""Monitor thermal throttling events on a GPU:
import amdsmi
import time
def monitor_thermal_events(duration=60):
"""Monitor GPU thermal throttling events for a specified duration."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
device = devices[0]
print(f"Monitoring thermal events for {duration} seconds...")
event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
with amdsmi.AmdSmiEventReader(device, event_types) as reader:
start_time = time.time()
event_count = 0
while time.time() - start_time < duration:
# Read events every second
events = reader.read(timestamp=0, num_elem=10)
for event in events:
event_count += 1
print(f"\n[{time.time() - start_time:.1f}s] Thermal Event #{event_count}")
print(f" Event Type: {event['event']}")
print(f" Message: {event['message']}")
time.sleep(1)
print(f"\nTotal thermal events: {event_count}")
amdsmi.amdsmi_shut_down()
# Run monitor
monitor_thermal_events(duration=30)

Monitor GPU reset events with pre- and post-reset notifications:
import amdsmi
import time
from datetime import datetime
def monitor_gpu_resets():
"""Monitor GPU reset events and track reset timing."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
reset_events = [
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
print("GPU Reset Monitor - Press Ctrl+C to stop\n")
readers = []
try:
# Create reader for each GPU
for idx, device in enumerate(devices):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
reader = amdsmi.AmdSmiEventReader(device, reset_events)
readers.append((idx, bdf, reader))
print(f"Monitoring GPU {idx} ({bdf})")
print("\nWaiting for reset events...\n")
reset_tracker = {} # Track reset timing per GPU
while True:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=5)
for event in events:
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
if event['event'] == 'GPU_PRE_RESET':
print(f"[{timestamp}] GPU {gpu_idx} ({bdf})")
print(f" PRE-RESET: GPU is about to reset")
print(f" Message: {event['message']}")
reset_tracker[gpu_idx] = time.time()
elif event['event'] == 'GPU_POST_RESET':
print(f"[{timestamp}] GPU {gpu_idx} ({bdf})")
print(f" POST-RESET: GPU reset completed")
print(f" Message: {event['message']}")
if gpu_idx in reset_tracker:
duration = time.time() - reset_tracker[gpu_idx]
print(f" Reset Duration: {duration:.2f} seconds")
del reset_tracker[gpu_idx]
print()
time.sleep(1)
except KeyboardInterrupt:
print("\nMonitoring stopped by user")
finally:
# Clean up readers
for _, _, reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run monitor
monitor_gpu_resets()

Monitor multiple event types simultaneously:
import amdsmi
import time
from collections import defaultdict
def comprehensive_event_monitor(duration=300):
"""Monitor all GPU events and generate statistics."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Monitor all important event types
all_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_START,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_END,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU,
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print(f"Comprehensive GPU Event Monitor")
print(f"Duration: {duration} seconds")
print(f"Monitoring {len(devices)} GPU(s)")
print(f"Event types: {len(all_events)}\n")
# Statistics tracking
event_stats = defaultdict(lambda: defaultdict(int))
event_details = []
readers = []
start_time = time.time()  # set before try so the finally block can always report duration
try:
# Create readers for all devices
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, all_events)
readers.append(reader)
start_time = time.time()
last_report = start_time
while time.time() - start_time < duration:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=20)
for event in events:
event_type = event['event']
event_stats[idx][event_type] += 1
event_details.append({
'gpu': idx,
'time': time.time() - start_time,
'type': event_type,
'message': event['message']
})
# Print event in real-time
print(f"[{time.time() - start_time:.1f}s] GPU {idx}: {event_type}")
print(f" {event['message']}")
# Print periodic summary
if time.time() - last_report >= 60:
print("\n--- Event Summary (Last 60s) ---")
total = sum(sum(counts.values()) for counts in event_stats.values())
print(f"Total events: {total}")
last_report = time.time()
print()
time.sleep(1)
except KeyboardInterrupt:
print("\nMonitoring interrupted by user")
finally:
# Clean up
for reader in readers:
reader.stop()
# Print final statistics
print("\n" + "=" * 80)
print("FINAL EVENT STATISTICS")
print("=" * 80)
for gpu_idx in sorted(event_stats.keys()):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(devices[gpu_idx])
print(f"\nGPU {gpu_idx} ({bdf}):")
if not event_stats[gpu_idx]:
print(" No events recorded")
continue
for event_type, count in sorted(event_stats[gpu_idx].items(),
key=lambda x: x[1],
reverse=True):
print(f" {event_type}: {count}")
total = sum(event_stats[gpu_idx].values())
print(f" Total: {total}")
# Overall statistics
overall_total = sum(sum(counts.values()) for counts in event_stats.values())
print(f"\nOverall Total Events: {overall_total}")
print(f"Monitoring Duration: {time.time() - start_time:.1f} seconds")
print(f"Events per minute: {overall_total / ((time.time() - start_time) / 60):.2f}")
amdsmi.amdsmi_shut_down()
return event_details, event_stats
# Run comprehensive monitor
details, stats = comprehensive_event_monitor(duration=120)

Specifically monitor memory-related events:
import amdsmi
import time
def monitor_memory_events():
"""Monitor memory faults and page events."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
memory_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU
]
print("Memory Event Monitor")
print("=" * 80)
fault_count = 0
page_fault_active = {}
try:
readers = []
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, memory_events)
readers.append(reader)
print("Monitoring memory events. Press Ctrl+C to stop.\n")
while True:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=10)
for event in events:
event_type = event['event']
if event_type == 'VMFAULT':
fault_count += 1
print(f"\n!!! VM FAULT #{fault_count} on GPU {idx} !!!")
print(f" Message: {event['message']}")
print(f" Time: {time.strftime('%H:%M:%S')}")
elif event_type == 'PAGE_FAULT_START':
page_fault_active[idx] = time.time()
print(f"\nPage Fault Started on GPU {idx}")
print(f" Message: {event['message']}")
elif event_type == 'PAGE_FAULT_END':
if idx in page_fault_active:
duration = time.time() - page_fault_active[idx]
print(f"\nPage Fault Resolved on GPU {idx}")
print(f" Duration: {duration:.4f} seconds")
print(f" Message: {event['message']}")
del page_fault_active[idx]
elif event_type == 'UNMAP_FROM_GPU':
print(f"\nMemory Unmapped from GPU {idx}")
print(f" Message: {event['message']}")
time.sleep(0.5) # Poll more frequently for memory events
except KeyboardInterrupt:
print("\n\nMonitoring stopped")
print(f"Total VM faults: {fault_count}")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run memory monitor
monitor_memory_events()

Track process start and stop events:
import amdsmi
import time
from datetime import datetime
def monitor_process_lifecycle():
"""Monitor GPU process start and stop events."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
process_events = [
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print("GPU Process Lifecycle Monitor")
print("=" * 80)
print()
active_processes = {} # Track process start times
try:
readers = []
for idx, device in enumerate(devices):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
reader = amdsmi.AmdSmiEventReader(device, process_events)
readers.append((idx, bdf, reader))
print(f"Monitoring GPU {idx} ({bdf})")
print("\nPress Ctrl+C to stop and view summary\n")
process_count = 0
while True:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=10)
for event in events:
timestamp = datetime.now().strftime("%H:%M:%S")
if event['event'] == 'PROCESS_START':
process_count += 1
process_key = f"GPU{gpu_idx}_{process_count}"
active_processes[process_key] = {
'gpu': gpu_idx,
'bdf': bdf,
'start_time': time.time(),
'message': event['message']
}
print(f"[{timestamp}] Process Started on GPU {gpu_idx}")
print(f" BDF: {bdf}")
print(f" Details: {event['message']}")
print(f" Active processes: {len(active_processes)}")
print()
elif event['event'] == 'PROCESS_END':
print(f"[{timestamp}] Process Ended on GPU {gpu_idx}")
print(f" BDF: {bdf}")
print(f" Details: {event['message']}")
# Try to match with started process
matching_key = None
for key, proc in active_processes.items():
if proc['gpu'] == gpu_idx:
matching_key = key
break
if matching_key:
proc = active_processes[matching_key]
duration = time.time() - proc['start_time']
print(f" Duration: {duration:.2f} seconds")
del active_processes[matching_key]
print(f" Active processes: {len(active_processes)}")
print()
time.sleep(1)
except KeyboardInterrupt:
print("\n" + "=" * 80)
print("PROCESS LIFECYCLE SUMMARY")
print("=" * 80)
if active_processes:
print(f"\nStill active processes: {len(active_processes)}")
for key, proc in active_processes.items():
duration = time.time() - proc['start_time']
print(f" GPU {proc['gpu']}: Running for {duration:.2f} seconds")
else:
print("\nNo active processes")
finally:
for _, _, reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run process monitor
monitor_process_lifecycle()

Log all events to a file for later analysis:
import amdsmi
import time
import json
from datetime import datetime
def event_logger(output_file="gpu_events.log", duration=300):
"""Log all GPU events to a file with timestamps."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
all_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_START,
amdsmi.AmdSmiEvtNotificationType.MIGRATE_END,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_START,
amdsmi.AmdSmiEvtNotificationType.PAGE_FAULT_END,
amdsmi.AmdSmiEvtNotificationType.QUEUE_EVICTION,
amdsmi.AmdSmiEvtNotificationType.QUEUE_RESTORE,
amdsmi.AmdSmiEvtNotificationType.UNMAP_FROM_GPU,
amdsmi.AmdSmiEvtNotificationType.PROCESS_START,
amdsmi.AmdSmiEvtNotificationType.PROCESS_END
]
print(f"GPU Event Logger")
print(f"Output file: {output_file}")
print(f"Duration: {duration} seconds")
print(f"Monitoring {len(devices)} GPU(s)\n")
readers = []
event_log = []
start_time = time.time()  # set before try so the finally block can always report duration
try:
# Create readers
for idx, device in enumerate(devices):
reader = amdsmi.AmdSmiEventReader(device, all_events)
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
readers.append((idx, bdf, reader))
start_time = time.time()
event_count = 0
print("Logging events... Press Ctrl+C to stop early\n")
while time.time() - start_time < duration:
for gpu_idx, bdf, reader in readers:
events = reader.read(timestamp=0, num_elem=20)
for event in events:
event_count += 1
log_entry = {
'event_id': event_count,
'timestamp': datetime.now().isoformat(),
'elapsed_seconds': time.time() - start_time,
'gpu_index': gpu_idx,
'gpu_bdf': bdf,
'event_type': event['event'],
'message': event['message']
}
event_log.append(log_entry)
print(f"[{event_count}] {log_entry['timestamp']} - "
f"GPU {gpu_idx}: {event['event']}")
time.sleep(1)
except KeyboardInterrupt:
print("\nLogging stopped by user")
finally:
# Clean up readers
for _, _, reader in readers:
reader.stop()
# Write log to file
print(f"\nWriting {len(event_log)} events to {output_file}...")
with open(output_file, 'w') as f:
# Write metadata
metadata = {
'log_created': datetime.now().isoformat(),
'total_events': len(event_log),
'gpu_count': len(devices),
'duration_seconds': time.time() - start_time,
'events': event_log
}
json.dump(metadata, f, indent=2)
print(f"Log file written successfully")
print(f"Total events logged: {len(event_log)}")
amdsmi.amdsmi_shut_down()
return event_log
# Run logger
logged_events = event_logger(output_file="gpu_events.log", duration=120)

Monitor for critical events and trigger alerts:
import amdsmi
import time
from datetime import datetime
def critical_event_alerter(alert_callback=None):
"""Monitor critical GPU events and trigger alerts."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Define critical events
critical_events = [
amdsmi.AmdSmiEvtNotificationType.VMFAULT,
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
# Severity levels
severity_map = {
'VMFAULT': 'CRITICAL',
'THERMAL_THROTTLE': 'WARNING',
'GPU_PRE_RESET': 'CRITICAL',
'GPU_POST_RESET': 'WARNING'
}
def default_alert(event_info):
"""Default alert handler - prints to console."""
print("\n" + "!" * 80)
print(f"ALERT: {event_info['severity']} - {event_info['event_type']}")
print("!" * 80)
print(f"Time: {event_info['timestamp']}")
print(f"GPU: {event_info['gpu_index']} ({event_info['gpu_bdf']})")
print(f"Message: {event_info['message']}")
print("!" * 80 + "\n")
if alert_callback is None:
alert_callback = default_alert
print("Critical Event Alert System")
print("=" * 80)
print(f"Monitoring {len(devices)} GPU(s)")
print(f"Critical event types: {len(critical_events)}")
print("\nPress Ctrl+C to stop\n")
readers = []
alert_count = 0
try:
# Create readers
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, critical_events)
readers.append(reader)
while True:
for idx, reader in enumerate(readers):
events = reader.read(timestamp=0, num_elem=10)
for event in events:
alert_count += 1
event_type = event['event']
bdf = amdsmi.amdsmi_get_gpu_device_bdf(devices[idx])
event_info = {
'alert_id': alert_count,
'timestamp': datetime.now().isoformat(),
'gpu_index': idx,
'gpu_bdf': bdf,
'event_type': event_type,
'severity': severity_map.get(event_type, 'UNKNOWN'),
'message': event['message']
}
# Trigger alert callback
alert_callback(event_info)
time.sleep(1)
except KeyboardInterrupt:
print(f"\nAlert system stopped")
print(f"Total alerts: {alert_count}")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Example with custom alert handler
def custom_alert_handler(event_info):
"""Custom alert handler - could send email, SMS, etc."""
if event_info['severity'] == 'CRITICAL':
print(f"\n*** CRITICAL ALERT #{event_info['alert_id']} ***")
print(f"GPU {event_info['gpu_index']}: {event_info['event_type']}")
print(f"Action: Notify system administrator")
# Add your notification logic here (email, SMS, webhook, etc.)
else:
print(f"\nWarning: {event_info['event_type']} on GPU {event_info['gpu_index']}")
# Run alerter with custom handler
critical_event_alerter(alert_callback=custom_alert_handler)

Combine event monitoring with health checks:
import amdsmi
import time
def health_aware_event_monitor():
"""Monitor events and check GPU health when events occur."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
# Monitor health-related events
health_events = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.GPU_PRE_RESET,
amdsmi.AmdSmiEvtNotificationType.GPU_POST_RESET
]
readers = []
for device in devices:
reader = amdsmi.AmdSmiEventReader(device, health_events)
readers.append(reader)
print("Event-Driven GPU Health Monitor\n")
try:
while True:
for idx, reader in enumerate(readers):
device = devices[idx]
events = reader.read(timestamp=0, num_elem=5)
if events:
bdf = amdsmi.amdsmi_get_gpu_device_bdf(device)
print(f"\n--- GPU {idx} ({bdf}) Event Detected ---")
for event in events:
print(f"Event: {event['event']}")
# Perform health check
print("\nGPU Health Status:")
try:
# Check temperature
temp = amdsmi.amdsmi_get_temp_metric(
device,
amdsmi.AmdSmiTemperatureType.EDGE,
amdsmi.AmdSmiTemperatureMetric.CURRENT
)
print(f" Temperature: {temp/1000:.1f}°C")
except Exception:  # bare except would also swallow KeyboardInterrupt
print(" Temperature: N/A")
try:
# Check power
power = amdsmi.amdsmi_get_power_info(device)
print(f" Power: {power.get('average_socket_power', 0)/1000000:.1f}W")
except Exception:
print(" Power: N/A")
try:
# Check activity
activity = amdsmi.amdsmi_get_gpu_activity(device)
print(f" GFX Activity: {activity.get('gfx_activity', 0)}%")
except Exception:
print(" Activity: N/A")
print()
time.sleep(2)
except KeyboardInterrupt:
print("\nMonitoring stopped")
finally:
for reader in readers:
reader.stop()
amdsmi.amdsmi_shut_down()
# Run health monitor
health_aware_event_monitor()

Proper use of the event reader as a context manager:
import amdsmi
import time
def safe_event_monitoring():
"""Demonstrate safe event monitoring with context managers."""
amdsmi.amdsmi_init()
devices = amdsmi.amdsmi_get_processor_handles()
if not devices:
print("No GPU devices found")
amdsmi.amdsmi_shut_down()
return
device = devices[0]
# Define events to monitor
events_to_monitor = [
amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE,
amdsmi.AmdSmiEvtNotificationType.VMFAULT
]
print("Safe Event Monitoring with Context Manager\n")
# Using context manager ensures proper cleanup
with amdsmi.AmdSmiEventReader(device, events_to_monitor) as reader:
print("Event reader initialized and monitoring...")
for i in range(10): # Monitor for 10 iterations
events = reader.read(timestamp=0, num_elem=10)
if events:
print(f"\nIteration {i+1}: {len(events)} event(s) received")
for event in events:
print(f" {event['event']}: {event['message']}")
else:
print(f"Iteration {i+1}: No events")
time.sleep(1)
# Reader automatically stopped and cleaned up here
print("\nEvent reader automatically cleaned up")
amdsmi.amdsmi_shut_down()
# Run safe monitoring
safe_event_monitoring()

Notes:
- Use AmdSmiEventReader as a context manager (with statement) to ensure proper cleanup; see the sketch below
- The num_elem parameter controls the maximum number of events returned per read; adjust it based on the expected event frequency
- Call stop(), or use the context manager, to properly release event notification resources
- Larger num_elem values consume more memory
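A short polling loop that applies these practices; the one-second poll interval, 30-second duration, and num_elem of 5 are illustrative choices, not library requirements:
```python
import amdsmi
import time

amdsmi.amdsmi_init()
try:
    device = amdsmi.amdsmi_get_processor_handles()[0]
    event_types = [amdsmi.AmdSmiEvtNotificationType.THERMAL_THROTTLE]
    # The context manager guarantees notification resources are released
    # even if read() raises
    with amdsmi.AmdSmiEventReader(device, event_types) as reader:
        deadline = time.time() + 30
        while time.time() < deadline:
            # A small num_elem keeps the internal event buffer small when
            # the expected event rate is low
            for event in reader.read(timestamp=0, num_elem=5):
                print(f"{event['event']}: {event['message']}")
            time.sleep(1)
finally:
    amdsmi.amdsmi_shut_down()
```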