# Event Monitoring

Asynchronous event notification system for GPU state changes, thermal events, and error conditions. The event system allows applications to monitor GPU status without continuous polling.

## Capabilities

### Event Notification Initialization

Initialize event notification system for a specific GPU processor.

```c { .api }
amdsmi_status_t amdsmi_init_gpu_event_notification(amdsmi_processor_handle processor_handle);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor to monitor

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_status_t ret = amdsmi_init_gpu_event_notification(processor);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event notifications initialized for GPU\n");
}
```

### Event Mask Configuration

Configure which types of events to monitor by setting an event notification mask.

```c { .api }
amdsmi_status_t amdsmi_set_gpu_event_notification_mask(amdsmi_processor_handle processor_handle, uint64_t mask);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor
- `mask`: Bitmask of event types to monitor, one bit per `amdsmi_evt_notification_type_t` value

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Monitor thermal and reset events.
// The enumerators are sequential indices, not bit flags, so each one is
// converted to its mask bit with AMDSMI_EVENT_MASK_FROM_INDEX (amdsmi.h).
uint64_t event_mask = AMDSMI_EVENT_MASK_FROM_INDEX(AMDSMI_EVT_NOTIF_THERMAL_THROTTLE) |
                      AMDSMI_EVENT_MASK_FROM_INDEX(AMDSMI_EVT_NOTIF_GPU_PRE_RESET) |
                      AMDSMI_EVENT_MASK_FROM_INDEX(AMDSMI_EVT_NOTIF_GPU_POST_RESET);

amdsmi_status_t ret = amdsmi_set_gpu_event_notification_mask(processor, event_mask);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event mask configured\n");
}
```
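
The enumerators listed under Types are sequential indices, so a mask needs one bit per event rather than the raw enum values OR-ed together. The sketch below shows that conversion in isolation. `EvtNotif` and `event_mask` are hypothetical names that mirror the C enum locally, and the mapping of index `i` to bit `i - 1` is an assumption based on the convention of the `AMDSMI_EVENT_MASK_FROM_INDEX` macro; check it against the headers of the AMD SMI version in use.

```python
from enum import IntEnum


class EvtNotif(IntEnum):
    """Local mirror of the event enum listed under Types (hypothetical name)."""
    NONE = 0
    VMFAULT = 1
    THERMAL_THROTTLE = 2
    GPU_PRE_RESET = 3
    GPU_POST_RESET = 4
    RING_HANG = 5


def event_mask(*events):
    """Return a bitmask with one bit per event (assumed: index i -> bit i - 1)."""
    mask = 0
    for evt in events:
        if evt == EvtNotif.NONE:
            raise ValueError("NONE has no mask bit")
        mask |= 1 << (int(evt) - 1)
    return mask


mask = event_mask(
    EvtNotif.THERMAL_THROTTLE,
    EvtNotif.GPU_PRE_RESET,
    EvtNotif.GPU_POST_RESET,
)
print(f"mask = {mask:#06b}")  # mask = 0b1110
```

OR-ing the raw indices 2, 3, and 4 would give 7, which under the same assumption selects VMFAULT, THERMAL_THROTTLE, and GPU_PRE_RESET instead of the intended events.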

### Event Retrieval

Get pending event notifications with optional timeout.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_event_notification(int timeout_ms, uint32_t *num_elem, amdsmi_evt_notification_data_t *data);
```

**Parameters:**
- `timeout_ms`: Timeout in milliseconds (-1 for blocking, 0 for non-blocking)
- `num_elem`: As input, maximum number of events to retrieve. As output, actual number retrieved.
- `data`: Pointer to array of event notification structures

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_evt_notification_data_t events[10];
uint32_t num_events = 10;  // In: capacity of events[]. Out: number of events written.

// Wait up to 5 seconds for events
amdsmi_status_t ret = amdsmi_get_gpu_event_notification(5000, &num_events, events);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Received %u events\n", num_events);

    for (uint32_t i = 0; i < num_events; i++) {
        // %p already prints an implementation-defined prefix and expects a void *
        printf("Event %u: Type=%u, Processor=%p\n",
               i, (unsigned)events[i].event, (void *)events[i].processor_handle);
    }
}
```
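
Because the caller fixes the maximum number of events per call, a completely full batch suggests that more events may still be pending. The following sketch waits for the first batch and then drains the rest with non-blocking calls. `drain_events` and `make_fake_source` are hypothetical helpers; the getter is passed in so the loop can run without a GPU, and it is assumed to take `(timeout_ms, max_events)` and to return an empty list when nothing is pending.

```python
def drain_events(get_events, first_timeout_ms=5000, max_events=10):
    """Wait for one batch, then keep reading without blocking while batches are full."""
    batch = get_events(first_timeout_ms, max_events)
    collected = list(batch)
    # A full batch means the buffer limit was hit, so more events may be queued.
    while len(batch) == max_events:
        batch = get_events(0, max_events)  # timeout 0: return immediately
        collected.extend(batch)
    return collected


def make_fake_source(pending):
    """Build a stand-in getter that hands out queued events and records timeouts."""
    queue = list(pending)
    calls = []

    def get_events(timeout_ms, max_events):
        calls.append(timeout_ms)
        batch, queue[:] = queue[:max_events], queue[max_events:]
        return batch

    return get_events, calls


get_events, calls = make_fake_source(
    [{"event_type": "VMFAULT", "id": i} for i in range(7)]
)
events = drain_events(get_events, first_timeout_ms=5000, max_events=3)
print(len(events), calls)  # 7 [5000, 0, 0]
```

Only the first call blocks, so the loop never waits for the full timeout more than once per drain.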

### Event Monitoring Shutdown

Stop event monitoring and clean up resources for a GPU processor.

```c { .api }
amdsmi_status_t amdsmi_stop_gpu_event_notification(amdsmi_processor_handle processor_handle);
```

**Parameters:**
- `processor_handle`: Handle to the GPU processor

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_status_t ret = amdsmi_stop_gpu_event_notification(processor);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event notifications stopped for GPU\n");
}
```

## Python API

### Event Notification Management

```python { .api }
def amdsmi_init_gpu_event_notification(processor_handle):
    """
    Initialize event notifications for a GPU.

    Args:
        processor_handle: GPU processor handle

    Raises:
        AmdSmiException: If initialization fails
    """

def amdsmi_set_gpu_event_notification_mask(processor_handle, mask):
    """
    Set event notification mask.

    Args:
        processor_handle: GPU processor handle
        mask (int): Bitmask of event types to monitor

    Raises:
        AmdSmiException: If mask setting fails
    """

def amdsmi_get_gpu_event_notification(timeout_ms, max_events=10):
    """
    Get pending event notifications.

    Args:
        timeout_ms (int): Timeout in milliseconds (-1 for blocking)
        max_events (int): Maximum number of events to retrieve

    Returns:
        list: List of event notification dictionaries

    Raises:
        AmdSmiException: If event retrieval fails
    """

def amdsmi_stop_gpu_event_notification(processor_handle):
    """
    Stop event notifications for a GPU.

    Args:
        processor_handle: GPU processor handle

    Raises:
        AmdSmiException: If shutdown fails
    """
```

**Python Usage Example:**

```python
import amdsmi
from amdsmi import AmdSmiEventType

# Initialize and get GPU
amdsmi.amdsmi_init()
try:
    sockets = amdsmi.amdsmi_get_socket_handles()
    processors = amdsmi.amdsmi_get_processor_handles(sockets[0])
    gpu = processors[0]

    # Initialize event monitoring
    amdsmi.amdsmi_init_gpu_event_notification(gpu)
    try:
        # Set event mask for thermal and reset events
        event_mask = (AmdSmiEventType.THERMAL_THROTTLE |
                      AmdSmiEventType.GPU_PRE_RESET |
                      AmdSmiEventType.GPU_POST_RESET)
        amdsmi.amdsmi_set_gpu_event_notification_mask(gpu, event_mask)

        # Monitor for events (5 second timeout)
        events = amdsmi.amdsmi_get_gpu_event_notification(5000, max_events=10)

        for event in events:
            print(f"Event: {event['event_type']} on GPU {event['processor_handle']}")
            print(f"  Message: {event['message']}")
    finally:
        # Stop notifications only if they were initialized successfully
        amdsmi.amdsmi_stop_gpu_event_notification(gpu)
finally:
    # Clean up the library even if handle lookup or monitoring failed
    amdsmi.amdsmi_shut_down()
```
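
The init, mask, and stop sequence can also be packaged so that cleanup runs on every exit path. The sketch below is one way to do that with a context manager. `gpu_event_notifications` and `FakeApi` are hypothetical names: the API object is passed in (for real use it would be the `amdsmi` module, assuming it exposes the functions listed above), and a recording stand-in is used here so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def gpu_event_notifications(api, gpu, mask):
    """Initialize notifications, apply the mask, and always stop on exit."""
    api.amdsmi_init_gpu_event_notification(gpu)
    try:
        api.amdsmi_set_gpu_event_notification_mask(gpu, mask)
        yield gpu
    finally:
        # Runs on normal exit and when the body or the mask call raises.
        api.amdsmi_stop_gpu_event_notification(gpu)


class FakeApi:
    """Stand-in that records which calls were made, in order."""

    def __init__(self):
        self.calls = []

    def amdsmi_init_gpu_event_notification(self, gpu):
        self.calls.append("init")

    def amdsmi_set_gpu_event_notification_mask(self, gpu, mask):
        self.calls.append("mask")

    def amdsmi_stop_gpu_event_notification(self, gpu):
        self.calls.append("stop")


fake = FakeApi()
try:
    with gpu_event_notifications(fake, gpu="gpu0", mask=0b1110):
        raise RuntimeError("simulated failure while monitoring")
except RuntimeError:
    pass

print(fake.calls)  # ['init', 'mask', 'stop']
```

Stop is not called when initialization itself raises, since there is nothing to release at that point.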

## Types

### Event Notification Data

```c { .api }
typedef struct {
    amdsmi_processor_handle processor_handle;  // GPU that generated the event
    amdsmi_evt_notification_type_t event;      // Event type
    char message[AMDSMI_MAX_STRING_LENGTH];    // Event message
    uint64_t timestamp;                        // Event timestamp
} amdsmi_evt_notification_data_t;
```

### Event Notification Types

```c { .api }
typedef enum {
    AMDSMI_EVT_NOTIF_NONE = 0,          // No event
    AMDSMI_EVT_NOTIF_VMFAULT,           // VM page fault
    AMDSMI_EVT_NOTIF_THERMAL_THROTTLE,  // Thermal throttling
    AMDSMI_EVT_NOTIF_GPU_PRE_RESET,     // GPU pre-reset warning
    AMDSMI_EVT_NOTIF_GPU_POST_RESET,    // GPU post-reset notification
    AMDSMI_EVT_NOTIF_RING_HANG,         // GPU ring hang
    AMDSMI_EVT_NOTIF_MAX                // Maximum event type
} amdsmi_evt_notification_type_t;
```

## Event Types

### Thermal Events
- **THERMAL_THROTTLE**: GPU is reducing performance due to thermal limits

### Reset Events
- **GPU_PRE_RESET**: GPU is about to be reset (warning)
- **GPU_POST_RESET**: GPU has completed reset operation

### Error Events
- **VMFAULT**: Virtual memory page fault occurred
- **RING_HANG**: GPU command ring has stopped responding
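
The grouping above can be applied to retrieved events with a small lookup table. This is a sketch only: `EVENT_CATEGORIES` and `group_by_category` are hypothetical names, and the `event_type` key is assumed to hold the short event name as a string, which may not match what a given AMD SMI version returns.

```python
# Category for each event name, following the Thermal / Reset / Error grouping.
EVENT_CATEGORIES = {
    "THERMAL_THROTTLE": "thermal",
    "GPU_PRE_RESET": "reset",
    "GPU_POST_RESET": "reset",
    "VMFAULT": "error",
    "RING_HANG": "error",
}


def group_by_category(events):
    """Group event dictionaries by category, preserving their delivery order."""
    grouped = {}
    for event in events:
        category = EVENT_CATEGORIES.get(event["event_type"], "unknown")
        grouped.setdefault(category, []).append(event)
    return grouped


sample = [
    {"event_type": "THERMAL_THROTTLE"},
    {"event_type": "GPU_PRE_RESET"},
    {"event_type": "GPU_POST_RESET"},
    {"event_type": "RING_HANG"},
]
grouped = group_by_category(sample)
counts = {category: len(items) for category, items in grouped.items()}
print(counts)  # {'thermal': 1, 'reset': 2, 'error': 1}
```

Unrecognized names fall into an `unknown` bucket instead of raising, so events added by newer library versions are not silently dropped.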

## Important Notes

1. **Initialization Required**: Event monitoring must be initialized per GPU before use.

2. **Event Mask**: Only events specified in the mask will be delivered to the application.

3. **Timeout Behavior**:
   - `-1`: Block until events are available
   - `0`: Return immediately (non-blocking)
   - `>0`: Wait specified milliseconds for events

4. **Thread Safety**: Event functions are thread-safe but should be called from the same thread that initialized the library.

5. **Resource Cleanup**: Always call `amdsmi_stop_gpu_event_notification()` to free resources.

6. **Event Ordering**: Events are delivered in chronological order but multiple events may be batched together.

7. **Performance Impact**: Event monitoring has minimal performance overhead when properly configured.