
# Event Monitoring

Asynchronous event notification system for GPU state changes, thermal events, and error conditions. The event system allows applications to monitor GPU status without continuous polling.

## Capabilities

### Event Notification Initialization

Initialize event notification system for a specific GPU processor.

```c { .api }
amdsmi_status_t amdsmi_init_gpu_event_notification(amdsmi_processor_handle processor_handle);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor to monitor

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_status_t ret = amdsmi_init_gpu_event_notification(processor);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event notifications initialized for GPU\n");
}
```

### Event Mask Configuration

Configure which types of events to monitor by setting an event notification mask.

```c { .api }
amdsmi_status_t amdsmi_set_gpu_event_notification_mask(amdsmi_processor_handle processor_handle, uint64_t mask);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor
- `mask`: Bitmask of event types to monitor. The bit for each event is derived from its `amdsmi_evt_notification_type_t` value, with bit positions starting at 1.

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
// Monitor thermal and reset events.
// The enum values are bit positions (starting at 1), not flags,
// so each one is shifted into place before OR'ing.
uint64_t event_mask = (1ULL << (AMDSMI_EVT_NOTIF_THERMAL_THROTTLE - 1)) |
                      (1ULL << (AMDSMI_EVT_NOTIF_GPU_PRE_RESET - 1)) |
                      (1ULL << (AMDSMI_EVT_NOTIF_GPU_POST_RESET - 1));

amdsmi_status_t ret = amdsmi_set_gpu_event_notification_mask(processor, event_mask);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event mask configured\n");
}
```
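The members of `amdsmi_evt_notification_type_t` listed under Types are consecutive integers rather than single-bit flags, so a mask is typically built by shifting each value into place. The following is a minimal sketch of that arithmetic, assuming an event with value `n` maps to bit `n - 1`. `EventType`, `event_mask`, and `events_in_mask` are hypothetical local stand-ins that mirror the enum in this document so the sketch runs without the `amdsmi` package; if a binding already exposes flag-style values, no shifting is needed.

```python
from enum import IntEnum
from typing import List


class EventType(IntEnum):
    """Local stand-in mirroring the enum listed in this document (not from amdsmi)."""
    NONE = 0
    VMFAULT = 1
    THERMAL_THROTTLE = 2
    GPU_PRE_RESET = 3
    GPU_POST_RESET = 4
    RING_HANG = 5


def event_mask(*event_types: EventType) -> int:
    """Build a notification mask, assuming event value n maps to bit n - 1."""
    mask = 0
    for event_type in event_types:
        if event_type == EventType.NONE:
            continue  # NONE has no bit of its own
        mask |= 1 << (event_type - 1)
    return mask


def events_in_mask(mask: int) -> List[EventType]:
    """Decode a mask back into the event types it selects."""
    return [t for t in EventType
            if t != EventType.NONE and mask & (1 << (t - 1))]


mask = event_mask(EventType.THERMAL_THROTTLE,
                  EventType.GPU_PRE_RESET,
                  EventType.GPU_POST_RESET)
print(hex(mask))  # 0xe
```

Note that OR'ing the raw values (`2 | 3 | 4`) would give `7`, which under the same assumption selects VMFAULT, THERMAL_THROTTLE, and GPU_PRE_RESET instead.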

### Event Retrieval

Get pending event notifications with optional timeout.

```c { .api }
amdsmi_status_t amdsmi_get_gpu_event_notification(int timeout_ms, uint32_t *num_elem, amdsmi_evt_notification_data_t *data);
```

**Parameters:**

- `timeout_ms`: Timeout in milliseconds (-1 for blocking, 0 for non-blocking)
- `num_elem`: As input, maximum number of events to retrieve. As output, actual number retrieved.
- `data`: Pointer to array of event notification structures

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_evt_notification_data_t events[10];
uint32_t num_events = 10;  // in: capacity of events[]; out: number filled in

// Wait up to 5 seconds for events
amdsmi_status_t ret = amdsmi_get_gpu_event_notification(5000, &num_events, events);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Received %u events\n", num_events);

    for (uint32_t i = 0; i < num_events; i++) {
        // %p already prints an implementation-defined prefix, so no literal "0x"
        printf("Event %u: Type=%u, Processor=%p\n",
               i, (unsigned)events[i].event, (void *)events[i].processor_handle);
    }
}
```
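Because `num_elem` caps how many events one call can return, a completely full batch may mean more events are still queued. The sketch below shows one way to drain them. It is not part of the library: `fetch` is a hypothetical callable standing in for the retrieval call, an empty result is assumed to come back as an empty list rather than an exception, and `fake_fetch` exists only so the example runs without a GPU.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Fetch = Callable[[int, int], List[Event]]  # (timeout_ms, max_events) -> events


def drain_events(fetch: Fetch, timeout_ms: int, max_events: int = 10) -> List[Event]:
    """Collect events until a call returns fewer than max_events.

    The first call waits up to timeout_ms; follow-up calls use 0
    (non-blocking) so the loop only picks up what is already queued.
    """
    collected: List[Event] = []
    wait = timeout_ms
    while True:
        batch = fetch(wait, max_events)
        collected.extend(batch)
        if len(batch) < max_events:
            return collected
        wait = 0  # buffer was full: poll again without blocking


# Usage with a fake source holding 25 queued events.
queue = [{"event_type": "THERMAL_THROTTLE", "message": f"event {i}"}
         for i in range(25)]


def fake_fetch(timeout_ms: int, max_events: int) -> List[Event]:
    """Hand out up to max_events items from the fake queue."""
    batch = queue[:max_events]
    del queue[:max_events]
    return batch


events = drain_events(fake_fetch, timeout_ms=5000, max_events=10)
print(len(events))  # 25
```

Switching to a zero timeout after the first call keeps the total wait bounded by the caller's original timeout.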

### Event Monitoring Shutdown

Stop event monitoring and clean up resources for a GPU processor.

```c { .api }
amdsmi_status_t amdsmi_stop_gpu_event_notification(amdsmi_processor_handle processor_handle);
```

**Parameters:**

- `processor_handle`: Handle to the GPU processor

**Returns:** `amdsmi_status_t` - AMDSMI_STATUS_SUCCESS on success, error code on failure

**Usage Example:**

```c
amdsmi_status_t ret = amdsmi_stop_gpu_event_notification(processor);
if (ret == AMDSMI_STATUS_SUCCESS) {
    printf("Event notifications stopped for GPU\n");
}
```

## Python API

### Event Notification Management

```python { .api }
def amdsmi_init_gpu_event_notification(processor_handle):
    """
    Initialize event notifications for a GPU.

    Args:
        processor_handle: GPU processor handle

    Raises:
        AmdSmiException: If initialization fails
    """

def amdsmi_set_gpu_event_notification_mask(processor_handle, mask):
    """
    Set event notification mask.

    Args:
        processor_handle: GPU processor handle
        mask (int): Bitmask of event types to monitor

    Raises:
        AmdSmiException: If mask setting fails
    """

def amdsmi_get_gpu_event_notification(timeout_ms, max_events=10):
    """
    Get pending event notifications.

    Args:
        timeout_ms (int): Timeout in milliseconds (-1 for blocking)
        max_events (int): Maximum number of events to retrieve

    Returns:
        list: List of event notification dictionaries

    Raises:
        AmdSmiException: If event retrieval fails
    """

def amdsmi_stop_gpu_event_notification(processor_handle):
    """
    Stop event notifications for a GPU.

    Args:
        processor_handle: GPU processor handle

    Raises:
        AmdSmiException: If shutdown fails
    """
```

**Python Usage Example:**

```python
import amdsmi
from amdsmi import AmdSmiEventType

# Initialize and get GPU
amdsmi.amdsmi_init()
gpu = None  # stays None if handle lookup fails, so cleanup can tell
try:
    sockets = amdsmi.amdsmi_get_socket_handles()
    processors = amdsmi.amdsmi_get_processor_handles(sockets[0])
    gpu = processors[0]

    # Initialize event monitoring
    amdsmi.amdsmi_init_gpu_event_notification(gpu)

    # Set event mask for thermal and reset events
    event_mask = (AmdSmiEventType.THERMAL_THROTTLE |
                  AmdSmiEventType.GPU_PRE_RESET |
                  AmdSmiEventType.GPU_POST_RESET)
    amdsmi.amdsmi_set_gpu_event_notification_mask(gpu, event_mask)

    # Monitor for events (5 second timeout)
    events = amdsmi.amdsmi_get_gpu_event_notification(5000, max_events=10)

    for event in events:
        print(f"Event: {event['event_type']} on GPU {event['processor_handle']}")
        print(f"  Message: {event['message']}")

finally:
    # Clean up (skip the stop call if no GPU handle was obtained)
    if gpu is not None:
        amdsmi.amdsmi_stop_gpu_event_notification(gpu)
    amdsmi.amdsmi_shut_down()
```

## Types

### Event Notification Data

```c { .api }
typedef struct {
    amdsmi_processor_handle processor_handle;   // GPU that generated the event
    amdsmi_evt_notification_type_t event;       // Event type
    char message[AMDSMI_MAX_STRING_LENGTH];     // Event message
    uint64_t timestamp;                         // Event timestamp
} amdsmi_evt_notification_data_t;
```

### Event Notification Types

```c { .api }
typedef enum {
    AMDSMI_EVT_NOTIF_NONE = 0,           // No event
    AMDSMI_EVT_NOTIF_VMFAULT,            // VM page fault
    AMDSMI_EVT_NOTIF_THERMAL_THROTTLE,   // Thermal throttling
    AMDSMI_EVT_NOTIF_GPU_PRE_RESET,      // GPU pre-reset warning
    AMDSMI_EVT_NOTIF_GPU_POST_RESET,     // GPU post-reset notification
    AMDSMI_EVT_NOTIF_RING_HANG,          // GPU ring hang
    AMDSMI_EVT_NOTIF_MAX                 // Maximum event type
} amdsmi_evt_notification_type_t;
```

## Event Types

### Thermal Events

- **THERMAL_THROTTLE**: GPU is reducing performance due to thermal limits

### Reset Events

- **GPU_PRE_RESET**: GPU is about to be reset (warning)
- **GPU_POST_RESET**: GPU has completed reset operation

### Error Events

- **VMFAULT**: Virtual memory page fault occurred
- **RING_HANG**: GPU command ring has stopped responding
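
The grouping above can be carried into application code as a lookup table, for example to route events to different handlers. This is a small illustrative sketch: `EVENT_CATEGORY` and `group_by_category` are hypothetical names, and the string keys are the short event names used in this document rather than values from the `amdsmi` package.

```python
from typing import Dict, List

# Category names follow the headings above (thermal, reset, error).
EVENT_CATEGORY: Dict[str, str] = {
    "THERMAL_THROTTLE": "thermal",
    "GPU_PRE_RESET": "reset",
    "GPU_POST_RESET": "reset",
    "VMFAULT": "error",
    "RING_HANG": "error",
}


def group_by_category(event_names: List[str]) -> Dict[str, List[str]]:
    """Bucket event names by category; unrecognized names go under 'unknown'."""
    grouped: Dict[str, List[str]] = {}
    for name in event_names:
        category = EVENT_CATEGORY.get(name, "unknown")
        grouped.setdefault(category, []).append(name)
    return grouped


print(group_by_category(["VMFAULT", "GPU_PRE_RESET", "GPU_POST_RESET"]))
# {'error': ['VMFAULT'], 'reset': ['GPU_PRE_RESET', 'GPU_POST_RESET']}
```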

## Important Notes

1. **Initialization Required**: Event monitoring must be initialized per GPU before use.

2. **Event Mask**: Only events specified in the mask will be delivered to the application.

3. **Timeout Behavior**:
   - `-1`: Block until events are available
   - `0`: Return immediately (non-blocking)
   - `>0`: Wait specified milliseconds for events

4. **Thread Safety**: Event functions are thread-safe but should be called from the same thread that initialized the library.

5. **Resource Cleanup**: Always call `amdsmi_stop_gpu_event_notification()` to free resources.

6. **Event Ordering**: Events are delivered in chronological order but multiple events may be batched together.

7. **Performance Impact**: Event monitoring has minimal performance overhead when properly configured.
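
Notes 1 and 5 together describe a lifecycle (initialize, configure, poll, always stop) that maps naturally onto a context manager. The following is a minimal sketch under stated assumptions: `event_session` is a hypothetical helper, and the `init`, `set_mask`, and `stop` arguments are injected callables standing in for the corresponding library functions, so the example runs with recording fakes instead of a real GPU.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def event_session(handle: object,
                  mask: int,
                  init: Callable[[object], None],
                  set_mask: Callable[[object, int], None],
                  stop: Callable[[object], None]) -> Iterator[object]:
    """Initialize, configure, and always stop event notifications for one GPU."""
    init(handle)
    try:
        set_mask(handle, mask)
        yield handle
    finally:
        stop(handle)  # runs even if set_mask or the with-body raises


# Usage with recording fakes in place of the real calls.
calls: List[str] = []

with event_session("gpu0", 0xE,
                   init=lambda h: calls.append("init"),
                   set_mask=lambda h, m: calls.append(f"mask={m:#x}"),
                   stop=lambda h: calls.append("stop")):
    calls.append("poll")

print(calls)  # ['init', 'mask=0xe', 'poll', 'stop']
```

The stop call sits in `finally` only after `init` has succeeded, so a failed initialization does not trigger a stop for a session that never started.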