# Flink External Resource GPU Driver

The Flink External Resource GPU Driver provides GPU resource discovery for Apache Flink streaming and batch processing jobs. It implements Flink's `ExternalResourceDriver` interface so that GPU devices on cluster nodes can be discovered and exposed to jobs through a configurable discovery script.

## Package Information

- **Package Name**: org.apache.flink:flink-external-resource-gpu
- **Package Type**: maven
- **Language**: Java
- **Installation**: Add a Maven dependency with groupId `org.apache.flink` and artifactId `flink-external-resource-gpu`

## Core Imports

```java
import org.apache.flink.externalresource.gpu.GPUDriverFactory;
import org.apache.flink.externalresource.gpu.GPUDriverOptions;
import org.apache.flink.externalresource.gpu.GPUInfo;
import org.apache.flink.configuration.Configuration;
```

## Basic Usage

```java
import org.apache.flink.api.common.externalresource.ExternalResourceDriver;
import org.apache.flink.api.common.externalresource.ExternalResourceInfo;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.externalresource.gpu.GPUDriverFactory;
import org.apache.flink.externalresource.gpu.GPUDriverOptions;

import java.util.Set;

public class GpuDiscoveryExample {
    public static void main(String[] args) throws Exception {
        // Configure GPU discovery
        Configuration config = new Configuration();
        config.set(GPUDriverOptions.DISCOVERY_SCRIPT_PATH, "/path/to/gpu-discovery-script.sh");
        config.set(GPUDriverOptions.DISCOVERY_SCRIPT_ARG, "--device-type nvidia");

        // Create GPU driver through factory (fails if the script is missing or not executable)
        GPUDriverFactory factory = new GPUDriverFactory();
        ExternalResourceDriver driver = factory.createExternalResourceDriver(config);

        // Discover GPU resources; the interface returns a wildcard-typed set
        Set<? extends ExternalResourceInfo> gpuResources = driver.retrieveResourceInfo(2L); // Request 2 GPUs

        // Use GPU information
        for (ExternalResourceInfo gpu : gpuResources) {
            // Get GPU device index (GPUInfo always provides the "index" property)
            String deviceIndex = gpu.getProperty("index").orElse("unknown");
            System.out.println("Available GPU " + deviceIndex + ": " + gpu); // e.g., "Available GPU 0: GPU Device(0)"
        }
    }
}
```

## Architecture

The GPU driver is built around several key components:

- **GPUDriverFactory**: Factory for creating GPU driver instances from configuration
- **GPUDriver**: Main driver implementation that executes discovery scripts and manages GPU resources
- **GPUInfo**: Value object representing individual GPU devices with their properties
- **GPUDriverOptions**: Configuration options for discovery script path and arguments
- **Discovery Script Integration**: Executes external scripts to detect available GPU hardware

## Capabilities

### GPU Driver Factory

Factory for creating GPU driver instances with proper configuration validation.

```java { .api }
/**
 * Factory for creating GPU driver instances
 */
public class GPUDriverFactory implements ExternalResourceDriverFactory {
    /**
     * Creates an external resource driver for GPU management
     * @param config Configuration containing GPU discovery settings
     * @return ExternalResourceDriver instance for GPU resources
     * @throws Exception if configuration is invalid or driver creation fails
     */
    public ExternalResourceDriver createExternalResourceDriver(Configuration config) throws Exception;
}
```

### GPU Information

Represents individual GPU device information including device indices and properties.

```java { .api }
/**
 * Information container for GPU resource, currently including the GPU index
 * Note: Constructor is package-private, instances created through GPUDriver.retrieveResourceInfo()
 */
public class GPUInfo implements ExternalResourceInfo {

    /**
     * Gets property value by key
     * @param key Property key to retrieve (supports "index")
     * @return Optional containing property value, or empty if key not found
     */
    public Optional<String> getProperty(String key);

    /**
     * Gets all available property keys
     * @return Collection of available property keys (currently only "index")
     */
    public Collection<String> getKeys();

    /**
     * String representation of GPU device
     * @return Formatted string like "GPU Device(0)"
     */
    public String toString();

    /**
     * Hash code based on GPU index
     * @return Hash code for this GPU info
     */
    public int hashCode();

    /**
     * Equality comparison based on GPU index
     * @param obj Object to compare
     * @return true if objects represent same GPU device
     */
    public boolean equals(Object obj);
}
```

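The contract described above can be illustrated with a minimal stand-in. This is a sketch, not Flink's class: `GpuInfoSketch` is a hypothetical name, and the real `GPUInfo` additionally implements `ExternalResourceInfo`. It shows how index-based `equals` and `hashCode` make two entries with the same index collapse into one element of a `Set`.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for GPUInfo; it mirrors only the documented behavior.
final class GpuInfoSketch {
    private static final String PROPERTY_KEY_INDEX = "index";
    private final String index;

    GpuInfoSketch(String index) {
        this.index = Objects.requireNonNull(index);
    }

    // Only the "index" key is known; any other key yields Optional.empty().
    Optional<String> getProperty(String key) {
        return PROPERTY_KEY_INDEX.equals(key) ? Optional.of(index) : Optional.empty();
    }

    Collection<String> getKeys() {
        return Collections.singleton(PROPERTY_KEY_INDEX);
    }

    @Override
    public String toString() {
        return String.format("GPU Device(%s)", index);
    }

    @Override
    public int hashCode() {
        return index.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof GpuInfoSketch && index.equals(((GpuInfoSketch) obj).index);
    }

    public static void main(String[] args) {
        Set<GpuInfoSketch> gpus = new HashSet<>();
        gpus.add(new GpuInfoSketch("0"));
        gpus.add(new GpuInfoSketch("0")); // same index, treated as the same device
        gpus.add(new GpuInfoSketch("1"));
        System.out.println(gpus.size()); // prints 2
    }
}
```
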
### GPU Driver Configuration

Configuration options for GPU discovery script path and arguments.

```java { .api }
/**
 * Configuration options for GPU driver
 */
@PublicEvolving
public class GPUDriverOptions {
    /**
     * Configuration option for discovery script path
     * Key: "discovery-script.path"
     * Default: "plugins/external-resource-gpu/nvidia-gpu-discovery.sh" (DEFAULT_FLINK_PLUGINS_DIRS + "/external-resource-gpu/nvidia-gpu-discovery.sh")
     * Description: Path to GPU discovery script (absolute or relative to FLINK_HOME)
     */
    public static final ConfigOption<String> DISCOVERY_SCRIPT_PATH;

    /**
     * Configuration option for discovery script arguments
     * Key: "discovery-script.args"
     * Default: No default value
     * Description: Arguments passed to the discovery script
     */
    public static final ConfigOption<String> DISCOVERY_SCRIPT_ARG;
}
```

### GPU Resource Discovery

Core functionality for discovering and retrieving GPU resources through configurable scripts.

```java { .api }
/**
 * Driver for GPU resource discovery and management
 * Implements ExternalResourceDriver interface for Flink integration
 * Note: Constructor is package-private, instances created through GPUDriverFactory.
 * Construction validates the script and throws IllegalConfigurationException (path not
 * configured), FileNotFoundException (script file missing), or FlinkException (not executable).
 */
class GPUDriver implements ExternalResourceDriver {

    /**
     * Discovers and retrieves GPU resources by executing discovery script
     * @param gpuAmount Number of GPUs to discover (must be > 0)
     * @return Unmodifiable set of GPUInfo objects representing discovered GPUs
     * @throws IllegalArgumentException if gpuAmount <= 0
     * @throws TimeoutException if discovery script times out (10 second limit)
     * @throws FlinkException if discovery script exits with non-zero code
     */
    public Set<GPUInfo> retrieveResourceInfo(long gpuAmount) throws Exception;
}
```

## Implementation Details

The GPU driver uses a 10-second timeout for discovery script execution (defined by the private constant `DISCOVERY_SCRIPT_TIMEOUT_MS = 10000L`) and identifies each GPU device by the `"index"` property key. Failures of the discovery script are reported through exceptions and log messages to help debug script execution issues.

Logging behavior:
- Successfully discovered GPU resources are logged at INFO level
- Script execution warnings (non-zero exit, multiple output lines) are logged at WARN level with stdout/stderr details

During parsing, empty and whitespace-only indices are filtered out automatically.

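The parsing rule above (split on commas, drop empty or whitespace-only entries, trim the rest) can be sketched with the standard library alone. `GpuIndexParser` and `parseIndices` are hypothetical names used for illustration; the driver itself wraps each index in a `GPUInfo` rather than returning plain strings.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; illustrates the documented parsing rule only.
final class GpuIndexParser {
    // Splits one line of script output on commas, drops blank entries, trims the rest.
    static Set<String> parseIndices(String firstLine) {
        Set<String> indices = new LinkedHashSet<>();
        if (firstLine == null || firstLine.isEmpty()) {
            return Collections.unmodifiableSet(indices);
        }
        for (String raw : firstLine.split(",")) {
            String index = raw.trim();
            if (!index.isEmpty()) {
                indices.add(index);
            }
        }
        return Collections.unmodifiableSet(indices);
    }

    public static void main(String[] args) {
        System.out.println(parseIndices("0, 1, ,2")); // prints [0, 1, 2]
        System.out.println(parseIndices("   "));      // prints []
    }
}
```

Returning an unmodifiable set mirrors the documented return value of `retrieveResourceInfo`; the insertion-ordered `LinkedHashSet` is a choice made here only to keep the printed output predictable.
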
## Types

```java { .api }
// External dependencies from flink-core
interface ExternalResourceDriver {
    Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount) throws Exception;
}

interface ExternalResourceDriverFactory {
    ExternalResourceDriver createExternalResourceDriver(Configuration config) throws Exception;
}

interface ExternalResourceInfo {
    Optional<String> getProperty(String key);
    Collection<String> getKeys();
}

// Configuration types
class Configuration {
    <T> T get(ConfigOption<T> option);
    <T> Configuration set(ConfigOption<T> option, T value); // returns this for chaining
}

class ConfigOption<T> {
    String key();
}
```

## Error Handling

The GPU driver throws specific exceptions for different error conditions:

- **IllegalConfigurationException**: Thrown when the discovery script path is not configured or is whitespace-only
- **FileNotFoundException**: Thrown when the specified discovery script file does not exist
- **FlinkException**: Thrown when the discovery script is not executable or exits with a non-zero return code
- **IllegalArgumentException**: Thrown when the gpuAmount parameter is <= 0
- **TimeoutException**: Thrown when discovery script execution exceeds the 10-second timeout

Configuration and script validation during driver initialization:
- A discovery script path that is not absolute is resolved relative to FLINK_HOME (or the current directory if FLINK_HOME is not set)
- Script file existence and executable permissions are verified during GPUDriver construction
- If the args configuration is not provided, it defaults to null (passed as the string "null" to the discovery script)

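The path-resolution rule in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the driver's code: `ScriptPathResolver` and the path `bin/gpu-discovery.sh` are hypothetical, and the sketch throws `IllegalArgumentException` where the driver is documented to throw `IllegalConfigurationException`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; shows only the documented resolution rule.
final class ScriptPathResolver {
    // Resolves a configured script path; flinkHome may be null when FLINK_HOME is unset.
    static Path resolve(String configuredPath, String flinkHome) {
        if (configuredPath == null || configuredPath.trim().isEmpty()) {
            throw new IllegalArgumentException("Discovery script path is not configured");
        }
        Path path = Paths.get(configuredPath);
        if (path.isAbsolute()) {
            return path;
        }
        // Relative paths are anchored at FLINK_HOME, or at the current directory as a fallback.
        return Paths.get(flinkHome != null ? flinkHome : ".", configuredPath);
    }

    public static void main(String[] args) {
        Path resolved = resolve("bin/gpu-discovery.sh", System.getenv("FLINK_HOME"));
        // The existence and executable checks correspond to the validation done at construction.
        System.out.println(resolved
                + " exists=" + resolved.toFile().exists()
                + " executable=" + resolved.toFile().canExecute());
    }
}
```
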
Discovery script integration expects:
- Script to accept the GPU amount as its first argument, followed by the optional `args` value
- Script to output comma-separated GPU indices on a single line to stdout
- Script to exit with code 0 for success
- Script execution to complete within 10 seconds (DISCOVERY_SCRIPT_TIMEOUT_MS)
- If the script outputs multiple lines, only the first line is processed (the extra lines are logged as a warning)

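A minimal sketch of these expectations, using only the standard library: run a command, enforce a timeout, require exit code 0, and keep the first stdout line. `TimedScriptRunner` is a hypothetical name, the `sh -c "echo 0,1"` command stands in for a real discovery script and assumes a POSIX shell, and the sketch raises `IOException` where the driver is documented to throw `FlinkException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not the driver's implementation.
final class TimedScriptRunner {
    // Runs a command, enforces a timeout and a zero exit code,
    // and returns the first stdout line ("" if there is none).
    static String runAndReadFirstLine(List<String> command, long timeoutMs)
            throws IOException, InterruptedException, TimeoutException {
        Process process = new ProcessBuilder(command).start();
        try (BufferedReader stdout = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            if (!process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("Script did not finish within " + timeoutMs + " ms");
            }
            if (process.exitValue() != 0) {
                throw new IOException("Script exited with code " + process.exitValue());
            }
            String firstLine = stdout.readLine(); // later lines are ignored
            return firstLine == null ? "" : firstLine;
        } finally {
            process.destroyForcibly(); // no effect if the process has already exited
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes a POSIX shell is available on the PATH.
        List<String> command = Arrays.asList("sh", "-c", "echo 0,1");
        System.out.println(runAndReadFirstLine(command, 10_000L)); // prints 0,1
    }
}
```

Reading stdout only after the process exits is adequate for the short, single-line output expected here; a script that writes large amounts of output could fill the pipe buffer and would need its streams drained concurrently.
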
## Discovery Script Integration

The driver integrates with external discovery scripts to detect GPU hardware:

```bash
# Example script execution (command format: <script_path> <gpuAmount> <args>)
/path/to/discovery-script.sh 2 --device-type nvidia

# Expected output format (comma-separated indices on single line)
0,1

# If no GPUs found, script should output empty string or just whitespace
```

The discovery script should:
1. Accept the GPU amount as its first argument
2. Accept the optional configuration arguments after the GPU amount (the string "null" is passed if no args are configured)
3. Output comma-separated GPU device indices to stdout on a single line
4. Exit with code 0 on success
5. Complete execution within 10 seconds
6. Not rely on whitespace around GPU indices being preserved (the driver trims each index during parsing)

The driver executes the script using `Runtime.exec()` with the command format `<script_absolute_path> <gpuAmount> <args>`.
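
The sketch below shows what the script would receive, assuming the command is passed as one string to `Runtime.exec(String)`; the text above does not state which overload is used. That overload splits the string on whitespace with a `StringTokenizer`, so a multi-word `args` value arrives as several separate arguments, and an unset `args` arrives as the text `null`. `CommandTokenDemo` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo; reproduces the tokenization rule of Runtime.exec(String).
final class CommandTokenDemo {
    // Builds the command string in the documented format.
    static String buildCommand(String scriptPath, long gpuAmount, String args) {
        return scriptPath + " " + gpuAmount + " " + args; // a null args becomes the text "null"
    }

    // Splits on whitespace using a default StringTokenizer.
    static List<String> tokenize(String command) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(command);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(buildCommand("/path/to/discovery-script.sh", 2, "--device-type nvidia")));
        // prints [/path/to/discovery-script.sh, 2, --device-type, nvidia]
        System.out.println(tokenize(buildCommand("/path/to/discovery-script.sh", 2, null)));
        // prints [/path/to/discovery-script.sh, 2, null]
    }
}
```

Under this assumption, quoting inside `args` is not interpreted by a shell, and a script path containing spaces would also be split, so paths without whitespace are the safer choice.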