0
# Custom Kernels
1
2
User-defined CUDA kernel creation through ElementwiseKernel, ReductionKernel, and RawKernel classes, enabling custom GPU operations and performance-critical computations. These tools allow developers to write custom CUDA code while maintaining CuPy's array interface.
3
4
## Capabilities
5
6
### ElementwiseKernel
7
8
Create custom element-wise operations that apply functions to each element of input arrays.
9
10
```python { .api }
11
class ElementwiseKernel:
12
"""User-defined elementwise kernel for custom element-wise operations.
13
14
Enables creation of custom CUDA kernels that operate element-wise
15
on input arrays, similar to NumPy universal functions but with
16
custom GPU-optimized implementations.
17
"""
18
19
def __init__(self, in_params, out_params, operation, name='kernel', **kwargs):
20
"""Initialize elementwise kernel.
21
22
Parameters:
23
- in_params: str, input parameter specification (e.g., 'T x, T y')
24
- out_params: str, output parameter specification (e.g., 'T z')
25
- operation: str, CUDA C++ code for the operation
26
- name: str, kernel name for debugging
27
- reduce_dims: bool, whether to reduce dimensions
28
- type_preamble: str, additional type definitions
29
- preamble: str, additional CUDA code before kernel
30
"""
31
32
def __call__(self, *args, **kwargs):
33
"""Execute kernel on input arrays.
34
35
Parameters:
36
- args: input arrays matching in_params specification
37
- kwargs: additional kernel arguments
38
39
Returns:
40
cupy.ndarray: output array(s) as specified by out_params
41
"""
42
```
43
44
### ReductionKernel
45
46
Create custom reduction operations that combine array elements along specified axes.
47
48
```python { .api }
49
class ReductionKernel:
50
"""User-defined reduction kernel for custom reduction operations.
51
52
Enables creation of custom CUDA reduction kernels that combine
53
array elements along axes, similar to NumPy reduction functions
54
but with custom GPU-optimized implementations.
55
"""
56
57
def __init__(self, in_params, out_params, map_expr, reduce_expr,
58
post_map_expr='', identity=None, name='reduce_kernel', **kwargs):
59
"""Initialize reduction kernel.
60
61
Parameters:
62
- in_params: str, input parameter specification
63
- out_params: str, output parameter specification
64
- map_expr: str, expression to map input to intermediate values
65
- reduce_expr: str, expression to reduce intermediate values
66
- post_map_expr: str, expression to post-process mapped values
67
- identity: str, identity value for reduction
68
- name: str, kernel name for debugging
69
- reduce_type: str, intermediate reduction type
70
- type_preamble: str, additional type definitions
71
- preamble: str, additional CUDA code
72
"""
73
74
def __call__(self, *args, **kwargs):
75
"""Execute reduction kernel on input arrays.
76
77
Parameters:
78
- args: input arrays matching in_params specification
79
- axis: int or tuple, axes to reduce over
80
- keepdims: bool, whether to keep reduced dimensions
81
82
Returns:
83
cupy.ndarray: reduced output array
84
"""
85
```
86
87
### RawKernel
88
89
Create kernels with full control over CUDA code and execution parameters.
90
91
```python { .api }
92
class RawKernel:
93
"""User-defined raw kernel for maximum control over CUDA execution.
94
95
Provides direct access to CUDA kernel launch parameters and
96
complete control over kernel implementation, suitable for
97
complex custom algorithms and performance optimization.
98
"""
99
100
def __init__(self, code, name, backend='nvrtc', **kwargs):
101
"""Initialize raw kernel from CUDA source code.
102
103
Parameters:
104
- code: str, complete CUDA kernel source code
105
- name: str, kernel function name in source code
106
- backend: str, compilation backend ('nvrtc' or 'nvcc')
107
- options: tuple, compiler options
108
- jitify: bool, whether to use jitify for compilation
109
- enable_cooperative_groups: bool, enable cooperative groups
110
"""
111
112
def __call__(self, grid, block, args, **kwargs):
113
"""Execute raw kernel with specified launch configuration.
114
115
Parameters:
116
- grid: tuple, grid dimensions (gridDim)
117
- block: tuple, block dimensions (blockDim)
118
- args: tuple, kernel arguments
119
- shared_mem: int, shared memory size in bytes
120
- stream: Stream, CUDA stream for execution
121
"""
122
```
123
124
### RawModule
125
126
Load and manage complete CUDA modules with multiple kernels.
127
128
```python { .api }
129
class RawModule:
130
"""User-defined raw module for managing multiple CUDA kernels.
131
132
Enables loading complete CUDA modules containing multiple
133
kernel functions, constants, and device functions for
134
complex GPU applications.
135
"""
136
137
def __init__(self, code, backend='nvrtc', **kwargs):
138
"""Initialize raw module from CUDA source code.
139
140
Parameters:
141
- code: str, complete CUDA module source code
142
- backend: str, compilation backend ('nvrtc' or 'nvcc')
143
- options: tuple, compiler options
144
- name_expressions: list, symbols to extract from module
145
- jitify: bool, whether to use jitify
146
"""
147
148
def get_function(self, name):
149
"""Get kernel function by name.
150
151
Parameters:
152
- name: str, kernel function name
153
154
Returns:
155
RawKernel: kernel function object
156
"""
157
158
def get_global_var(self, name):
159
"""Get global variable by name.
160
161
Parameters:
162
- name: str, global variable name
163
164
Returns:
165
int: device pointer to global variable
166
"""
167
```
168
169
### Kernel Compilation and Caching
170
171
Utilities for kernel compilation and performance optimization.
172
173
```python { .api }
174
def memoize(for_each_device=False):
175
"""Decorator to memoize function results for performance.
176
177
Parameters:
178
- for_each_device: bool, whether to memoize per device
179
180
Returns:
181
callable: memoized function
182
"""
183
184
def clear_memo():
185
"""Clear memoization cache to free memory."""
186
187
def compile_with_cache(source, filename, dirname=None, **kwargs):
188
"""Compile CUDA source with caching for improved performance.
189
190
Parameters:
191
- source: str, CUDA source code
192
- filename: str, source filename for cache key
193
- dirname: str, directory for cache files
194
- kwargs: additional compilation options
195
196
Returns:
197
compiled module object
198
"""
199
```
200
201
### JIT Compilation Interface
202
203
Just-in-time compilation for dynamic kernel generation.
204
205
```python { .api }
206
def rawkernel(mode='python', device=False):
207
"""Decorator for creating raw kernels from Python functions.
208
209
Enables writing CUDA kernels using Python syntax with automatic
210
compilation to CUDA C++ code.
211
212
Parameters:
213
- mode: str, compilation mode ('python' or 'cuda')
214
- device: bool, whether function runs on device
215
216
Returns:
217
callable: decorated kernel function
218
"""
219
```
220
221
## Usage Examples
222
223
### Basic ElementwiseKernel
224
225
```python
226
import cupy as cp
227
228
# Define custom elementwise operation
229
add_kernel = cp.ElementwiseKernel(
230
'float32 x, float32 y', # Input parameters
231
'float32 z', # Output parameter
232
'z = x + y * 2', # Operation
233
'custom_add' # Kernel name
234
)
235
236
# Create input arrays
237
a = cp.array([1, 2, 3, 4], dtype=cp.float32)
238
b = cp.array([5, 6, 7, 8], dtype=cp.float32)
239
240
# Execute kernel
241
result = add_kernel(a, b)
242
print(result) # [11, 14, 17, 20]
243
244
# More complex elementwise operation
245
complex_kernel = cp.ElementwiseKernel(
246
'float32 x, float32 y, float32 alpha',
247
'float32 z',
248
'''
249
float temp = x * alpha + y;
250
z = temp > 0 ? temp : 0; // ReLU activation
251
''',
252
'relu_transform'
253
)
254
255
result = complex_kernel(a, b, 0.5)
256
```
257
258
### Custom ReductionKernel
259
260
```python
261
import cupy as cp
262
263
# Define custom reduction operation (sum of squares)
264
sum_of_squares = cp.ReductionKernel(
265
'float32 x', # Input parameter
266
'float32 out', # Output parameter
267
'x * x', # Map expression (square each element)
268
'a + b', # Reduce expression (sum)
269
'0', # Identity value
270
'sum_of_squares' # Kernel name
271
)
272
273
# Test the kernel
274
data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
275
result = sum_of_squares(data)
276
print(result) # 55.0 (1² + 2² + 3² + 4² + 5²)
277
278
# Custom reduction with axis support
279
axis_result = sum_of_squares(data.reshape(1, -1), axis=1)
280
print(axis_result) # [55.]
281
282
# More complex reduction: weighted mean
283
weighted_mean = cp.ReductionKernel(
284
'float32 x, float32 w',
285
'float32 out',
286
'x * w', # Multiply value by weight
287
'a + b', # Sum weighted values
288
'0',
289
'weighted_sum'
290
)
291
292
values = cp.array([1, 2, 3, 4], dtype=cp.float32)
293
weights = cp.array([0.1, 0.2, 0.3, 0.4], dtype=cp.float32)
294
weighted_sum = weighted_mean(values, weights)
295
total_weight = cp.sum(weights)
296
mean = weighted_sum / total_weight
297
print(f"Weighted mean: {mean}")
298
```
299
300
### Advanced RawKernel
301
302
```python
303
import cupy as cp
304
305
# Define complex CUDA kernel
306
matrix_multiply_kernel = cp.RawKernel(r'''
307
extern "C" __global__
308
void matrix_multiply(const float* A, const float* B, float* C,
309
int M, int N, int K) {
310
int row = blockIdx.y * blockDim.y + threadIdx.y;
311
int col = blockIdx.x * blockDim.x + threadIdx.x;
312
313
if (row < M && col < N) {
314
float sum = 0.0f;
315
for (int k = 0; k < K; k++) {
316
sum += A[row * K + k] * B[k * N + col];
317
}
318
C[row * N + col] = sum;
319
}
320
}
321
''', 'matrix_multiply')
322
323
# Create test matrices
324
M, N, K = 1024, 1024, 512
325
A = cp.random.random((M, K), dtype=cp.float32)
326
B = cp.random.random((K, N), dtype=cp.float32)
327
C = cp.zeros((M, N), dtype=cp.float32)
328
329
# Configure kernel launch
330
block_size = (16, 16)
331
grid_size = ((N + block_size[0] - 1) // block_size[0],
332
(M + block_size[1] - 1) // block_size[1])
333
334
# Execute kernel
335
matrix_multiply_kernel(
336
grid_size, block_size,
337
(A, B, C, M, N, K) # Kernel arguments
338
)
339
340
# Verify result
341
expected = cp.dot(A, B)
342
print(f"Results match: {cp.allclose(C, expected)}")
343
```
344
345
### RawModule with Multiple Kernels
346
347
```python
348
import cupy as cp
349
350
# Define module with multiple related kernels
351
cuda_module_code = r'''
352
extern "C" {
353
354
__device__ float activation_relu(float x) {
355
return fmaxf(0.0f, x);
356
}
357
358
__device__ float activation_sigmoid(float x) {
359
return 1.0f / (1.0f + expf(-x));
360
}
361
362
__global__ void apply_activation(const float* input, float* output,
363
int size, int activation_type) {
364
int idx = blockIdx.x * blockDim.x + threadIdx.x;
365
if (idx < size) {
366
float x = input[idx];
367
if (activation_type == 0) {
368
output[idx] = activation_relu(x);
369
} else if (activation_type == 1) {
370
output[idx] = activation_sigmoid(x);
371
}
372
}
373
}
374
375
__global__ void vector_add(const float* a, const float* b,
376
float* c, int size) {
377
int idx = blockIdx.x * blockDim.x + threadIdx.x;
378
if (idx < size) {
379
c[idx] = a[idx] + b[idx];
380
}
381
}
382
383
}
384
'''
385
386
# Load module
387
module = cp.RawModule(code=cuda_module_code)
388
389
# Get kernel functions
390
activation_kernel = module.get_function('apply_activation')
391
add_kernel = module.get_function('vector_add')
392
393
# Test activation kernel
394
data = cp.array([-2, -1, 0, 1, 2], dtype=cp.float32)
395
output = cp.zeros_like(data)
396
397
block_size = 256
398
grid_size = (len(data) + block_size - 1) // block_size
399
400
# Apply ReLU (activation_type=0)
401
activation_kernel(
402
(grid_size,), (block_size,),
403
(data, output, data.size, 0)
404
)
405
print(f"ReLU: {output}") # [0, 0, 0, 1, 2]
406
407
# Apply Sigmoid (activation_type=1)
408
activation_kernel(
409
(grid_size,), (block_size,),
410
(data, output, data.size, 1)
411
)
412
print(f"Sigmoid: {output}")
413
```
414
415
### Performance Optimization Techniques
416
417
```python
418
import cupy as cp
419
import time
420
421
# Kernel with shared memory optimization
422
optimized_kernel = cp.RawKernel(r'''
423
extern "C" __global__
424
void optimized_reduction(const float* input, float* output, int size) {
425
extern __shared__ float sdata[];
426
427
int tid = threadIdx.x;
428
int i = blockIdx.x * blockDim.x + threadIdx.x;
429
430
// Load data into shared memory
431
sdata[tid] = (i < size) ? input[i] : 0;
432
__syncthreads();
433
434
// Perform reduction in shared memory
435
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
436
if (tid < s) {
437
sdata[tid] += sdata[tid + s];
438
}
439
__syncthreads();
440
}
441
442
// Write result for this block to global memory
443
if (tid == 0) output[blockIdx.x] = sdata[0];
444
}
445
''', 'optimized_reduction')
446
447
# Benchmark against CuPy's built-in sum
448
data = cp.random.random(1000000, dtype=cp.float32)
449
450
# Time custom kernel
451
block_size = 256
452
grid_size = (data.size + block_size - 1) // block_size
453
output = cp.zeros(grid_size, dtype=cp.float32)
454
455
start_time = time.time()
456
for _ in range(100):
457
optimized_kernel(
458
(grid_size,), (block_size,),
459
(data, output, data.size),
460
shared_mem=block_size * 4 # 4 bytes per float
461
)
462
cp.cuda.Stream.null.synchronize()
463
custom_time = time.time() - start_time
464
465
# Time built-in sum
466
start_time = time.time()
467
for _ in range(100):
468
builtin_result = cp.sum(data)
469
cp.cuda.Stream.null.synchronize()
470
builtin_time = time.time() - start_time
471
472
custom_result = cp.sum(output)
473
print(f"Custom kernel time: {custom_time:.4f}s")
474
print(f"Built-in sum time: {builtin_time:.4f}s")
475
print(f"Results match: {cp.allclose(custom_result, builtin_result)}")
476
```
477
478
### Memory-Efficient Patterns
479
480
```python
481
import cupy as cp
482
483
# In-place operation kernel
484
inplace_kernel = cp.ElementwiseKernel(
485
'float32 x, float32 alpha',
486
'float32 x', # Same array for input and output
487
'x = x * alpha + 1',
488
'inplace_transform'
489
)
490
491
# Create data
492
data = cp.random.random(1000000, dtype=cp.float32)
493
original_ptr = data.data.ptr
494
495
# Apply in-place transformation
496
inplace_kernel(data, 2.0, data) # Modify data in-place
497
498
# Verify same memory location
499
print(f"Same memory location: {data.data.ptr == original_ptr}")
500
501
# Kernel with multiple outputs
502
multi_output_kernel = cp.ElementwiseKernel(
503
'float32 x',
504
'float32 sin_x, float32 cos_x, float32 tan_x',
505
'''
506
sin_x = sinf(x);
507
cos_x = cosf(x);
508
tan_x = tanf(x);
509
''',
510
'trig_functions'
511
)
512
513
# Compute multiple trigonometric functions simultaneously
514
angles = cp.linspace(0, 2 * cp.pi, 1000, dtype=cp.float32)
515
sin_vals, cos_vals, tan_vals = multi_output_kernel(angles)
516
517
print(f"Identity check: {cp.allclose(sin_vals**2 + cos_vals**2, 1.0)}")
518
```