0
# Custom Kernel Development
1
2
CuPy provides comprehensive facilities for developing custom CUDA kernels, enabling users to write high-performance GPU code directly in Python. This includes element-wise operations, reduction operations, and raw CUDA kernels for specialized computational tasks that require maximum performance or functionality not available in standard CuPy operations.
3
4
## Capabilities
5
6
### Element-wise Kernels
7
8
Element-wise kernels apply operations to individual elements of arrays, supporting broadcasting and multiple input/output arrays.
9
10
```python { .api }
11
class ElementwiseKernel:
12
"""
13
Compile and execute element-wise CUDA kernels from C++ code.
14
15
ElementwiseKernel enables writing custom element-wise operations
16
that can be applied to CuPy arrays with automatic broadcasting
17
and type handling.
18
"""
19
def __init__(self, in_params, out_params, operation, name="kernel", **kwargs):
20
"""
21
Parameters:
22
in_params: str - Input parameter declarations (e.g., "T x, T y")
23
out_params: str - Output parameter declarations (e.g., "T z")
24
operation: str - C++ code for the element-wise operation
25
name: str, optional - Kernel name for debugging
26
reduce_dims: bool, optional - Enable dimension reduction
27
type_preamble: str, optional - Additional type declarations
28
no_return: bool, optional - Kernel doesn't return a value
29
"""
30
31
def __call__(self, *args, **kwargs):
32
"""
33
Execute the kernel on input arrays.
34
35
Parameters:
36
*args: Input arrays matching in_params specification
37
**kwargs: Additional execution parameters
38
"""
39
40
def create_elementwise_kernel(in_params, out_params, operation, name="kernel"):
41
"""
42
Create an element-wise kernel function.
43
44
Parameters:
45
in_params: str - Input parameter declarations
46
out_params: str - Output parameter declarations
47
operation: str - C++ operation code
48
name: str, optional - Kernel name
49
"""
50
```
51
52
### Reduction Kernels
53
54
Reduction kernels perform operations that reduce arrays along specified axes, such as sum, max, min, or custom reduction operations.
55
56
```python { .api }
57
class ReductionKernel:
58
"""
59
Compile and execute reduction CUDA kernels from C++ code.
60
61
ReductionKernel enables writing custom reduction operations
62
that can reduce arrays along specified axes with automatic
63
handling of different reduction strategies.
64
"""
65
def __init__(self, in_params, out_params, map_expr, reduce_expr, post_map_expr=None, **kwargs):
66
"""
67
Parameters:
68
in_params: str - Input parameter declarations
69
out_params: str - Output parameter declarations
70
map_expr: str - Expression applied to each input element
71
reduce_expr: str - Expression for combining mapped values
72
post_map_expr: str, optional - Expression applied after reduction
73
identity: str, optional - Identity value for reduction
74
name: str, optional - Kernel name
75
reduce_type: str, optional - Data type for intermediate reductions
76
type_preamble: str, optional - Additional type declarations
77
"""
78
79
def __call__(self, *args, axis=None, dtype=None, out=None, keepdims=False):
80
"""
81
Execute the reduction kernel.
82
83
Parameters:
84
*args: Input arrays
85
axis: int or tuple, optional - Axes along which to reduce
86
dtype: data-type, optional - Output data type
87
out: ndarray, optional - Output array
88
keepdims: bool, optional - Keep reduced dimensions
89
"""
90
91
def create_reduction_kernel(in_params, out_params, map_expr, reduce_expr, **kwargs):
92
"""
93
Create a reduction kernel function.
94
95
Parameters:
96
in_params: str - Input parameter declarations
97
out_params: str - Output parameter declarations
98
map_expr: str - Mapping expression
99
reduce_expr: str - Reduction expression
100
**kwargs: Additional kernel parameters
101
"""
102
```
103
104
### Raw CUDA Kernels
105
106
Raw kernels provide direct access to CUDA kernel development with full control over thread indexing, shared memory, and GPU programming constructs.
107
108
```python { .api }
109
class RawKernel:
110
"""
111
Compile and execute raw CUDA kernels from C++ source code.
112
113
RawKernel provides complete control over CUDA kernel development,
114
allowing direct manipulation of thread indices, shared memory,
115
and advanced GPU programming patterns.
116
"""
117
def __init__(self, code, name, **kwargs):
118
"""
119
Parameters:
120
code: str - Complete CUDA C++ kernel source code
121
name: str - Kernel function name
122
options: tuple, optional - Compilation options
123
headers: tuple, optional - Header files to include
124
backend: str, optional - Compilation backend ('nvcc', 'nvrtc')
125
translate_cucomplex: bool, optional - Handle complex number types
126
enable_cooperative_groups: bool, optional - Enable cooperative groups
127
jitify: bool, optional - Use Jitify for compilation
128
"""
129
130
def __call__(self, grid, block, args=(), shared_mem=0, stream=None):
131
"""
132
Launch the CUDA kernel.
133
134
Parameters:
135
grid: tuple - Grid dimensions (blocks per grid)
136
block: tuple - Block dimensions (threads per block)
137
args: tuple, optional - Kernel arguments
138
shared_mem: int, optional - Shared memory size in bytes
139
stream: Stream, optional - CUDA stream for execution
140
"""
141
142
@property
143
def attributes(self):
144
"""Get kernel attributes (max threads, registers, etc.)."""
145
146
class RawModule:
147
"""
148
Load and manage CUDA modules containing multiple kernels.
149
150
RawModule allows loading pre-compiled CUDA modules (PTX/CUBIN)
151
and accessing multiple kernel functions within the module.
152
"""
153
def __init__(self, code=None, **kwargs):
154
"""
155
Parameters:
156
code: str or bytes, optional - Module source code or binary
157
cubin: bytes, optional - Pre-compiled CUBIN binary
158
ptx: bytes, optional - PTX assembly code
159
"""
160
161
def get_function(self, name):
162
"""
163
Get a kernel function from the module.
164
165
Parameters:
166
name: str - Function name in the module
167
"""
168
169
def get_global(self, name):
170
"""
171
Get a global variable from the module.
172
173
Parameters:
174
name: str - Global variable name
175
"""
176
177
def compile_with_cache(source, **kwargs):
178
"""
179
Compile CUDA source with caching for faster subsequent uses.
180
181
Parameters:
182
source: str - CUDA C++ source code
183
**kwargs: Compilation options
184
"""
185
```
186
187
### Kernel Utilities
188
189
Utility functions and classes for kernel development and management.
190
191
```python { .api }
192
def memoize(for_each_device=False):
193
"""
194
Decorator for memoizing kernel compilation results.
195
196
Parameters:
197
for_each_device: bool, optional - Cache separately for each device
198
"""
199
200
def clear_memo():
201
"""Clear all memoized kernel compilation cache."""
202
203
class ParameterInfo:
204
"""
205
Information about kernel parameters for type checking and validation.
206
207
Provides metadata about kernel parameter types, shapes, and constraints
208
for automatic validation and optimization.
209
"""
210
def __init__(self, param, is_const=False): ...
211
212
def _get_param_info(s, is_const=False):
213
"""
214
Parse parameter declaration strings into ParameterInfo objects.
215
216
Parameters:
217
s: str - Parameter declaration string
218
is_const: bool, optional - Whether parameter is const
219
"""
220
```
221
222
## Usage Examples
223
224
### Element-wise Kernel Example
225
226
```python
227
import cupy as cp
228
229
# Define a custom element-wise kernel for vector addition with scaling
230
add_scale_kernel = cp.ElementwiseKernel(
231
'T x, T y, T scale', # Input parameters
232
'T z', # Output parameter
233
'z = (x + y) * scale', # Operation
234
'add_scale_kernel' # Kernel name
235
)
236
237
# Create input arrays
238
x = cp.array([1, 2, 3, 4], dtype=cp.float32)
239
y = cp.array([5, 6, 7, 8], dtype=cp.float32)
240
scale = cp.float32(2.0)
241
242
# Execute the kernel
243
result = add_scale_kernel(x, y, scale)
244
print(result) # [12. 16. 20. 24.]
245
246
# Complex element-wise operation with multiple outputs
247
complex_kernel = cp.ElementwiseKernel(
248
'T x, T y',
249
'T sum, T diff, T prod',
250
'''
251
sum = x + y;
252
diff = x - y;
253
prod = x * y;
254
''',
255
'complex_ops'
256
)
257
258
# Multiple outputs
259
x = cp.array([10, 20, 30])
260
y = cp.array([1, 2, 3])
261
sum_out, diff_out, prod_out = complex_kernel(x, y)
262
```
263
264
### Reduction Kernel Example
265
266
```python
267
# Custom reduction kernel for computing variance
268
variance_kernel = cp.ReductionKernel(
269
'T x, T mean', # Input parameters
270
'T variance', # Output parameter
271
'(x - mean) * (x - mean)', # Map expression
272
'a + b', # Reduce expression (sum of squares)
273
'0', # Identity (zero)
274
'variance_kernel'
275
)
276
277
# Compute variance using the custom kernel
278
data = cp.random.normal(0, 1, 1000000)
279
mean = cp.mean(data)
280
variance = variance_kernel(data, mean) / (data.size - 1)
281
282
# Reduction with post-processing
283
rms_kernel = cp.ReductionKernel(
284
'T x',
285
'T rms',
286
'x * x', # Square each element
287
'a + b', # Sum the squares
288
'sqrt(a / _in_ind.size())', # Post-processing: sqrt(sum/n)
289
'0'
290
)
291
292
data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
293
rms_value = rms_kernel(data)
294
```
295
296
### Raw Kernel Example
297
298
```python
299
# Raw CUDA kernel for matrix multiplication (simplified)
300
matmul_kernel_code = r'''
301
extern "C" __global__
302
void matmul_kernel(float* A, float* B, float* C, int M, int N, int K) {
303
int row = blockIdx.y * blockDim.y + threadIdx.y;
304
int col = blockIdx.x * blockDim.x + threadIdx.x;
305
306
if (row < M && col < N) {
307
float sum = 0.0f;
308
for (int k = 0; k < K; k++) {
309
sum += A[row * K + k] * B[k * N + col];
310
}
311
C[row * N + col] = sum;
312
}
313
}
314
'''
315
316
# Compile the kernel
317
matmul_kernel = cp.RawKernel(matmul_kernel_code, 'matmul_kernel')
318
319
# Prepare matrices
320
M, N, K = 1024, 1024, 1024
321
A = cp.random.rand(M, K, dtype=cp.float32)
322
B = cp.random.rand(K, N, dtype=cp.float32)
323
C = cp.zeros((M, N), dtype=cp.float32)
324
325
# Launch kernel with appropriate grid/block dimensions
326
threads_per_block = (16, 16)
327
blocks_per_grid = (
328
(N + threads_per_block[0] - 1) // threads_per_block[0],
329
(M + threads_per_block[1] - 1) // threads_per_block[1]
330
)
331
332
matmul_kernel(
333
blocks_per_grid,
334
threads_per_block,
335
(A, B, C, M, N, K)
336
)
337
338
# Advanced raw kernel with shared memory
339
shared_mem_kernel_code = r'''
340
extern "C" __global__
341
void vector_reduction(float* input, float* output, int n) {
342
extern __shared__ float sdata[];
343
344
unsigned int tid = threadIdx.x;
345
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
346
347
// Load data into shared memory
348
sdata[tid] = (i < n) ? input[i] : 0.0f;
349
__syncthreads();
350
351
// Perform reduction in shared memory
352
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
353
if (tid < s) {
354
sdata[tid] += sdata[tid + s];
355
}
356
__syncthreads();
357
}
358
359
// Write result for this block to global memory
360
if (tid == 0) output[blockIdx.x] = sdata[0];
361
}
362
'''
363
364
shared_kernel = cp.RawKernel(shared_mem_kernel_code, 'vector_reduction')
365
366
# Input data
367
input_data = cp.random.rand(1000000, dtype=cp.float32)
368
block_size = 256
369
grid_size = (input_data.size + block_size - 1) // block_size
370
output = cp.zeros(grid_size, dtype=cp.float32)
371
372
# Launch with shared memory
373
shared_memory_size = block_size * 4 # 4 bytes per float
374
shared_kernel(
375
(grid_size,),
376
(block_size,),
377
(input_data, output, input_data.size),
378
shared_mem=shared_memory_size
379
)
380
381
# Final reduction on CPU or with another kernel
382
final_result = cp.sum(output)
383
```
384
385
### Performance Optimization Examples
386
387
```python
388
# Using kernel caching for better performance
389
@cp.memoize(for_each_device=True)
390
def create_optimized_kernel():
391
return cp.ElementwiseKernel(
392
'T x, T y, T alpha, T beta',
393
'T z',
394
'z = alpha * x + beta * y', # AXPY operation
395
'axpy_kernel',
396
options=('-O3',), # Optimization flags
397
)
398
399
# Reuse compiled kernel multiple times
400
axpy_kernel = create_optimized_kernel()
401
for i in range(100):
402
result = axpy_kernel(x, y, alpha, beta)
403
404
# Template specialization for different data types
405
template_kernel_code = r'''
406
template<typename T>
407
__device__ T complex_function(T x, T y) {
408
return sqrt(x * x + y * y);
409
}
410
411
extern "C" __global__
412
void magnitude_kernel(float* x, float* y, float* result, int n) {
413
int i = blockIdx.x * blockDim.x + threadIdx.x;
414
if (i < n) {
415
result[i] = complex_function(x[i], y[i]);
416
}
417
}
418
'''
419
420
# Compile with template support
421
magnitude_kernel = cp.RawKernel(
422
template_kernel_code,
423
'magnitude_kernel',
424
options=('--std=c++14',)
425
)
426
427
# Custom reduction with atomics for complex reductions
428
atomic_kernel_code = r'''
429
extern "C" __global__
430
void atomic_histogram(int* data, int* hist, int n, int num_bins) {
431
int i = blockIdx.x * blockDim.x + threadIdx.x;
432
if (i < n) {
433
int bin = data[i] % num_bins;
434
atomicAdd(&hist[bin], 1);
435
}
436
}
437
'''
438
439
histogram_kernel = cp.RawKernel(atomic_kernel_code, 'atomic_histogram')
440
441
# Create histogram using atomic operations
442
data = cp.random.randint(0, 100, size=1000000)
443
histogram = cp.zeros(100, dtype=cp.int32)
444
445
threads_per_block = 256
446
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
447
448
histogram_kernel(
449
(blocks_per_grid,),
450
(threads_per_block,),
451
(data, histogram, data.size, 100)
452
)
453
```
454
455
## Advanced Features
456
457
### Kernel Fusion and Optimization
458
459
```python
460
# Fused kernel combining multiple operations
461
fused_kernel = cp.ElementwiseKernel(
462
'T x, T y, T z, T alpha, T beta',
463
'T result',
464
'''
465
T temp1 = x * alpha + y * beta; // Linear combination
466
T temp2 = temp1 * temp1; // Square
467
result = sqrt(temp2 + z); // Final computation
468
''',
469
'fused_operations'
470
)
471
472
# Memory coalescing optimization
473
coalesced_kernel_code = r'''
474
extern "C" __global__
475
void coalesced_transpose(float* input, float* output, int width, int height) {
476
__shared__ float tile[32][32];
477
478
int x = blockIdx.x * blockDim.x + threadIdx.x;
479
int y = blockIdx.y * blockDim.y + threadIdx.y;
480
481
// Read from global memory (coalesced)
482
if (x < width && y < height) {
483
tile[threadIdx.y][threadIdx.x] = input[y * width + x];
484
}
485
486
__syncthreads();
487
488
// Calculate transposed coordinates
489
x = blockIdx.y * blockDim.y + threadIdx.x;
490
y = blockIdx.x * blockDim.x + threadIdx.y;
491
492
// Write to global memory (coalesced)
493
if (x < height && y < width) {
494
output[y * height + x] = tile[threadIdx.x][threadIdx.y];
495
}
496
}
497
'''
498
499
transpose_kernel = cp.RawKernel(coalesced_kernel_code, 'coalesced_transpose')
500
```
501
502
### Error Handling and Debugging
503
504
```python
505
# Kernel with error checking
506
debug_kernel_code = r'''
507
extern "C" __global__
508
void debug_kernel(float* input, float* output, int n) {
509
int i = blockIdx.x * blockDim.x + threadIdx.x;
510
511
if (i >= n) return; // Bounds checking
512
513
float value = input[i];
514
515
// Check for invalid values
516
if (isnan(value) || isinf(value)) {
517
printf("Invalid value at index %d: %f\n", i, value);
518
output[i] = 0.0f;
519
} else {
520
output[i] = sqrt(value);
521
}
522
}
523
'''
524
525
debug_kernel = cp.RawKernel(debug_kernel_code, 'debug_kernel')
526
527
# Use with error checking enabled
528
cp.cuda.set_device(0)
529
try:
530
result = debug_kernel((grid_size,), (block_size,), (input_data, output, n))
531
cp.cuda.synchronize() # Ensure completion and flush printf
532
except cp.cuda.runtime.CUDARuntimeError as e:
533
print(f"Kernel execution failed: {e}")
534
```
535
536
Custom kernel development in CuPy provides the foundation for high-performance GPU computing, enabling developers to implement specialized algorithms, optimize critical computational bottlenecks, and access advanced CUDA features while maintaining integration with the broader CuPy ecosystem.