Tessl Tile for pypi/cupy-cuda11x@13.6.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

array-operations.md cuda-integration.md custom-kernels.md fft.md index.md io-operations.md jit-compilation.md linear-algebra.md mathematical-functions.md performance-profiling.md polynomial-operations.md random.md scipy-extensions.md

custom-kernels.mddocs/

0
# Custom Kernel Development
1

2
CuPy provides comprehensive facilities for developing custom CUDA kernels, enabling users to write high-performance GPU code directly in Python. This includes element-wise operations, reduction operations, and raw CUDA kernels for specialized computational tasks that require maximum performance or functionality not available in standard CuPy operations.
3

4
## Capabilities
5

6
### Element-wise Kernels
7

8
Element-wise kernels apply operations to individual elements of arrays, supporting broadcasting and multiple input/output arrays.
9

10
```python { .api }
11
class ElementwiseKernel:
12
    """
13
    Compile and execute element-wise CUDA kernels from C++ code.
14
    
15
    ElementwiseKernel enables writing custom element-wise operations
16
    that can be applied to CuPy arrays with automatic broadcasting
17
    and type handling.
18
    """
19
    def __init__(self, in_params, out_params, operation, name="kernel", **kwargs):
20
        """
21
        Parameters:
22
            in_params: str - Input parameter declarations (e.g., "T x, T y")
23
            out_params: str - Output parameter declarations (e.g., "T z")
24
            operation: str - C++ code for the element-wise operation
25
            name: str, optional - Kernel name for debugging
26
            reduce_dims: bool, optional - Enable dimension reduction
27
            type_preamble: str, optional - Additional type declarations
28
            no_return: bool, optional - Kernel doesn't return a value
29
        """
30
    
31
    def __call__(self, *args, **kwargs):
32
        """
33
        Execute the kernel on input arrays.
34
        
35
        Parameters:
36
            *args: Input arrays matching in_params specification
37
            **kwargs: Additional execution parameters
38
        """
39

40
def create_elementwise_kernel(in_params, out_params, operation, name="kernel"):
41
    """
42
    Create an element-wise kernel function.
43
    
44
    Parameters:
45
        in_params: str - Input parameter declarations
46
        out_params: str - Output parameter declarations  
47
        operation: str - C++ operation code
48
        name: str, optional - Kernel name
49
    """
50
```
51

52
### Reduction Kernels
53

54
Reduction kernels perform operations that reduce arrays along specified axes, such as sum, max, min, or custom reduction operations.
55

56
```python { .api }
57
class ReductionKernel:
58
    """
59
    Compile and execute reduction CUDA kernels from C++ code.
60
    
61
    ReductionKernel enables writing custom reduction operations
62
    that can reduce arrays along specified axes with automatic
63
    handling of different reduction strategies.
64
    """
65
    def __init__(self, in_params, out_params, map_expr, reduce_expr, post_map_expr=None, **kwargs):
66
        """
67
        Parameters:
68
            in_params: str - Input parameter declarations
69
            out_params: str - Output parameter declarations
70
            map_expr: str - Expression applied to each input element
71
            reduce_expr: str - Expression for combining mapped values
72
            post_map_expr: str, optional - Expression applied after reduction
73
            identity: str, optional - Identity value for reduction
74
            name: str, optional - Kernel name
75
            reduce_type: str, optional - Data type for intermediate reductions
76
            type_preamble: str, optional - Additional type declarations
77
        """
78
    
79
    def __call__(self, *args, axis=None, dtype=None, out=None, keepdims=False):
80
        """
81
        Execute the reduction kernel.
82
        
83
        Parameters:
84
            *args: Input arrays
85
            axis: int or tuple, optional - Axes along which to reduce
86
            dtype: data-type, optional - Output data type
87
            out: ndarray, optional - Output array
88
            keepdims: bool, optional - Keep reduced dimensions
89
        """
90

91
def create_reduction_kernel(in_params, out_params, map_expr, reduce_expr, **kwargs):
92
    """
93
    Create a reduction kernel function.
94
    
95
    Parameters:
96
        in_params: str - Input parameter declarations
97
        out_params: str - Output parameter declarations
98
        map_expr: str - Mapping expression
99
        reduce_expr: str - Reduction expression
100
        **kwargs: Additional kernel parameters
101
    """
102
```
103

104
### Raw CUDA Kernels
105

106
Raw kernels provide direct access to CUDA kernel development with full control over thread indexing, shared memory, and GPU programming constructs.
107

108
```python { .api }
109
class RawKernel:
110
    """
111
    Compile and execute raw CUDA kernels from C++ source code.
112
    
113
    RawKernel provides complete control over CUDA kernel development,
114
    allowing direct manipulation of thread indices, shared memory,
115
    and advanced GPU programming patterns.
116
    """
117
    def __init__(self, code, name, **kwargs):
118
        """
119
        Parameters:
120
            code: str - Complete CUDA C++ kernel source code
121
            name: str - Kernel function name
122
            options: tuple, optional - Compilation options
123
            headers: tuple, optional - Header files to include
124
            backend: str, optional - Compilation backend ('nvcc', 'nvrtc')
125
            translate_cucomplex: bool, optional - Handle complex number types
126
            enable_cooperative_groups: bool, optional - Enable cooperative groups
127
            jitify: bool, optional - Use Jitify for compilation
128
        """
129
    
130
    def __call__(self, grid, block, args=(), shared_mem=0, stream=None):
131
        """
132
        Launch the CUDA kernel.
133
        
134
        Parameters:
135
            grid: tuple - Grid dimensions (blocks per grid)
136
            block: tuple - Block dimensions (threads per block)
137
            args: tuple, optional - Kernel arguments
138
            shared_mem: int, optional - Shared memory size in bytes
139
            stream: Stream, optional - CUDA stream for execution
140
        """
141
    
142
    @property
143
    def attributes(self):
144
        """Get kernel attributes (max threads, registers, etc.)."""
145

146
class RawModule:
147
    """
148
    Load and manage CUDA modules containing multiple kernels.
149
    
150
    RawModule allows loading pre-compiled CUDA modules (PTX/CUBIN)
151
    and accessing multiple kernel functions within the module.
152
    """
153
    def __init__(self, code=None, **kwargs):
154
        """
155
        Parameters:
156
            code: str or bytes, optional - Module source code or binary
157
            cubin: bytes, optional - Pre-compiled CUBIN binary
158
            ptx: bytes, optional - PTX assembly code
159
        """
160
    
161
    def get_function(self, name):
162
        """
163
        Get a kernel function from the module.
164
        
165
        Parameters:
166
            name: str - Function name in the module
167
        """
168
    
169
    def get_global(self, name):
170
        """
171
        Get a global variable from the module.
172
        
173
        Parameters:
174
            name: str - Global variable name
175
        """
176

177
def compile_with_cache(source, **kwargs):
178
    """
179
    Compile CUDA source with caching for faster subsequent uses.
180
    
181
    Parameters:
182
        source: str - CUDA C++ source code
183
        **kwargs: Compilation options
184
    """
185
```
186

187
### Kernel Utilities
188

189
Utility functions and classes for kernel development and management.
190

191
```python { .api }
192
def memoize(for_each_device=False):
193
    """
194
    Decorator for memoizing kernel compilation results.
195
    
196
    Parameters:
197
        for_each_device: bool, optional - Cache separately for each device
198
    """
199

200
def clear_memo():
201
    """Clear all memoized kernel compilation cache."""
202

203
class ParameterInfo:
204
    """
205
    Information about kernel parameters for type checking and validation.
206
    
207
    Provides metadata about kernel parameter types, shapes, and constraints
208
    for automatic validation and optimization.
209
    """
210
    def __init__(self, param, is_const=False): ...
211

212
def _get_param_info(s, is_const=False):
213
    """
214
    Parse parameter declaration strings into ParameterInfo objects.
215
    
216
    Parameters:
217
        s: str - Parameter declaration string
218
        is_const: bool, optional - Whether parameter is const
219
    """
220
```
221

222
## Usage Examples
223

224
### Element-wise Kernel Example
225

226
```python
227
import cupy as cp
228

229
# Define a custom element-wise kernel for vector addition with scaling
230
add_scale_kernel = cp.ElementwiseKernel(
231
    'T x, T y, T scale',  # Input parameters
232
    'T z',                # Output parameter
233
    'z = (x + y) * scale', # Operation
234
    'add_scale_kernel'    # Kernel name
235
)
236

237
# Create input arrays
238
x = cp.array([1, 2, 3, 4], dtype=cp.float32)
239
y = cp.array([5, 6, 7, 8], dtype=cp.float32)
240
scale = cp.float32(2.0)
241

242
# Execute the kernel
243
result = add_scale_kernel(x, y, scale)
244
print(result)  # [12. 16. 20. 24.]
245

246
# Complex element-wise operation with multiple outputs
247
complex_kernel = cp.ElementwiseKernel(
248
    'T x, T y',
249
    'T sum, T diff, T prod',
250
    '''
251
    sum = x + y;
252
    diff = x - y;
253
    prod = x * y;
254
    ''',
255
    'complex_ops'
256
)
257

258
# Multiple outputs
259
x = cp.array([10, 20, 30])
260
y = cp.array([1, 2, 3])
261
sum_out, diff_out, prod_out = complex_kernel(x, y)
262
```
263

264
### Reduction Kernel Example
265

266
```python
267
# Custom reduction kernel for computing variance
268
variance_kernel = cp.ReductionKernel(
269
    'T x, T mean',           # Input parameters
270
    'T variance',            # Output parameter
271
    '(x - mean) * (x - mean)', # Map expression
272
    'a + b',                 # Reduce expression (sum of squares)
273
    '0',                     # Identity (zero)
274
    'variance_kernel'
275
)
276

277
# Compute variance using the custom kernel
278
data = cp.random.normal(0, 1, 1000000)
279
mean = cp.mean(data)
280
variance = variance_kernel(data, mean) / (data.size - 1)
281

282
# Reduction with post-processing
283
rms_kernel = cp.ReductionKernel(
284
    'T x',
285
    'T rms',
286
    'x * x',              # Square each element
287
    'a + b',              # Sum the squares
288
    'sqrt(a / _in_ind.size())',  # Post-processing: sqrt(sum/n)
289
    '0'
290
)
291

292
data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
293
rms_value = rms_kernel(data)
294
```
295

296
### Raw Kernel Example
297

298
```python
299
# Raw CUDA kernel for matrix multiplication (simplified)
300
matmul_kernel_code = r'''
301
extern "C" __global__
302
void matmul_kernel(float* A, float* B, float* C, int M, int N, int K) {
303
    int row = blockIdx.y * blockDim.y + threadIdx.y;
304
    int col = blockIdx.x * blockDim.x + threadIdx.x;
305
    
306
    if (row < M && col < N) {
307
        float sum = 0.0f;
308
        for (int k = 0; k < K; k++) {
309
            sum += A[row * K + k] * B[k * N + col];
310
        }
311
        C[row * N + col] = sum;
312
    }
313
}
314
'''
315

316
# Compile the kernel
317
matmul_kernel = cp.RawKernel(matmul_kernel_code, 'matmul_kernel')
318

319
# Prepare matrices
320
M, N, K = 1024, 1024, 1024
321
A = cp.random.rand(M, K, dtype=cp.float32)
322
B = cp.random.rand(K, N, dtype=cp.float32)
323
C = cp.zeros((M, N), dtype=cp.float32)
324

325
# Launch kernel with appropriate grid/block dimensions
326
threads_per_block = (16, 16)
327
blocks_per_grid = (
328
    (N + threads_per_block[0] - 1) // threads_per_block[0],
329
    (M + threads_per_block[1] - 1) // threads_per_block[1]
330
)
331

332
matmul_kernel(
333
    blocks_per_grid,
334
    threads_per_block,
335
    (A, B, C, M, N, K)
336
)
337

338
# Advanced raw kernel with shared memory
339
shared_mem_kernel_code = r'''
340
extern "C" __global__
341
void vector_reduction(float* input, float* output, int n) {
342
    extern __shared__ float sdata[];
343
    
344
    unsigned int tid = threadIdx.x;
345
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
346
    
347
    // Load data into shared memory
348
    sdata[tid] = (i < n) ? input[i] : 0.0f;
349
    __syncthreads();
350
    
351
    // Perform reduction in shared memory
352
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
353
        if (tid < s) {
354
            sdata[tid] += sdata[tid + s];
355
        }
356
        __syncthreads();
357
    }
358
    
359
    // Write result for this block to global memory
360
    if (tid == 0) output[blockIdx.x] = sdata[0];
361
}
362
'''
363

364
shared_kernel = cp.RawKernel(shared_mem_kernel_code, 'vector_reduction')
365

366
# Input data
367
input_data = cp.random.rand(1000000, dtype=cp.float32)
368
block_size = 256
369
grid_size = (input_data.size + block_size - 1) // block_size
370
output = cp.zeros(grid_size, dtype=cp.float32)
371

372
# Launch with shared memory
373
shared_memory_size = block_size * 4  # 4 bytes per float
374
shared_kernel(
375
    (grid_size,),
376
    (block_size,),
377
    (input_data, output, input_data.size),
378
    shared_mem=shared_memory_size
379
)
380

381
# Final reduction on CPU or with another kernel
382
final_result = cp.sum(output)
383
```
384

385
### Performance Optimization Examples
386

387
```python
388
# Using kernel caching for better performance
389
@cp.memoize(for_each_device=True)
390
def create_optimized_kernel():
391
    return cp.ElementwiseKernel(
392
        'T x, T y, T alpha, T beta',
393
        'T z',
394
        'z = alpha * x + beta * y',  # AXPY operation
395
        'axpy_kernel',
396
        options=('-O3',),  # Optimization flags
397
    )
398

399
# Reuse compiled kernel multiple times
400
axpy_kernel = create_optimized_kernel()
401
for i in range(100):
402
    result = axpy_kernel(x, y, alpha, beta)
403

404
# Template specialization for different data types
405
template_kernel_code = r'''
406
template<typename T>
407
__device__ T complex_function(T x, T y) {
408
    return sqrt(x * x + y * y);
409
}
410

411
extern "C" __global__
412
void magnitude_kernel(float* x, float* y, float* result, int n) {
413
    int i = blockIdx.x * blockDim.x + threadIdx.x;
414
    if (i < n) {
415
        result[i] = complex_function(x[i], y[i]);
416
    }
417
}
418
'''
419

420
# Compile with template support
421
magnitude_kernel = cp.RawKernel(
422
    template_kernel_code,
423
    'magnitude_kernel',
424
    options=('--std=c++14',)
425
)
426

427
# Custom reduction with atomics for complex reductions
428
atomic_kernel_code = r'''
429
extern "C" __global__
430
void atomic_histogram(int* data, int* hist, int n, int num_bins) {
431
    int i = blockIdx.x * blockDim.x + threadIdx.x;
432
    if (i < n) {
433
        int bin = data[i] % num_bins;
434
        atomicAdd(&hist[bin], 1);
435
    }
436
}
437
'''
438

439
histogram_kernel = cp.RawKernel(atomic_kernel_code, 'atomic_histogram')
440

441
# Create histogram using atomic operations
442
data = cp.random.randint(0, 100, size=1000000)
443
histogram = cp.zeros(100, dtype=cp.int32)
444

445
threads_per_block = 256
446
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
447

448
histogram_kernel(
449
    (blocks_per_grid,),
450
    (threads_per_block,),
451
    (data, histogram, data.size, 100)
452
)
453
```
454

455
## Advanced Features
456

457
### Kernel Fusion and Optimization
458

459
```python
460
# Fused kernel combining multiple operations
461
fused_kernel = cp.ElementwiseKernel(
462
    'T x, T y, T z, T alpha, T beta',
463
    'T result',
464
    '''
465
    T temp1 = x * alpha + y * beta;  // Linear combination
466
    T temp2 = temp1 * temp1;         // Square
467
    result = sqrt(temp2 + z);        // Final computation
468
    ''',
469
    'fused_operations'
470
)
471

472
# Memory coalescing optimization
473
coalesced_kernel_code = r'''
474
extern "C" __global__
475
void coalesced_transpose(float* input, float* output, int width, int height) {
476
    __shared__ float tile[32][32];
477
    
478
    int x = blockIdx.x * blockDim.x + threadIdx.x;
479
    int y = blockIdx.y * blockDim.y + threadIdx.y;
480
    
481
    // Read from global memory (coalesced)
482
    if (x < width && y < height) {
483
        tile[threadIdx.y][threadIdx.x] = input[y * width + x];
484
    }
485
    
486
    __syncthreads();
487
    
488
    // Calculate transposed coordinates
489
    x = blockIdx.y * blockDim.y + threadIdx.x;
490
    y = blockIdx.x * blockDim.x + threadIdx.y;
491
    
492
    // Write to global memory (coalesced)
493
    if (x < height && y < width) {
494
        output[y * height + x] = tile[threadIdx.x][threadIdx.y];
495
    }
496
}
497
'''
498

499
transpose_kernel = cp.RawKernel(coalesced_kernel_code, 'coalesced_transpose')
500
```
501

502
### Error Handling and Debugging
503

504
```python
505
# Kernel with error checking
506
debug_kernel_code = r'''
507
extern "C" __global__
508
void debug_kernel(float* input, float* output, int n) {
509
    int i = blockIdx.x * blockDim.x + threadIdx.x;
510
    
511
    if (i >= n) return;  // Bounds checking
512
    
513
    float value = input[i];
514
    
515
    // Check for invalid values
516
    if (isnan(value) || isinf(value)) {
517
        printf("Invalid value at index %d: %f\n", i, value);
518
        output[i] = 0.0f;
519
    } else {
520
        output[i] = sqrt(value);
521
    }
522
}
523
'''
524

525
debug_kernel = cp.RawKernel(debug_kernel_code, 'debug_kernel')
526

527
# Use with error checking enabled
528
cp.cuda.set_device(0)
529
try:
530
    result = debug_kernel((grid_size,), (block_size,), (input_data, output, n))
531
    cp.cuda.synchronize()  # Ensure completion and flush printf
532
except cp.cuda.runtime.CUDARuntimeError as e:
533
    print(f"Kernel execution failed: {e}")
534
```
535

536
Custom kernel development in CuPy provides the foundation for high-performance GPU computing, enabling developers to implement specialized algorithms, optimize critical computational bottlenecks, and access advanced CUDA features while maintaining integration with the broader CuPy ecosystem.

Version

Tile

Files

custom-kernels.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

custom-kernels.mddocs/