Tessl Tile for pypi/cupy-cuda113@9.6.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

array-operations.md cuda-integration.md cupy-extensions.md custom-kernels.md fft-operations.md index.md linear-algebra.md math-functions.md random-generation.md statistical-functions.md

custom-kernels.mddocs/

0
# Custom Kernels
1

2
User-defined CUDA kernel creation through ElementwiseKernel, ReductionKernel, and RawKernel classes, enabling custom GPU operations and performance-critical computations. These tools allow developers to write custom CUDA code while maintaining CuPy's array interface.
3

4
## Capabilities
5

6
### ElementwiseKernel
7

8
Create custom element-wise operations that apply functions to each element of input arrays.
9

10
```python { .api }
11
class ElementwiseKernel:
12
    """User-defined elementwise kernel for custom element-wise operations.
13
    
14
    Enables creation of custom CUDA kernels that operate element-wise
15
    on input arrays, similar to NumPy universal functions but with
16
    custom GPU-optimized implementations.
17
    """
18
    
19
    def __init__(self, in_params, out_params, operation, name='kernel', **kwargs):
20
        """Initialize elementwise kernel.
21
        
22
        Parameters:
23
        - in_params: str, input parameter specification (e.g., 'T x, T y')
24
        - out_params: str, output parameter specification (e.g., 'T z')
25
        - operation: str, CUDA C++ code for the operation
26
        - name: str, kernel name for debugging
27
        - reduce_dims: bool, whether to reduce dimensions
28
        - type_preamble: str, additional type definitions
29
        - preamble: str, additional CUDA code before kernel
30
        """
31
    
32
    def __call__(self, *args, **kwargs):
33
        """Execute kernel on input arrays.
34
        
35
        Parameters:
36
        - args: input arrays matching in_params specification
37
        - kwargs: additional kernel arguments
38
        
39
        Returns:
40
        cupy.ndarray: output array(s) as specified by out_params
41
        """
42
```
43

44
### ReductionKernel
45

46
Create custom reduction operations that combine array elements along specified axes.
47

48
```python { .api }
49
class ReductionKernel:
50
    """User-defined reduction kernel for custom reduction operations.
51
    
52
    Enables creation of custom CUDA reduction kernels that combine
53
    array elements along axes, similar to NumPy reduction functions
54
    but with custom GPU-optimized implementations.
55
    """
56
    
57
    def __init__(self, in_params, out_params, map_expr, reduce_expr, 
58
                 post_map_expr='', identity=None, name='reduce_kernel', **kwargs):
59
        """Initialize reduction kernel.
60
        
61
        Parameters:
62
        - in_params: str, input parameter specification
63
        - out_params: str, output parameter specification  
64
        - map_expr: str, expression to map input to intermediate values
65
        - reduce_expr: str, expression to reduce intermediate values
66
        - post_map_expr: str, expression to post-process mapped values
67
        - identity: str, identity value for reduction
68
        - name: str, kernel name for debugging
69
        - reduce_type: str, intermediate reduction type
70
        - type_preamble: str, additional type definitions
71
        - preamble: str, additional CUDA code
72
        """
73
    
74
    def __call__(self, *args, **kwargs):
75
        """Execute reduction kernel on input arrays.
76
        
77
        Parameters:
78
        - args: input arrays matching in_params specification
79
        - axis: int or tuple, axes to reduce over
80
        - keepdims: bool, whether to keep reduced dimensions
81
        
82
        Returns:
83
        cupy.ndarray: reduced output array
84
        """
85
```
86

87
### RawKernel
88

89
Create kernels with full control over CUDA code and execution parameters.
90

91
```python { .api }
92
class RawKernel:
93
    """User-defined raw kernel for maximum control over CUDA execution.
94
    
95
    Provides direct access to CUDA kernel launch parameters and
96
    complete control over kernel implementation, suitable for
97
    complex custom algorithms and performance optimization.
98
    """
99
    
100
    def __init__(self, code, name, backend='nvrtc', **kwargs):
101
        """Initialize raw kernel from CUDA source code.
102
        
103
        Parameters:
104
        - code: str, complete CUDA kernel source code
105
        - name: str, kernel function name in source code
106
        - backend: str, compilation backend ('nvrtc' or 'nvcc')
107
        - options: tuple, compiler options
108
        - jitify: bool, whether to use jitify for compilation
109
        - enable_cooperative_groups: bool, enable cooperative groups
110
        """
111
    
112
    def __call__(self, grid, block, args, **kwargs):
113
        """Execute raw kernel with specified launch configuration.
114
        
115
        Parameters:
116
        - grid: tuple, grid dimensions (gridDim)
117
        - block: tuple, block dimensions (blockDim)
118
        - args: tuple, kernel arguments
119
        - shared_mem: int, shared memory size in bytes
120
        - stream: Stream, CUDA stream for execution
121
        """
122
```
123

124
### RawModule
125

126
Load and manage complete CUDA modules with multiple kernels.
127

128
```python { .api }
129
class RawModule:
130
    """User-defined raw module for managing multiple CUDA kernels.
131
    
132
    Enables loading complete CUDA modules containing multiple
133
    kernel functions, constants, and device functions for
134
    complex GPU applications.
135
    """
136
    
137
    def __init__(self, code, backend='nvrtc', **kwargs):
138
        """Initialize raw module from CUDA source code.
139
        
140
        Parameters:
141
        - code: str, complete CUDA module source code
142
        - backend: str, compilation backend ('nvrtc' or 'nvcc')
143
        - options: tuple, compiler options
144
        - name_expressions: list, symbols to extract from module
145
        - jitify: bool, whether to use jitify
146
        """
147
    
148
    def get_function(self, name):
149
        """Get kernel function by name.
150
        
151
        Parameters:
152
        - name: str, kernel function name
153
        
154
        Returns:
155
        RawKernel: kernel function object
156
        """
157
    
158
    def get_global_var(self, name):
159
        """Get global variable by name.
160
        
161
        Parameters:
162
        - name: str, global variable name
163
        
164
        Returns:
165
        int: device pointer to global variable
166
        """
167
```
168

169
### Kernel Compilation and Caching
170

171
Utilities for kernel compilation and performance optimization.
172

173
```python { .api }
174
def memoize(for_each_device=False):
175
    """Decorator to memoize function results for performance.
176
    
177
    Parameters:
178
    - for_each_device: bool, whether to memoize per device
179
    
180
    Returns:
181
    callable: memoized function
182
    """
183

184
def clear_memo():
185
    """Clear memoization cache to free memory."""
186

187
def compile_with_cache(source, filename, dirname=None, **kwargs):
188
    """Compile CUDA source with caching for improved performance.
189
    
190
    Parameters:
191
    - source: str, CUDA source code
192
    - filename: str, source filename for cache key
193
    - dirname: str, directory for cache files
194
    - kwargs: additional compilation options
195
    
196
    Returns:
197
    compiled module object
198
    """
199
```
200

201
### JIT Compilation Interface
202

203
Just-in-time compilation for dynamic kernel generation.
204

205
```python { .api }
206
def rawkernel(mode='python', device=False):
207
    """Decorator for creating raw kernels from Python functions.
208
    
209
    Enables writing CUDA kernels using Python syntax with automatic
210
    compilation to CUDA C++ code.
211
    
212
    Parameters:
213
    - mode: str, compilation mode ('python' or 'cuda')
214
    - device: bool, whether function runs on device
215
    
216
    Returns:
217
    callable: decorated kernel function
218
    """
219
```
220

221
## Usage Examples
222

223
### Basic ElementwiseKernel
224

225
```python
226
import cupy as cp
227

228
# Define custom elementwise operation
229
add_kernel = cp.ElementwiseKernel(
230
    'float32 x, float32 y',     # Input parameters
231
    'float32 z',                # Output parameter
232
    'z = x + y * 2',           # Operation
233
    'custom_add'               # Kernel name
234
)
235

236
# Create input arrays
237
a = cp.array([1, 2, 3, 4], dtype=cp.float32)
238
b = cp.array([5, 6, 7, 8], dtype=cp.float32)
239

240
# Execute kernel
241
result = add_kernel(a, b)
242
print(result)  # [11, 14, 17, 20]
243

244
# More complex elementwise operation
245
complex_kernel = cp.ElementwiseKernel(
246
    'float32 x, float32 y, float32 alpha',
247
    'float32 z',
248
    '''
249
    float temp = x * alpha + y;
250
    z = temp > 0 ? temp : 0;  // ReLU activation
251
    ''',
252
    'relu_transform'
253
)
254

255
result = complex_kernel(a, b, 0.5)
256
```
257

258
### Custom ReductionKernel
259

260
```python
261
import cupy as cp
262

263
# Define custom reduction operation (sum of squares)
264
sum_of_squares = cp.ReductionKernel(
265
    'float32 x',           # Input parameter
266
    'float32 out',         # Output parameter  
267
    'x * x',              # Map expression (square each element)
268
    'a + b',              # Reduce expression (sum)
269
    '0',                  # Identity value
270
    'sum_of_squares'      # Kernel name
271
)
272

273
# Test the kernel
274
data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
275
result = sum_of_squares(data)
276
print(result)  # 55.0 (1² + 2² + 3² + 4² + 5²)
277

278
# Custom reduction with axis support
279
axis_result = sum_of_squares(data.reshape(1, -1), axis=1)
280
print(axis_result)  # [55.]
281

282
# More complex reduction: weighted mean
283
weighted_mean = cp.ReductionKernel(
284
    'float32 x, float32 w',
285
    'float32 out',
286
    'x * w',              # Multiply value by weight
287
    'a + b',              # Sum weighted values
288
    '0',
289
    'weighted_sum'
290
)
291

292
values = cp.array([1, 2, 3, 4], dtype=cp.float32)
293
weights = cp.array([0.1, 0.2, 0.3, 0.4], dtype=cp.float32)
294
weighted_sum = weighted_mean(values, weights)
295
total_weight = cp.sum(weights)
296
mean = weighted_sum / total_weight
297
print(f"Weighted mean: {mean}")
298
```
299

300
### Advanced RawKernel
301

302
```python
303
import cupy as cp
304

305
# Define complex CUDA kernel
306
matrix_multiply_kernel = cp.RawKernel(r'''
307
extern "C" __global__
308
void matrix_multiply(const float* A, const float* B, float* C, 
309
                    int M, int N, int K) {
310
    int row = blockIdx.y * blockDim.y + threadIdx.y;
311
    int col = blockIdx.x * blockDim.x + threadIdx.x;
312
    
313
    if (row < M && col < N) {
314
        float sum = 0.0f;
315
        for (int k = 0; k < K; k++) {
316
            sum += A[row * K + k] * B[k * N + col];
317
        }
318
        C[row * N + col] = sum;
319
    }
320
}
321
''', 'matrix_multiply')
322

323
# Create test matrices
324
M, N, K = 1024, 1024, 512
325
A = cp.random.random((M, K), dtype=cp.float32)
326
B = cp.random.random((K, N), dtype=cp.float32)
327
C = cp.zeros((M, N), dtype=cp.float32)
328

329
# Configure kernel launch
330
block_size = (16, 16)
331
grid_size = ((N + block_size[0] - 1) // block_size[0],
332
             (M + block_size[1] - 1) // block_size[1])
333

334
# Execute kernel
335
matrix_multiply_kernel(
336
    grid_size, block_size,
337
    (A, B, C, M, N, K)  # Kernel arguments
338
)
339

340
# Verify result
341
expected = cp.dot(A, B)
342
print(f"Results match: {cp.allclose(C, expected)}")
343
```
344

345
### RawModule with Multiple Kernels
346

347
```python
348
import cupy as cp
349

350
# Define module with multiple related kernels
351
cuda_module_code = r'''
352
extern "C" {
353

354
__device__ float activation_relu(float x) {
355
    return fmaxf(0.0f, x);
356
}
357

358
__device__ float activation_sigmoid(float x) {
359
    return 1.0f / (1.0f + expf(-x));
360
}
361

362
__global__ void apply_activation(const float* input, float* output, 
363
                                int size, int activation_type) {
364
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
365
    if (idx < size) {
366
        float x = input[idx];
367
        if (activation_type == 0) {
368
            output[idx] = activation_relu(x);
369
        } else if (activation_type == 1) {
370
            output[idx] = activation_sigmoid(x);
371
        }
372
    }
373
}
374

375
__global__ void vector_add(const float* a, const float* b, 
376
                          float* c, int size) {
377
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
378
    if (idx < size) {
379
        c[idx] = a[idx] + b[idx];
380
    }
381
}
382

383
}
384
'''
385

386
# Load module
387
module = cp.RawModule(code=cuda_module_code)
388

389
# Get kernel functions
390
activation_kernel = module.get_function('apply_activation')
391
add_kernel = module.get_function('vector_add')
392

393
# Test activation kernel
394
data = cp.array([-2, -1, 0, 1, 2], dtype=cp.float32)
395
output = cp.zeros_like(data)
396

397
block_size = 256
398
grid_size = (len(data) + block_size - 1) // block_size
399

400
# Apply ReLU (activation_type=0)
401
activation_kernel(
402
    (grid_size,), (block_size,),
403
    (data, output, data.size, 0)
404
)
405
print(f"ReLU: {output}")  # [0, 0, 0, 1, 2]
406

407
# Apply Sigmoid (activation_type=1)
408
activation_kernel(
409
    (grid_size,), (block_size,),
410
    (data, output, data.size, 1)
411
)
412
print(f"Sigmoid: {output}")
413
```
414

415
### Performance Optimization Techniques
416

417
```python
418
import cupy as cp
419
import time
420

421
# Kernel with shared memory optimization
422
optimized_kernel = cp.RawKernel(r'''
423
extern "C" __global__
424
void optimized_reduction(const float* input, float* output, int size) {
425
    extern __shared__ float sdata[];
426
    
427
    int tid = threadIdx.x;
428
    int i = blockIdx.x * blockDim.x + threadIdx.x;
429
    
430
    // Load data into shared memory
431
    sdata[tid] = (i < size) ? input[i] : 0;
432
    __syncthreads();
433
    
434
    // Perform reduction in shared memory
435
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
436
        if (tid < s) {
437
            sdata[tid] += sdata[tid + s];
438
        }
439
        __syncthreads();
440
    }
441
    
442
    // Write result for this block to global memory
443
    if (tid == 0) output[blockIdx.x] = sdata[0];
444
}
445
''', 'optimized_reduction')
446

447
# Benchmark against CuPy's built-in sum
448
data = cp.random.random(1000000, dtype=cp.float32)
449

450
# Time custom kernel
451
block_size = 256
452
grid_size = (data.size + block_size - 1) // block_size
453
output = cp.zeros(grid_size, dtype=cp.float32)
454

455
start_time = time.time()
456
for _ in range(100):
457
    optimized_kernel(
458
        (grid_size,), (block_size,),
459
        (data, output, data.size),
460
        shared_mem=block_size * 4  # 4 bytes per float
461
    )
462
    cp.cuda.Stream.null.synchronize()
463
custom_time = time.time() - start_time
464

465
# Time built-in sum
466
start_time = time.time()
467
for _ in range(100):
468
    builtin_result = cp.sum(data)
469
    cp.cuda.Stream.null.synchronize()
470
builtin_time = time.time() - start_time
471

472
custom_result = cp.sum(output)
473
print(f"Custom kernel time: {custom_time:.4f}s")
474
print(f"Built-in sum time: {builtin_time:.4f}s")
475
print(f"Results match: {cp.allclose(custom_result, builtin_result)}")
476
```
477

478
### Memory-Efficient Patterns
479

480
```python
481
import cupy as cp
482

483
# In-place operation kernel
484
inplace_kernel = cp.ElementwiseKernel(
485
    'float32 x, float32 alpha',
486
    'float32 x',  # Same array for input and output
487
    'x = x * alpha + 1',
488
    'inplace_transform'
489
)
490

491
# Create data
492
data = cp.random.random(1000000, dtype=cp.float32)
493
original_ptr = data.data.ptr
494

495
# Apply in-place transformation
496
inplace_kernel(data, 2.0, data)  # Modify data in-place
497

498
# Verify same memory location
499
print(f"Same memory location: {data.data.ptr == original_ptr}")
500

501
# Kernel with multiple outputs
502
multi_output_kernel = cp.ElementwiseKernel(
503
    'float32 x',
504
    'float32 sin_x, float32 cos_x, float32 tan_x',
505
    '''
506
    sin_x = sinf(x);
507
    cos_x = cosf(x);
508
    tan_x = tanf(x);
509
    ''',
510
    'trig_functions'
511
)
512

513
# Compute multiple trigonometric functions simultaneously
514
angles = cp.linspace(0, 2 * cp.pi, 1000, dtype=cp.float32)
515
sin_vals, cos_vals, tan_vals = multi_output_kernel(angles)
516

517
print(f"Identity check: {cp.allclose(sin_vals**2 + cos_vals**2, 1.0)}")
518
```

Version

Tile

Files

custom-kernels.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

custom-kernels.mddocs/