# Custom Kernels

Advanced kernel creation mechanisms for implementing custom GPU operations using CUDA C/C++ code or element-wise operations. These mechanisms enable high-performance custom computations that go beyond CuPy's built-in functions.

## Capabilities

### ElementwiseKernel

Create custom element-wise operations that apply functions to array elements in parallel.

```python { .api }
class ElementwiseKernel:
    """
    Custom element-wise kernel for parallel array operations.

    Parameters:
    - in_params: str, input parameter specification
    - out_params: str, output parameter specification
    - operation: str, CUDA C code for element-wise operation
    - name: str, kernel name
    - options: tuple, NVCC compiler options
    - preamble: str, code to prepend to kernel
    - loop_prep: str, code before main loop
    - after_loop: str, code after main loop
    """
    def __init__(self, in_params, out_params, operation, name='kernel',
                 options=(), preamble='', loop_prep='', after_loop=''): ...

    def __call__(self, *args, **kwargs):
        """
        Execute kernel on input arrays.

        Parameters:
        - *args: input arrays matching in_params specification
        - size: int, number of elements to process
        - stream: cupy.cuda.Stream, CUDA stream for execution

        Returns:
        cupy.ndarray or tuple: Output arrays
        """
```
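
Parameter specifications are comma-separated `type name` pairs; a single-letter placeholder such as `T` is resolved from the dtypes of the actual arguments at call time. A minimal type-generic sketch (the `squared_diff` kernel below is illustrative, not part of the API above):

```python
import cupy as cp

# Type-generic element-wise kernel: T is inferred from the inputs at call time
squared_diff = cp.ElementwiseKernel(
    'T x, T y',               # inputs share one placeholder type
    'T z',                    # output uses the same resolved type
    'z = (x - y) * (x - y)',  # CUDA C expression evaluated per element
    'squared_diff'
)

a = cp.arange(10, dtype=cp.float32)
b = cp.arange(10, dtype=cp.float32)[::-1]
print(squared_diff(a, b))  # float32 output
```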

### RawKernel

Create kernels from raw CUDA C/C++ source code for maximum flexibility and performance.

```python { .api }
class RawKernel:
    """
    Raw CUDA kernel from C/C++ source code.

    Parameters:
    - code: str, CUDA C/C++ source code
    - name: str, kernel function name in source code
    - options: tuple, NVCC compiler options
    - backend: str, compilation backend ('nvrtc' or 'nvcc')
    - translate_cucomplex: bool, translate cuComplex types
    """
    def __init__(self, code, name, options=(), backend='nvrtc', translate_cucomplex=False): ...

    def __call__(self, grid, block, args, **kwargs):
        """
        Launch kernel with specified grid and block dimensions.

        Parameters:
        - grid: tuple, grid dimensions (blocks)
        - block: tuple, block dimensions (threads per block)
        - args: tuple, kernel arguments
        - shared_mem: int, shared memory size in bytes
        - stream: cupy.cuda.Stream, CUDA stream for execution
        """
```
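
Scalar kernel arguments are passed by value, so their byte width should match the C parameter type; wrapping them in NumPy/CuPy scalar types is the safe way to do that. A minimal sketch (the `scale_by` kernel below is illustrative, not part of the API above):

```python
import cupy as cp

# Illustrative kernel whose C signature expects a 32-bit float and a 32-bit int
scale_by = cp.RawKernel(r'''
extern "C" __global__ void scale_by(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
''', 'scale_by')

x = cp.arange(10, dtype=cp.float32)
# Wrap scalars so their widths match `float s` and `int n`
scale_by((1,), (32,), (x, cp.float32(3.0), cp.int32(x.size)))
```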

### RawModule

Load and manage CUDA modules containing multiple kernels and device functions.

```python { .api }
class RawModule:
    """
    CUDA module containing multiple functions.

    Parameters:
    - code: str, CUDA C/C++ source code
    - options: tuple, NVCC compiler options
    - backend: str, compilation backend
    - translate_cucomplex: bool, translate cuComplex types
    """
    def __init__(self, code, options=(), backend='nvrtc', translate_cucomplex=False): ...

    def get_function(self, name):
        """
        Get kernel function by name.

        Parameters:
        - name: str, function name

        Returns:
        RawKernel: Kernel function object
        """
```

### ReductionKernel

Create custom reduction operations that aggregate array elements using associative operations.

```python { .api }
class ReductionKernel:
    """
    Custom reduction kernel for parallel aggregation operations.

    Parameters:
    - in_params: str, input parameter specification
    - out_params: str, output parameter specification
    - map_expr: str, mapping expression applied to each element
    - reduce_expr: str, reduction expression for combining values
    - post_map_expr: str, expression applied to the result after reduction
    - identity: str, identity value for the reduction
    - name: str, kernel name
    - reduce_type: str, intermediate reduction data type
    - options: tuple, NVCC compiler options
    - preamble: str, code to prepend to kernel
    """
    def __init__(self, in_params, out_params, map_expr, reduce_expr,
                 post_map_expr, identity, name='reduce_kernel',
                 reduce_type=None, options=(), preamble=''): ...

    def __call__(self, *args, **kwargs):
        """
        Execute reduction kernel.

        Parameters:
        - *args: input arrays matching in_params specification
        - axis: int/tuple, axis along which to reduce
        - keepdims: bool, keep reduced dimensions
        - stream: cupy.cuda.Stream, CUDA stream for execution

        Returns:
        cupy.ndarray: Reduced result
        """
```
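
The arguments map onto a map → reduce → post-map pipeline: `map_expr` runs per element, `reduce_expr` combines pairs of mapped values `a` and `b`, and `post_map_expr` writes the finished value to the output variable. A minimal L2-norm sketch (illustrative, not part of the API above):

```python
import cupy as cp

# L2 norm per row: square each element, sum, then take the square root of the total
l2norm_kernel = cp.ReductionKernel(
    'T x',          # input
    'T y',          # output
    'x * x',        # map: square each element
    'a + b',        # reduce: sum the squares
    'y = sqrt(a)',  # post-map: finish with a square root
    '0',            # identity for the sum
    'l2norm'
)

data = cp.arange(10, dtype=cp.float32).reshape(2, 5)
print(l2norm_kernel(data, axis=1))
```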

### Fusion

Fuse multiple operations into single kernels for improved performance.

```python { .api }
def fuse(*args, **kwargs):
    """
    Decorator for fusing multiple CuPy operations.

    Parameters:
    - kernel_name: str, name for fused kernel

    Returns:
    function: Fused function that executes as single kernel
    """

@fuse()
def fused_function(x, y):
    """Example fused function combining multiple operations."""
    return cp.sqrt(x**2 + y**2) * cp.sin(x + y)
```

## Usage Examples

### ElementwiseKernel Examples

```python
import cupy as cp

# Simple element-wise operation
add_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y',   # Input parameters
    'float32 z',              # Output parameters
    'z = x + y',              # Operation
    'add_kernel'              # Kernel name
)

# Use the kernel
a = cp.random.random((1000, 1000), dtype=cp.float32)
b = cp.random.random((1000, 1000), dtype=cp.float32)
result = add_kernel(a, b)

# More complex element-wise operation
complex_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y, float32 alpha',
    'float32 z',
    '''
    float temp = x * x + y * y;   // the operation body is plain CUDA C, so use C type names
    z = alpha * sqrt(temp) + sin(x + y);
    ''',
    'complex_kernel'
)

result = complex_kernel(a, b, cp.float32(2.5))  # pass the scalar with matching precision
```

### RawKernel Examples

```python
# Matrix addition kernel
matrix_add_code = '''
extern "C" __global__
void matrix_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
'''

matrix_add_kernel = cp.RawKernel(matrix_add_code, 'matrix_add')

# Launch kernel
n = 1000000
a_gpu = cp.random.random(n, dtype=cp.float32)
b_gpu = cp.random.random(n, dtype=cp.float32)
c_gpu = cp.zeros(n, dtype=cp.float32)

# Calculate grid and block dimensions
block_size = 256
grid_size = (n + block_size - 1) // block_size

# Pass the scalar as int32 so it matches the kernel's `int n` parameter
matrix_add_kernel((grid_size,), (block_size,), (a_gpu, b_gpu, c_gpu, cp.int32(n)))

# More advanced kernel with shared memory
shared_memory_code = '''
extern "C" __global__
void reduce_sum(float* input, float* output, int n) {
    extern __shared__ float shared_data[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    shared_data[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    // Perform reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            shared_data[tid] += shared_data[tid + stride];
        }
        __syncthreads();
    }

    // Write result
    if (tid == 0) {
        output[blockIdx.x] = shared_data[0];
    }
}
'''

reduce_kernel = cp.RawKernel(shared_memory_code, 'reduce_sum')

# Use kernel with shared memory
input_data = cp.random.random(1000000, dtype=cp.float32)
output_size = (len(input_data) + block_size - 1) // block_size
output_data = cp.zeros(output_size, dtype=cp.float32)

shared_mem_size = block_size * 4  # 4 bytes per float
reduce_kernel((output_size,), (block_size,),
              (input_data, output_data, cp.int32(len(input_data))),
              shared_mem=shared_mem_size)
```
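
Note that the `reduce_sum` launch above leaves one partial sum per block in `output_data`, so the reduction still needs a final combining step, for example with a built-in reduction over the partials (using the arrays defined above):

```python
# Combine the per-block partial sums into the final result
total_sum = cp.sum(output_data)
```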

### ReductionKernel Examples

```python
# Custom sum reduction
sum_kernel = cp.ReductionKernel(
    'T x',         # Input parameter
    'T y',         # Output parameter
    'x',           # Map expression (identity)
    'a + b',       # Reduction expression
    'y = a',       # Post-reduction expression
    '0',           # Identity value
    'sum_kernel'   # Kernel name
)

data = cp.random.random((1000, 1000))
total = sum_kernel(data)
row_sums = sum_kernel(data, axis=1)

# Custom standard deviation reduction
std_kernel = cp.ReductionKernel(
    'T x, T mean',               # Input parameters
    'T y',                       # Output parameter
    '(x - mean) * (x - mean)',   # Map expression (squared differences)
    'a + b',                     # Reduction expression (sum)
    'y = a',                     # Post-reduction expression
    '0',                         # Identity value
    'std_kernel'                 # Kernel name
)

# Calculate standard deviation
data = cp.random.normal(0, 1, (1000, 1000))
mean_val = cp.mean(data, axis=1, keepdims=True)
variance = std_kernel(data, mean_val, axis=1) / (data.shape[1] - 1)
std_dev = cp.sqrt(variance)
```
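
As a quick sanity check (using the arrays above), the custom reduction should agree with CuPy's built-in standard deviation; `ddof=1` matches the `n - 1` divisor used above:

```python
# The custom reduction should match cp.std with the same (sample) normalization
assert cp.allclose(std_dev, cp.std(data, axis=1, ddof=1))
```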

### RawModule Examples

```python
# Module with multiple functions
module_code = '''
extern "C" {

__device__ float square(float x) {
    return x * x;
}

__global__ void vector_norm(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = sqrt(square(input[idx]));
    }
}

__global__ void vector_scale(float* input, float* output, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] * scale;
    }
}

}
'''

module = cp.RawModule(code=module_code)
norm_kernel = module.get_function('vector_norm')
scale_kernel = module.get_function('vector_scale')

# Use kernels from module
input_vec = cp.random.random(100000, dtype=cp.float32)
output_vec = cp.zeros_like(input_vec)

# Calculate norms
grid_size = (len(input_vec) + 255) // 256
norm_kernel((grid_size,), (256,), (input_vec, output_vec, cp.int32(len(input_vec))))

# Scale vector (wrap scalars to match the kernel's float/int parameter types)
scaled_vec = cp.zeros_like(input_vec)
scale_kernel((grid_size,), (256,),
             (input_vec, scaled_vec, cp.float32(2.5), cp.int32(len(input_vec))))
```

### Function Fusion Examples

```python
# Fuse multiple operations for better performance
@cp.fuse(kernel_name='fused_operations')
def complex_computation(x, y, z):
    """Fused function combining multiple mathematical operations."""
    temp1 = cp.sin(x) * cp.cos(y)
    temp2 = cp.exp(-z**2)
    return temp1 * temp2 + cp.sqrt(x**2 + y**2)

# Use fused function
x = cp.random.random((1000, 1000))
y = cp.random.random((1000, 1000))
z = cp.random.random((1000, 1000))

result = complex_computation(x, y, z)  # Executes as single fused kernel

# Compare with unfused version
def unfused_computation(x, y, z):
    """Same computation without fusion."""
    temp1 = cp.sin(x) * cp.cos(y)
    temp2 = cp.exp(-z**2)
    return temp1 * temp2 + cp.sqrt(x**2 + y**2)

# Fused version is typically faster due to reduced memory traffic
```
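
To quantify the benefit, both versions can be timed with CuPy's benchmarking helper (a sketch assuming a CuPy version that provides `cupyx.profiler.benchmark`, which synchronizes and reports GPU times):

```python
from cupyx.profiler import benchmark

# Compare fused vs. unfused execution time on the arrays defined above
print(benchmark(complex_computation, (x, y, z), n_repeat=100))
print(benchmark(unfused_computation, (x, y, z), n_repeat=100))
```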

### Performance Optimization

```python
# Kernel with optimized memory access patterns
optimized_code = '''
extern "C" __global__
void optimized_transpose(float* input, float* output, int rows, int cols) {
    __shared__ float tile[32][32+1];  // +1 to avoid bank conflicts

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Coalesced read from global memory
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = input[y * cols + x];
    }

    __syncthreads();

    // Compute transposed coordinates
    x = blockIdx.y * blockDim.y + threadIdx.x;
    y = blockIdx.x * blockDim.x + threadIdx.y;

    // Coalesced write to global memory
    if (x < rows && y < cols) {
        output[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}
'''

transpose_kernel = cp.RawKernel(optimized_code, 'optimized_transpose')

# Use optimized kernel
matrix = cp.random.random((4096, 4096), dtype=cp.float32)
transposed = cp.zeros((4096, 4096), dtype=cp.float32)

block_dim = (32, 32)
grid_dim = ((matrix.shape[1] + 31) // 32, (matrix.shape[0] + 31) // 32)

# Pass the scalar dimensions as int32 to match the kernel's `int` parameters
transpose_kernel(grid_dim, block_dim,
                 (matrix, transposed, cp.int32(matrix.shape[0]), cp.int32(matrix.shape[1])))
```
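
A quick correctness check for the kernel above (using the same arrays) is to compare against CuPy's built-in transpose:

```python
# Verify the custom transpose against the built-in view-based transpose
assert cp.allclose(transposed, matrix.T)
```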

## Best Practices

### Kernel Development Guidelines

```python
# 1. Use appropriate data types
kernel_float32 = cp.ElementwiseKernel(
    'float32 x, float32 y',   # Specify exact precision needed
    'float32 z',
    'z = x + y',
    'add_f32'
)

# 2. Optimize memory access patterns
# Good: Coalesced access
coalesced_kernel = cp.RawKernel('''
extern "C" __global__ void coalesced_access(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;  // Sequential access
    }
}
''', 'coalesced_access')

# 3. Use shared memory for data reuse
shared_mem_kernel = cp.RawKernel('''
extern "C" __global__ void use_shared_memory(float* input, float* output, int n) {
    extern __shared__ float shared[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load to shared memory
    shared[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    // Process using shared memory
    if (idx < n) {
        output[idx] = shared[tid] * 2.0f;
    }
}
''', 'use_shared_memory')

# 4. Handle boundary conditions properly
boundary_safe_kernel = cp.ElementwiseKernel(
    'raw T input, int32 size',
    'T output',
    '''
    int idx = i;  // Current thread index
    if (idx < size) {
        output = input[idx] * 2;
    }
    ''',
    'boundary_safe'
)
```