
# Custom Kernel Development


CuPy provides comprehensive facilities for developing custom CUDA kernels, enabling users to write high-performance GPU code without leaving Python. Three mechanisms cover most needs: element-wise kernels, reduction kernels, and raw CUDA kernels for specialized computational tasks that require maximum performance or functionality not available in standard CuPy operations.


## Capabilities


### Element-wise Kernels


Element-wise kernels apply operations to individual elements of arrays, supporting broadcasting and multiple input/output arrays.


```python { .api }
class ElementwiseKernel:
    """
    Compile and execute element-wise CUDA kernels from C++ code.

    ElementwiseKernel enables writing custom element-wise operations
    that can be applied to CuPy arrays with automatic broadcasting
    and type handling.
    """
    def __init__(self, in_params, out_params, operation, name="kernel", **kwargs):
        """
        Parameters:
            in_params: str - Input parameter declarations (e.g., "T x, T y")
            out_params: str - Output parameter declarations (e.g., "T z")
            operation: str - C++ code for the element-wise operation
            name: str, optional - Kernel name for debugging
            reduce_dims: bool, optional - Collapse contiguous dimensions to simplify indexing
            preamble: str, optional - C++ code prepended to the kernel (types, device functions)
            no_return: bool, optional - Kernel doesn't return a value
        """

    def __call__(self, *args, **kwargs):
        """
        Execute the kernel on input arrays.

        Parameters:
            *args: Input arrays matching the in_params specification
            **kwargs: Additional execution parameters
        """

def create_elementwise_kernel(in_params, out_params, operation, name="kernel"):
    """
    Create an element-wise kernel function.

    Parameters:
        in_params: str - Input parameter declarations
        out_params: str - Output parameter declarations
        operation: str - C++ operation code
        name: str, optional - Kernel name
    """
```
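Broadcasting and type handling work as described above. A minimal sketch, assuming a CUDA-capable device (the `sqdiff` kernel and its values are illustrative):

```python
import cupy as cp

# Generic squared-difference kernel; "T" is resolved from the input dtypes.
sqdiff = cp.ElementwiseKernel(
    'T x, T y', 'T z',
    'z = (x - y) * (x - y)',
    'sqdiff')

a = cp.arange(6, dtype=cp.float32).reshape(2, 3)
b = cp.float32(10)   # scalar broadcasts against the (2, 3) array
print(sqdiff(a, b))  # shape (2, 3): element-wise (a - 10)**2
```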


### Reduction Kernels


Reduction kernels perform operations that reduce arrays along specified axes, such as sum, max, min, or custom reduction operations.


```python { .api }
class ReductionKernel:
    """
    Compile and execute reduction CUDA kernels from C++ code.

    ReductionKernel enables writing custom reduction operations
    that can reduce arrays along specified axes with automatic
    handling of different reduction strategies.
    """
    def __init__(self, in_params, out_params, map_expr, reduce_expr,
                 post_map_expr, identity, name="reduce_kernel", **kwargs):
        """
        Parameters:
            in_params: str - Input parameter declarations
            out_params: str - Output parameter declarations
            map_expr: str - Expression applied to each input element
            reduce_expr: str - Expression combining mapped values (operands "a" and "b")
            post_map_expr: str - Expression applied after reduction; assigns to the output variable
            identity: str - Identity value for the reduction
            name: str, optional - Kernel name
            reduce_type: str, optional - Data type for intermediate reductions
            preamble: str, optional - C++ code prepended to the kernel
        """

    def __call__(self, *args, axis=None, dtype=None, out=None, keepdims=False):
        """
        Execute the reduction kernel.

        Parameters:
            *args: Input arrays
            axis: int or tuple, optional - Axes along which to reduce
            dtype: data-type, optional - Output data type
            out: ndarray, optional - Output array
            keepdims: bool, optional - Keep reduced dimensions
        """

def create_reduction_kernel(in_params, out_params, map_expr, reduce_expr, **kwargs):
    """
    Create a reduction kernel function.

    Parameters:
        in_params: str - Input parameter declarations
        out_params: str - Output parameter declarations
        map_expr: str - Mapping expression
        reduce_expr: str - Reduction expression
        **kwargs: Additional kernel parameters
    """
```
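The `axis` argument of `__call__` gives NumPy-style axis reductions for free. A short sketch following the L2-norm pattern from CuPy's documentation (the array shape is illustrative):

```python
import cupy as cp

# L2 norm expressed as map -> reduce -> post-map.
l2norm = cp.ReductionKernel(
    'T x',          # input
    'T y',          # output
    'x * x',        # map each element
    'a + b',        # combine two mapped values
    'y = sqrt(a)',  # post-reduction transform
    '0',            # identity of the reduction
    'l2norm')

m = cp.arange(12, dtype=cp.float32).reshape(3, 4)
print(l2norm(m, axis=1))  # one norm per row, shape (3,)
```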


### Raw CUDA Kernels


Raw kernels provide direct access to CUDA kernel development with full control over thread indexing, shared memory, and GPU programming constructs.


```python { .api }
class RawKernel:
    """
    Compile and execute raw CUDA kernels from C++ source code.

    RawKernel provides complete control over CUDA kernel development,
    allowing direct manipulation of thread indices, shared memory,
    and advanced GPU programming patterns.
    """
    def __init__(self, code, name, **kwargs):
        """
        Parameters:
            code: str - Complete CUDA C++ kernel source code
            name: str - Kernel function name
            options: tuple, optional - Compilation options
            backend: str, optional - Compilation backend ('nvcc' or 'nvrtc')
            translate_cucomplex: bool, optional - Handle cuComplex number types
            enable_cooperative_groups: bool, optional - Enable cooperative groups
            jitify: bool, optional - Use Jitify for compilation
        """

    def __call__(self, grid, block, args=(), shared_mem=0, stream=None):
        """
        Launch the CUDA kernel.

        Parameters:
            grid: tuple - Grid dimensions (blocks per grid)
            block: tuple - Block dimensions (threads per block)
            args: tuple, optional - Kernel arguments
            shared_mem: int, optional - Dynamic shared memory size in bytes
            stream: Stream, optional - CUDA stream for execution
        """

    @property
    def attributes(self):
        """Get kernel attributes (max threads per block, register usage, etc.)."""

class RawModule:
    """
    Load and manage CUDA modules containing multiple kernels.

    RawModule allows loading CUDA source or pre-compiled modules
    (PTX/CUBIN) and accessing multiple kernel functions within the module.
    """
    def __init__(self, code=None, *, path=None, **kwargs):
        """
        Parameters:
            code: str, optional - CUDA C++ source code for the module
            path: str, optional - Path to a pre-compiled module (CUBIN or PTX)
        """

    def get_function(self, name):
        """
        Get a kernel function from the module.

        Parameters:
            name: str - Function name in the module
        """

    def get_global(self, name):
        """
        Get a pointer to a global variable in the module.

        Parameters:
            name: str - Global variable name
        """

def compile_with_cache(source, **kwargs):
    """
    Compile CUDA source with caching for faster subsequent uses.

    Parameters:
        source: str - CUDA C++ source code
        **kwargs: Compilation options
    """
```
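A brief sketch of the `RawModule` workflow, compiling two kernels and one device global from a single source string (the kernel names and the `scale` variable are illustrative):

```python
import cupy as cp

module_src = r'''
__device__ float scale;  // module-level global, written from the host

extern "C" __global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

extern "C" __global__ void mul_scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= scale;
}
'''

mod = cp.RawModule(code=module_src)
add_one = mod.get_function('add_one')
mul_scale = mod.get_function('mul_scale')

# Write the module global through the pointer returned by get_global
scale_view = cp.ndarray((), cp.float32, mod.get_global('scale'))
scale_view[...] = 3.0

x = cp.ones(1024, dtype=cp.float32)
add_one((4,), (256,), (x, cp.int32(x.size)))    # x -> 2.0
mul_scale((4,), (256,), (x, cp.int32(x.size)))  # x -> 6.0
```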


### Kernel Utilities


Utility functions and classes for kernel development and management.


```python { .api }
def memoize(for_each_device=False):
    """
    Decorator for memoizing kernel compilation results.

    Parameters:
        for_each_device: bool, optional - Cache separately for each device
    """

def clear_memo():
    """Clear all memoized kernel compilation cache."""

class ParameterInfo:
    """
    Information about kernel parameters for type checking and validation.

    Provides metadata about kernel parameter types, shapes, and constraints
    for automatic validation and optimization.
    """
    def __init__(self, param, is_const=False): ...

def _get_param_info(s, is_const=False):
    """
    Parse parameter declaration strings into ParameterInfo objects.

    Parameters:
        s: str - Parameter declaration string
        is_const: bool, optional - Whether parameter is const
    """
```
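A compiled kernel's `attributes` dictionary is useful when picking launch dimensions. A small sketch; the attribute keys shown are assumptions based on CUDA's function-attribute names, and values vary by device:

```python
import cupy as cp

probe = cp.RawKernel(r'''
extern "C" __global__ void probe_kernel(float* x) {
    x[0] = 1.0f;
}
''', 'probe_kernel')

# Accessing .attributes triggers compilation if it hasn't happened yet.
attrs = probe.attributes
print(attrs['max_threads_per_block'])  # upper bound on block size
print(attrs['num_regs'])               # registers used per thread
```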


## Usage Examples


### Element-wise Kernel Example


```python
import cupy as cp

# Define a custom element-wise kernel for vector addition with scaling
add_scale_kernel = cp.ElementwiseKernel(
    'T x, T y, T scale',    # Input parameters
    'T z',                  # Output parameter
    'z = (x + y) * scale',  # Operation
    'add_scale_kernel'      # Kernel name
)

# Create input arrays
x = cp.array([1, 2, 3, 4], dtype=cp.float32)
y = cp.array([5, 6, 7, 8], dtype=cp.float32)
scale = cp.float32(2.0)

# Execute the kernel
result = add_scale_kernel(x, y, scale)
print(result)  # [12. 16. 20. 24.]

# Complex element-wise operation with multiple outputs
complex_kernel = cp.ElementwiseKernel(
    'T x, T y',
    'T sum, T diff, T prod',
    '''
    sum = x + y;
    diff = x - y;
    prod = x * y;
    ''',
    'complex_ops'
)

# Multiple outputs
x = cp.array([10, 20, 30])
y = cp.array([1, 2, 3])
sum_out, diff_out, prod_out = complex_kernel(x, y)
```


### Reduction Kernel Example


```python
# Custom reduction kernel for computing variance
variance_kernel = cp.ReductionKernel(
    'T x, T mean',              # Input parameters
    'T variance',               # Output parameter
    '(x - mean) * (x - mean)',  # Map expression
    'a + b',                    # Reduce expression (sum of squares)
    'variance = a',             # Post-map: assign the reduced sum
    '0',                        # Identity (zero)
    'variance_kernel'           # Kernel name
)

# Compute variance using the custom kernel
data = cp.random.normal(0, 1, 1000000)
mean = cp.mean(data)
variance = variance_kernel(data, mean) / (data.size - 1)

# Reduction with post-processing
rms_kernel = cp.ReductionKernel(
    'T x',
    'T rms',
    'x * x',                           # Square each element
    'a + b',                           # Sum the squares
    'rms = sqrt(a / _in_ind.size())',  # Post-processing: sqrt(sum / n)
    '0',                               # Identity
    'rms_kernel'
)

data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
rms_value = rms_kernel(data)
```


### Raw Kernel Example


```python
# Raw CUDA kernel for matrix multiplication (simplified)
matmul_kernel_code = r'''
extern "C" __global__
void matmul_kernel(float* A, float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
'''

# Compile the kernel
matmul_kernel = cp.RawKernel(matmul_kernel_code, 'matmul_kernel')

# Prepare matrices
M, N, K = 1024, 1024, 1024
A = cp.random.rand(M, K, dtype=cp.float32)
B = cp.random.rand(K, N, dtype=cp.float32)
C = cp.zeros((M, N), dtype=cp.float32)

# Launch kernel with appropriate grid/block dimensions
threads_per_block = (16, 16)
blocks_per_grid = (
    (N + threads_per_block[0] - 1) // threads_per_block[0],
    (M + threads_per_block[1] - 1) // threads_per_block[1]
)

# Scalar arguments must match the kernel's C types: plain Python ints
# are passed as long long, so cast to int32 for `int` parameters.
matmul_kernel(
    blocks_per_grid,
    threads_per_block,
    (A, B, C, cp.int32(M), cp.int32(N), cp.int32(K))
)

# Advanced raw kernel with shared memory
shared_mem_kernel_code = r'''
extern "C" __global__
void vector_reduction(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Perform reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write result for this block to global memory
    if (tid == 0) output[blockIdx.x] = sdata[0];
}
'''

shared_kernel = cp.RawKernel(shared_mem_kernel_code, 'vector_reduction')

# Input data
input_data = cp.random.rand(1000000, dtype=cp.float32)
block_size = 256
grid_size = (input_data.size + block_size - 1) // block_size
output = cp.zeros(grid_size, dtype=cp.float32)

# Launch with dynamic shared memory
shared_memory_size = block_size * 4  # 4 bytes per float
shared_kernel(
    (grid_size,),
    (block_size,),
    (input_data, output, cp.int32(input_data.size)),
    shared_mem=shared_memory_size
)

# Final reduction on CPU or with another kernel
final_result = cp.sum(output)
```


### Performance Optimization Examples


```python
# Using kernel caching for better performance
@cp.memoize(for_each_device=True)
def create_optimized_kernel():
    return cp.ElementwiseKernel(
        'T x, T y, T alpha, T beta',
        'T z',
        'z = alpha * x + beta * y',  # AXPY operation
        'axpy_kernel'
    )

# Reuse the compiled kernel multiple times
axpy_kernel = create_optimized_kernel()
x = cp.random.rand(1000, dtype=cp.float32)
y = cp.random.rand(1000, dtype=cp.float32)
alpha, beta = cp.float32(2.0), cp.float32(0.5)
for i in range(100):
    result = axpy_kernel(x, y, alpha, beta)

# Template specialization for different data types
template_kernel_code = r'''
template<typename T>
__device__ T complex_function(T x, T y) {
    return sqrt(x * x + y * y);
}

extern "C" __global__
void magnitude_kernel(float* x, float* y, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] = complex_function(x[i], y[i]);
    }
}
'''

# Compile with template support
magnitude_kernel = cp.RawKernel(
    template_kernel_code,
    'magnitude_kernel',
    options=('--std=c++14',)
)

# Custom reduction with atomics for complex reductions
atomic_kernel_code = r'''
extern "C" __global__
void atomic_histogram(int* data, int* hist, int n, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = data[i] % num_bins;
        atomicAdd(&hist[bin], 1);
    }
}
'''

histogram_kernel = cp.RawKernel(atomic_kernel_code, 'atomic_histogram')

# Create histogram using atomic operations; dtypes must match the
# kernel's 32-bit int parameters
data = cp.random.randint(0, 100, size=1000000, dtype=cp.int32)
histogram = cp.zeros(100, dtype=cp.int32)

threads_per_block = 256
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

histogram_kernel(
    (blocks_per_grid,),
    (threads_per_block,),
    (data, histogram, cp.int32(data.size), cp.int32(100))
)
```


## Advanced Features


### Kernel Fusion and Optimization


```python
# Fused kernel combining multiple operations in one pass
fused_kernel = cp.ElementwiseKernel(
    'T x, T y, T z, T alpha, T beta',
    'T result',
    '''
    T temp1 = x * alpha + y * beta;  // Linear combination
    T temp2 = temp1 * temp1;         // Square
    result = sqrt(temp2 + z);        // Final computation
    ''',
    'fused_operations'
)

# Memory coalescing optimization (the tile assumes 32x32 thread blocks)
coalesced_kernel_code = r'''
extern "C" __global__
void coalesced_transpose(float* input, float* output, int width, int height) {
    __shared__ float tile[32][32];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Read from global memory (coalesced)
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = input[y * width + x];
    }

    __syncthreads();

    // Calculate transposed coordinates
    x = blockIdx.y * blockDim.y + threadIdx.x;
    y = blockIdx.x * blockDim.x + threadIdx.y;

    // Write to global memory (coalesced)
    if (x < height && y < width) {
        output[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}
'''

transpose_kernel = cp.RawKernel(coalesced_kernel_code, 'coalesced_transpose')
```
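The snippet above only compiles the transpose kernel. A possible launch, assuming the 32x32 thread blocks that the shared-memory tile is sized for (the matrix dimensions are illustrative):

```python
width, height = 2048, 1024
src = cp.random.rand(height, width, dtype=cp.float32)
dst = cp.empty((width, height), dtype=cp.float32)

block = (32, 32)
grid = ((width + 31) // 32, (height + 31) // 32)

transpose_kernel(grid, block, (src, dst, cp.int32(width), cp.int32(height)))
assert cp.allclose(dst, src.T)  # sanity check against CuPy's own transpose
```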


### Error Handling and Debugging


```python
# Kernel with error checking
debug_kernel_code = r'''
extern "C" __global__
void debug_kernel(float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i >= n) return;  // Bounds checking

    float value = input[i];

    // Check for invalid values
    if (isnan(value) || isinf(value)) {
        printf("Invalid value at index %d: %f\n", i, value);
        output[i] = 0.0f;
    } else {
        output[i] = sqrt(value);
    }
}
'''

debug_kernel = cp.RawKernel(debug_kernel_code, 'debug_kernel')

# Use with error checking enabled
cp.cuda.Device(0).use()
input_data = cp.random.rand(1 << 20, dtype=cp.float32)
output = cp.empty_like(input_data)
block_size = 256
grid_size = (input_data.size + block_size - 1) // block_size

try:
    debug_kernel((grid_size,), (block_size,),
                 (input_data, output, cp.int32(input_data.size)))
    cp.cuda.Device().synchronize()  # Ensure completion and flush printf
except cp.cuda.runtime.CUDARuntimeError as e:
    print(f"Kernel execution failed: {e}")
```


Custom kernel development in CuPy provides the foundation for high-performance GPU computing, enabling developers to implement specialized algorithms, optimize critical computational bottlenecks, and access advanced CUDA features while maintaining integration with the broader CuPy ecosystem.