# Custom Kernels

CuPy supports user-defined CUDA kernel creation through the ElementwiseKernel, ReductionKernel, and RawKernel classes, enabling custom GPU operations and performance-critical computations. These tools allow developers to write custom CUDA code while maintaining CuPy's array interface.

## Capabilities

### ElementwiseKernel

Create custom element-wise operations that apply functions to each element of input arrays.

```python { .api }
class ElementwiseKernel:
    """User-defined elementwise kernel for custom element-wise operations.

    Enables creation of custom CUDA kernels that operate element-wise
    on input arrays, similar to NumPy universal functions but with
    custom GPU-optimized implementations.
    """

    def __init__(self, in_params, out_params, operation, name='kernel', **kwargs):
        """Initialize elementwise kernel.

        Parameters:
        - in_params: str, input parameter specification (e.g., 'T x, T y')
        - out_params: str, output parameter specification (e.g., 'T z')
        - operation: str, CUDA C++ code for the operation
        - name: str, kernel name for debugging
        - reduce_dims: bool, whether to reduce dimensions
        - type_preamble: str, additional type definitions
        - preamble: str, additional CUDA code before the kernel
        """

    def __call__(self, *args, **kwargs):
        """Execute kernel on input arrays.

        Parameters:
        - args: input arrays matching in_params specification
        - kwargs: additional kernel arguments

        Returns:
        cupy.ndarray: output array(s) as specified by out_params
        """
```
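
The `T` type placeholders in the parameter specifications make a kernel generic over dtypes; CuPy compiles one specialization per dtype combination actually used. A minimal sketch, using the classic squared-difference kernel:

```python
import cupy as cp

# 'T' is a type placeholder: the kernel is specialized per dtype on first use
squared_diff = cp.ElementwiseKernel(
    'T x, T y',
    'T z',
    'z = (x - y) * (x - y)',
    'squared_diff'
)

a = cp.arange(5, dtype=cp.float32)
b = cp.full(5, 2, dtype=cp.float32)
print(squared_diff(a, b))  # [4. 1. 0. 1. 4.]

# The same kernel object also accepts integer arrays
print(squared_diff(cp.arange(5), cp.full(5, 2)))  # [4 1 0 1 4]
```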

### ReductionKernel

Create custom reduction operations that combine array elements along specified axes.

```python { .api }
class ReductionKernel:
    """User-defined reduction kernel for custom reduction operations.

    Enables creation of custom CUDA reduction kernels that combine
    array elements along axes, similar to NumPy reduction functions
    but with custom GPU-optimized implementations.
    """

    def __init__(self, in_params, out_params, map_expr, reduce_expr,
                 post_map_expr, identity, name='reduce_kernel', **kwargs):
        """Initialize reduction kernel.

        Parameters:
        - in_params: str, input parameter specification
        - out_params: str, output parameter specification
        - map_expr: str, expression mapping each input element to an intermediate value
        - reduce_expr: str, expression combining two intermediate values, referenced as 'a' and 'b'
        - post_map_expr: str, expression mapping the reduced value (as 'a') to the output
        - identity: str, identity value for the reduction
        - name: str, kernel name for debugging
        - reduce_type: str, intermediate reduction type
        - type_preamble: str, additional type definitions
        - preamble: str, additional CUDA code
        """

    def __call__(self, *args, **kwargs):
        """Execute reduction kernel on input arrays.

        Parameters:
        - args: input arrays matching in_params specification
        - axis: int or tuple, axes to reduce over
        - keepdims: bool, whether to keep reduced dimensions

        Returns:
        cupy.ndarray: reduced output array
        """
```
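
The post-reduction map is what lets a reduction end with a final transform on the reduced value. The classic L2-norm kernel illustrates all six positional arguments:

```python
import cupy as cp

l2norm_kernel = cp.ReductionKernel(
    'T x',          # input params
    'T y',          # output params
    'x * x',        # map: square each element
    'a + b',        # reduce: sum the squares
    'y = sqrt(a)',  # post-reduction map: take the square root
    '0',            # identity value
    'l2norm'        # kernel name
)

x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
print(l2norm_kernel(x, axis=1))  # [ 5.477226 15.968719]
```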

### RawKernel

Create kernels with full control over CUDA code and execution parameters.

```python { .api }
class RawKernel:
    """User-defined raw kernel for maximum control over CUDA execution.

    Provides direct access to CUDA kernel launch parameters and
    complete control over kernel implementation, suitable for
    complex custom algorithms and performance optimization.
    """

    def __init__(self, code, name, backend='nvrtc', **kwargs):
        """Initialize raw kernel from CUDA source code.

        Parameters:
        - code: str, complete CUDA kernel source code
        - name: str, kernel function name in source code
        - backend: str, compilation backend ('nvrtc' or 'nvcc')
        - options: tuple, compiler options
        - jitify: bool, whether to use jitify for compilation
        - enable_cooperative_groups: bool, enable cooperative groups
        """

    def __call__(self, grid, block, args, **kwargs):
        """Execute raw kernel with specified launch configuration.

        Parameters:
        - grid: tuple, grid dimensions (gridDim)
        - block: tuple, block dimensions (blockDim)
        - args: tuple, kernel arguments
        - shared_mem: int, shared memory size in bytes
        - stream: Stream, CUDA stream for execution
        """
```
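
A minimal sketch of the launch convention: the first two call arguments are the grid and block shapes, and scalar kernel arguments are best passed as sized scalars (e.g. `cp.int32`) so they match the C signature:

```python
import cupy as cp

add = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* a, const float* b, float* y, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = a[i] + b[i];  // bounds check for the partial last block
}
''', 'my_add')

n = 1 << 20
a = cp.random.random(n, dtype=cp.float32)
b = cp.random.random(n, dtype=cp.float32)
y = cp.empty_like(a)

threads = 256
blocks = (n + threads - 1) // threads  # ceil-divide so every element is covered
add((blocks,), (threads,), (a, b, y, cp.int32(n)))
print(cp.allclose(y, a + b))  # True
```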

### RawModule

Load and manage complete CUDA modules with multiple kernels.

```python { .api }
class RawModule:
    """User-defined raw module for managing multiple CUDA kernels.

    Enables loading complete CUDA modules containing multiple
    kernel functions, constants, and device functions for
    complex GPU applications.
    """

    def __init__(self, code, backend='nvrtc', **kwargs):
        """Initialize raw module from CUDA source code.

        Parameters:
        - code: str, complete CUDA module source code
        - backend: str, compilation backend ('nvrtc' or 'nvcc')
        - options: tuple, compiler options
        - name_expressions: list of str, names of the (possibly templated) kernels to compile and expose
        - jitify: bool, whether to use jitify
        """

    def get_function(self, name):
        """Get kernel function by name.

        Parameters:
        - name: str, kernel function name

        Returns:
        RawKernel: kernel function object
        """

    def get_global(self, name):
        """Get a pointer to a global symbol (e.g., a __device__ variable) by name.

        Parameters:
        - name: str, global symbol name

        Returns:
        MemoryPointer: pointer to the symbol in device memory
        """
```
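
For C++ template kernels, `name_expressions` lists the instantiations to compile, and the same strings are later passed to `get_function`. A minimal sketch (the templated `scale` kernel here is illustrative):

```python
import cupy as cp

# Template kernels must not be wrapped in extern "C"
code = r'''
template<typename T>
__global__ void scale(T* x, T alpha, unsigned int n) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}
'''

module = cp.RawModule(code=code, options=('-std=c++11',),
                      name_expressions=['scale<float>', 'scale<double>'])
scale_float = module.get_function('scale<float>')

x = cp.arange(8, dtype=cp.float32)
scale_float((1,), (8,), (x, cp.float32(2.0), cp.uint32(x.size)))
print(x)  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```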

### Kernel Compilation and Caching

Utilities for kernel compilation, caching, and performance optimization.

```python { .api }
def memoize(for_each_device=False):
    """Decorator to memoize function results for performance.

    Parameters:
    - for_each_device: bool, whether to memoize per device

    Returns:
    callable: memoized function
    """

def clear_memo():
    """Clear memoization cache to free memory."""

def compile_with_cache(source, options=(), arch=None, cache_dir=None, **kwargs):
    """Compile CUDA source with on-disk caching of the compiled binary.

    Deprecated in recent CuPy releases in favor of RawModule.

    Parameters:
    - source: str, CUDA source code
    - options: tuple, compiler options
    - arch: str, target GPU architecture
    - cache_dir: str, directory for cache files
    - kwargs: additional compilation options

    Returns:
    compiled module object
    """
```
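
Because kernel objects compile lazily on first launch, a common pattern is to wrap kernel construction in `memoize` so each kernel is built once per device. A minimal sketch (the `get_scale_kernel` helper is illustrative):

```python
import cupy as cp

@cp.memoize(for_each_device=True)
def get_scale_kernel():
    # Construction runs once per device; later calls return the cached object
    return cp.ElementwiseKernel(
        'float32 x, float32 alpha',
        'float32 y',
        'y = alpha * x',
        'scale'
    )

x = cp.arange(4, dtype=cp.float32)
print(get_scale_kernel()(x, cp.float32(3.0)))  # [0. 3. 6. 9.]

cp.clear_memo()  # drop memoized results if the cache must be reclaimed
```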

### JIT Compilation Interface

Just-in-time compilation for dynamic kernel generation, provided by the `cupyx.jit` namespace.

```python { .api }
def rawkernel(mode='cuda', device=False):
    """Decorator for creating raw kernels from Python functions.

    Enables writing CUDA kernels using Python syntax with automatic
    compilation to CUDA C++ code.

    Parameters:
    - mode: str, typing rules applied during compilation ('cuda' or 'numpy')
    - device: bool, whether to compile a device function instead of a kernel

    Returns:
    callable: decorated kernel function
    """
```
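
A short sketch of the `cupyx.jit` interface, adapted from CuPy's documented example; the decorated function is launched like a raw kernel, with grid and block shapes as the leading call arguments:

```python
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def elementwise_copy(x, y, size):
    # Grid-stride loop in Python syntax, compiled to CUDA C++
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    ntid = jit.gridDim.x * jit.blockDim.x
    for i in range(tid, size, ntid):
        y[i] = x[i]

size = cp.uint32(2 ** 22)
x = cp.random.normal(size=(size,), dtype=cp.float32)
y = cp.empty((size,), dtype=cp.float32)

elementwise_copy((128,), (1024,), (x, y, size))  # grid, block, kernel args
print(bool((x == y).all()))  # True
```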

## Usage Examples

### Basic ElementwiseKernel

```python
import cupy as cp

# Define a custom elementwise operation
add_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y',  # Input parameters
    'float32 z',             # Output parameter
    'z = x + y * 2',         # Operation
    'custom_add'             # Kernel name
)

# Create input arrays
a = cp.array([1, 2, 3, 4], dtype=cp.float32)
b = cp.array([5, 6, 7, 8], dtype=cp.float32)

# Execute kernel
result = add_kernel(a, b)
print(result)  # [11. 14. 17. 20.]

# A more complex elementwise operation
complex_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y, float32 alpha',
    'float32 z',
    '''
    float temp = x * alpha + y;
    z = temp > 0 ? temp : 0;  // ReLU activation
    ''',
    'relu_transform'
)

result = complex_kernel(a, b, cp.float32(0.5))
```

### Custom ReductionKernel

```python
import cupy as cp

# Define a custom reduction operation (sum of squares)
sum_of_squares = cp.ReductionKernel(
    'float32 x',      # Input parameter
    'float32 out',    # Output parameter
    'x * x',          # Map expression (square each element)
    'a + b',          # Reduce expression (sum)
    'out = a',        # Post-reduction map expression
    '0',              # Identity value
    'sum_of_squares'  # Kernel name
)

# Test the kernel
data = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
result = sum_of_squares(data)
print(result)  # 55.0 (1² + 2² + 3² + 4² + 5²)

# Custom reduction with axis support
axis_result = sum_of_squares(data.reshape(1, -1), axis=1)
print(axis_result)  # [55.]

# More complex reduction: weighted mean
weighted_sum_kernel = cp.ReductionKernel(
    'float32 x, float32 w',
    'float32 out',
    'x * w',        # Multiply each value by its weight
    'a + b',        # Sum the weighted values
    'out = a',      # Post-reduction map expression
    '0',            # Identity value
    'weighted_sum'
)

values = cp.array([1, 2, 3, 4], dtype=cp.float32)
weights = cp.array([0.1, 0.2, 0.3, 0.4], dtype=cp.float32)
weighted_sum = weighted_sum_kernel(values, weights)
total_weight = cp.sum(weights)
mean = weighted_sum / total_weight
print(f"Weighted mean: {mean}")
```

### Advanced RawKernel

```python
import cupy as cp

# Define a matrix multiplication kernel
matrix_multiply_kernel = cp.RawKernel(r'''
extern "C" __global__
void matrix_multiply(const float* A, const float* B, float* C,
                     int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
''', 'matrix_multiply')

# Create test matrices
M, N, K = 1024, 1024, 512
A = cp.random.random((M, K), dtype=cp.float32)
B = cp.random.random((K, N), dtype=cp.float32)
C = cp.zeros((M, N), dtype=cp.float32)

# Configure kernel launch
block_size = (16, 16)
grid_size = ((N + block_size[0] - 1) // block_size[0],
             (M + block_size[1] - 1) // block_size[1])

# Execute kernel; scalar arguments are passed as sized scalars
# so they match the C `int` parameters
matrix_multiply_kernel(
    grid_size, block_size,
    (A, B, C, cp.int32(M), cp.int32(N), cp.int32(K))
)

# Verify result
expected = cp.dot(A, B)
print(f"Results match: {cp.allclose(C, expected)}")
```

### RawModule with Multiple Kernels

```python
import cupy as cp

# Define a module with multiple related kernels
cuda_module_code = r'''
extern "C" {

__device__ float activation_relu(float x) {
    return fmaxf(0.0f, x);
}

__device__ float activation_sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

__global__ void apply_activation(const float* input, float* output,
                                 int size, int activation_type) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        float x = input[idx];
        if (activation_type == 0) {
            output[idx] = activation_relu(x);
        } else if (activation_type == 1) {
            output[idx] = activation_sigmoid(x);
        }
    }
}

__global__ void vector_add(const float* a, const float* b,
                           float* c, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        c[idx] = a[idx] + b[idx];
    }
}

}
'''

# Load the module
module = cp.RawModule(code=cuda_module_code)

# Get kernel functions
activation_kernel = module.get_function('apply_activation')
add_kernel = module.get_function('vector_add')

# Test the activation kernel
data = cp.array([-2, -1, 0, 1, 2], dtype=cp.float32)
output = cp.zeros_like(data)

block_size = 256
grid_size = (len(data) + block_size - 1) // block_size

# Apply ReLU (activation_type=0); scalar arguments are passed as int32
# to match the C `int` parameters
activation_kernel(
    (grid_size,), (block_size,),
    (data, output, cp.int32(data.size), cp.int32(0))
)
print(f"ReLU: {output}")  # [0. 0. 0. 1. 2.]

# Apply Sigmoid (activation_type=1)
activation_kernel(
    (grid_size,), (block_size,),
    (data, output, cp.int32(data.size), cp.int32(1))
)
print(f"Sigmoid: {output}")
```

### Performance Optimization Techniques

```python
import cupy as cp
import time

# Block-level reduction kernel using shared memory
optimized_kernel = cp.RawKernel(r'''
extern "C" __global__
void optimized_reduction(const float* input, float* output, int size) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    sdata[tid] = (i < size) ? input[i] : 0;
    __syncthreads();

    // Perform tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write this block's partial sum to global memory
    if (tid == 0) output[blockIdx.x] = sdata[0];
}
''', 'optimized_reduction')

# Benchmark against CuPy's built-in sum
data = cp.random.random(1000000, dtype=cp.float32)

# Time the custom kernel
block_size = 256
grid_size = (data.size + block_size - 1) // block_size
output = cp.zeros(grid_size, dtype=cp.float32)

start_time = time.time()
for _ in range(100):
    optimized_kernel(
        (grid_size,), (block_size,),
        (data, output, cp.int32(data.size)),
        shared_mem=block_size * 4  # 4 bytes per float
    )
cp.cuda.Stream.null.synchronize()
custom_time = time.time() - start_time

# Time the built-in sum
start_time = time.time()
for _ in range(100):
    builtin_result = cp.sum(data)
cp.cuda.Stream.null.synchronize()
builtin_time = time.time() - start_time

# The kernel leaves one partial sum per block; finish the reduction here
custom_result = cp.sum(output)
print(f"Custom kernel time: {custom_time:.4f}s")
print(f"Built-in sum time: {builtin_time:.4f}s")
print(f"Results match: {cp.allclose(custom_result, builtin_result)}")
```

### Memory-Efficient Patterns

```python
import cupy as cp

# In-place operation kernel: the output parameter y can be bound to the
# same array that is passed as the input x
inplace_kernel = cp.ElementwiseKernel(
    'float32 x, float32 alpha',
    'float32 y',
    'y = x * alpha + 1',
    'inplace_transform'
)

# Create data
data = cp.random.random(1000000, dtype=cp.float32)
original_ptr = data.data.ptr

# Apply the transformation in place by supplying data as the output array
inplace_kernel(data, cp.float32(2.0), data)

# Verify the result occupies the same memory
print(f"Same memory location: {data.data.ptr == original_ptr}")

# Kernel with multiple outputs
multi_output_kernel = cp.ElementwiseKernel(
    'float32 x',
    'float32 sin_x, float32 cos_x, float32 tan_x',
    '''
    sin_x = sinf(x);
    cos_x = cosf(x);
    tan_x = tanf(x);
    ''',
    'trig_functions'
)

# Compute multiple trigonometric functions simultaneously
angles = cp.linspace(0, 2 * cp.pi, 1000, dtype=cp.float32)
sin_vals, cos_vals, tan_vals = multi_output_kernel(angles)

print(f"Identity check: {cp.allclose(sin_vals**2 + cos_vals**2, 1.0)}")
```