
# CUDA Integration

CuPy provides comprehensive CUDA integration for advanced GPU programming: direct device management, memory operations, kernel execution, stream processing, and low-level CUDA API access, all geared toward high-performance computing applications.

## Capabilities

### Device Management

Core CUDA device management for controlling GPU devices and execution contexts.

```python { .api }
class Device:
    """
    CUDA device context manager.

    This class provides a convenient interface for managing CUDA device
    contexts and switching between multiple GPUs.
    """
    def __init__(self, device=None):
        """
        Parameters:
            device: int or Device, optional - CUDA device ID or Device object
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def use(self):
        """Use this device for subsequent operations."""

    @property
    def id(self):
        """Get the device ID."""

def get_device_id():
    """
    Get the current CUDA device ID.
    """

def get_cublas_handle():
    """
    Get the cuBLAS handle for the current device.
    """

def synchronize():
    """
    Synchronize the current device.
    """

def is_available():
    """
    Check if CUDA is available.
    """
```

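A minimal usage sketch of the device API above, assuming at least one CUDA device is present:

```python
import cupy as cp
import cupy.cuda as cuda

if cuda.is_available():
    # Make device 0 current for the duration of the block
    with cuda.Device(0) as dev:
        print(f"Active device: {dev.id}")   # same value as cuda.get_device_id()
        handle = cuda.get_cublas_handle()   # cuBLAS handle bound to this device
        x = cp.arange(3)                    # allocated on device 0
    cuda.synchronize()                      # wait for pending work to finish
```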

### Memory Management

Comprehensive memory management for GPU device memory, including allocators and memory pools.

```python { .api }
def alloc(size):
    """
    Allocate device memory.

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_managed(size):
    """
    Allocate managed memory (Unified Memory).

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_async(size):
    """
    Allocate memory asynchronously.

    Parameters:
        size: int - Size in bytes to allocate
    """

class BaseMemory:
    """
    Base class for memory objects.

    This is the base class for all memory types in CuPy,
    providing a common interface for memory management.
    """
    def __init__(self, size): ...

    @property
    def ptr(self):
        """Get memory pointer."""

    @property
    def size(self):
        """Get memory size in bytes."""

class Memory(BaseMemory):
    """
    Device memory object.

    Represents a chunk of device memory allocated on the GPU.
    """

class ManagedMemory(BaseMemory):
    """
    Managed memory object.

    Represents unified memory accessible from both CPU and GPU.
    """

class MemoryAsync(BaseMemory):
    """
    Asynchronous memory object.

    Represents memory allocated asynchronously using memory pools.
    """

class MemoryPointer:
    """
    Pointer to a device memory region.

    This class represents a pointer to device memory and provides
    methods for accessing and manipulating memory contents.
    """
    def __init__(self, mem, offset, size, owner=None): ...

    def copy_from_device(self, src, size): ...
    def copy_from_device_async(self, src, size, stream=None): ...
    def copy_from_host(self, mem, size): ...
    def copy_from_host_async(self, mem, size, stream=None): ...
    def copy_to_host(self, mem, size): ...
    def copy_to_host_async(self, mem, size, stream=None): ...
    def memset(self, value, size): ...
    def memset_async(self, value, size, stream=None): ...

class UnownedMemory:
    """
    Unowned memory reference.

    Represents a reference to memory that is not owned by this object,
    useful for wrapping external memory allocations.
    """
```

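Raw allocations are most useful when wrapped in an array. A short sketch, assuming (as in current CuPy) that `alloc` returns a `MemoryPointer` that `cupy.ndarray` accepts via its `memptr` argument:

```python
import cupy as cp
import cupy.cuda as cuda

n = 1024
memptr = cuda.alloc(n * 4)  # raw device allocation for n float32 values

# View the raw memory as an array; the pointer keeps the allocation alive
arr = cp.ndarray((n,), dtype=cp.float32, memptr=memptr)
arr.fill(1.0)

# Or operate on the buffer directly through the pointer
memptr.memset(0, n * 4)
```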

### Memory Pools

Memory pooling systems for efficient memory allocation and reuse.

```python { .api }
class MemoryPool:
    """
    Memory pool for device memory.

    Memory pools reduce allocation overhead by reusing previously
    allocated memory blocks.
    """
    def __init__(self, allocator=None):
        """
        Parameters:
            allocator: function, optional - Custom allocator function
        """

    def malloc(self, size): ...
    def free(self, mem): ...
    def free_all_blocks(self): ...
    def free_all_free(self): ...
    def n_free_blocks(self): ...
    def used_bytes(self): ...
    def total_bytes(self): ...

class MemoryAsyncPool:
    """
    Asynchronous memory pool.

    Provides asynchronous memory allocation with stream ordering.
    """
    def __init__(self, allocator=None): ...

def set_allocator(allocator):
    """
    Set the memory allocator.

    Parameters:
        allocator: function or Allocator - Memory allocator to use
    """

def get_allocator():
    """
    Get the current memory allocator.
    """

class PythonFunctionAllocator:
    """
    Memory allocator using a Python function.

    Wraps a Python function to provide custom memory allocation.
    """
    def __init__(self, func, arg): ...

class CFunctionAllocator:
    """
    Memory allocator using a C function.

    Wraps a C function pointer for memory allocation.
    """
    def __init__(self, func, arg): ...

def using_allocator(allocator=None):
    """
    Context manager for temporarily using a different allocator.

    Parameters:
        allocator: Allocator, optional - Allocator to use temporarily
    """
```

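A sketch of installing a pool as the process-wide allocator via `set_allocator` and inspecting its statistics:

```python
import cupy as cp
import cupy.cuda as cuda

pool = cuda.MemoryPool()
cuda.set_allocator(pool.malloc)  # route all device allocations through the pool

a = cp.ones((1000, 1000), dtype=cp.float32)
print(f"used: {pool.used_bytes()} B, total: {pool.total_bytes()} B, "
      f"free blocks: {pool.n_free_blocks()}")

del a
pool.free_all_blocks()  # return cached blocks to the driver
```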

### Pinned Memory

Host-side pinned memory management for efficient host-device transfers.

```python { .api }
def alloc_pinned_memory(size):
    """
    Allocate pinned host memory.

    Parameters:
        size: int - Size in bytes to allocate
    """

class PinnedMemory:
    """
    Pinned host memory object.

    Represents page-locked host memory that can be accessed
    by the GPU for faster transfers.
    """
    def __init__(self, size): ...

class PinnedMemoryPointer:
    """
    Pointer to a pinned memory region.

    Provides an interface for accessing pinned memory contents.
    """
    def __init__(self, mem, offset, size, owner): ...

class PinnedMemoryPool:
    """
    Memory pool for pinned memory.

    Manages allocation and reuse of pinned host memory.
    """
    def __init__(self, allocator=None): ...
    def malloc(self, size): ...
    def free(self, mem): ...

def set_pinned_memory_allocator(allocator):
    """
    Set the pinned memory allocator.

    Parameters:
        allocator: function - Pinned memory allocator function
    """
```

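Pinned memory is host memory, so the usual pattern is to expose it as a NumPy array through the buffer protocol and stage data there before an asynchronous transfer. A sketch:

```python
import numpy as np
import cupy as cp
import cupy.cuda as cuda

n = 1000
pinned = cuda.alloc_pinned_memory(n * 8)                 # page-locked host buffer
host = np.frombuffer(pinned, dtype=np.float64, count=n)  # zero-copy NumPy view
host[...] = np.random.rand(n)

stream = cuda.Stream()
dev = cp.empty(n, dtype=cp.float64)
dev.set(host, stream=stream)  # H2D copy can proceed asynchronously from pinned memory
stream.synchronize()
```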

### Streams and Events

CUDA streams and events for managing asynchronous operations and synchronization.

```python { .api }
class Stream:
    """
    CUDA stream for asynchronous operations.

    Streams allow operations to be executed asynchronously and
    can be used to overlap computation and memory transfers.
    """
    def __init__(self, null=False, non_blocking=False, ptds=False):
        """
        Parameters:
            null: bool, optional - Use the default stream if True
            non_blocking: bool, optional - Create a non-blocking stream
            ptds: bool, optional - Use per-thread default stream
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def synchronize(self): ...
    def add_callback(self, callback, arg): ...
    def record(self, event=None): ...
    def wait_event(self, event): ...

    @property
    def ptr(self):
        """Get the stream pointer."""

class ExternalStream:
    """
    Wrapper for an external CUDA stream.

    Allows integration with CUDA streams created outside of CuPy.
    """
    def __init__(self, ptr): ...

def get_current_stream():
    """
    Get the current CUDA stream.
    """

class Event:
    """
    CUDA event for synchronization.

    Events provide a way to monitor the progress of operations
    and synchronize between different streams.
    """
    def __init__(self, block=True, disable_timing=False, interprocess=False):
        """
        Parameters:
            block: bool, optional - Use blocking synchronization
            disable_timing: bool, optional - Disable timing measurements
            interprocess: bool, optional - Enable interprocess usage
        """

    def record(self, stream=None): ...
    def synchronize(self): ...
    def query(self): ...
    def elapsed_time(self, end_event): ...

def get_elapsed_time(start_event, end_event):
    """
    Get elapsed time between events.

    Parameters:
        start_event: Event - Start event
        end_event: Event - End event
    """
```

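`ExternalStream` lets CuPy enqueue work on a stream it did not create. A sketch of interop with PyTorch; the `torch` calls are an assumption here, and any raw `cudaStream_t` pointer would work the same way:

```python
import cupy as cp
import torch  # assumed available with CUDA support

# Borrow the stream PyTorch is currently using
ptr = torch.cuda.current_stream().cuda_stream
with cp.cuda.ExternalStream(ptr):
    x = cp.random.rand(1000)
    y = x * 2  # runs on the borrowed stream, ordered with PyTorch's work
```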

### CUDA Graphs

CUDA graphs for optimizing sequences of operations.

```python { .api }
class Graph:
    """
    CUDA graph for capturing and replaying operation sequences.

    Graphs allow capturing a sequence of CUDA operations and
    replaying them efficiently with reduced launch overhead.
    """
    def __init__(self): ...

    def begin_capture(self, stream=None): ...
    def end_capture(self, stream=None): ...
    def launch(self, stream=None): ...
    def debug_dot_print(self, path): ...
```

### Kernels and Modules

CUDA kernel compilation and execution management.

```python { .api }
class Function:
    """
    CUDA function object.

    Represents a compiled CUDA kernel function that can be launched
    with specified grid and block dimensions.
    """
    def __init__(self, module, name):
        """
        Parameters:
            module: Module - CUDA module containing the function
            name: str - Function name
        """

    def __call__(self, grid, block, args, **kwargs): ...

    @property
    def attributes(self):
        """Get function attributes."""

class Module:
    """
    CUDA module containing compiled device code.

    Modules contain one or more CUDA kernels and can be loaded
    from PTX or CUBIN code.
    """
    def __init__(self): ...

    def get_function(self, name): ...
    def get_global(self, name): ...
    def get_texref(self, name): ...

    @classmethod
    def load_file(cls, filename): ...

    @classmethod
    def load_from_string(cls, source): ...
```

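A sketch of loading a prebuilt module and launching one of its kernels through the API above. The file name `vector_ops.ptx`, the kernel name `scale`, and its signature are hypothetical:

```python
import cupy as cp
import cupy.cuda as cuda

# Load device code compiled ahead of time (hypothetical PTX file)
module = cuda.Module.load_file("vector_ops.ptx")
scale = module.get_function("scale")  # launchable Function object

n = 4096
x = cp.random.rand(n, dtype=cp.float32)

# Launch: grid dims, block dims, then the argument tuple;
# scalars are passed as sized types to match the kernel signature
threads = 256
blocks = (n + threads - 1) // threads
scale((blocks,), (threads,), (x, cp.float32(2.0), cp.int32(n)))
```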

### Memory Hooks

Hooks for monitoring and controlling memory allocation behavior.

```python { .api }
class MemoryHook:
    """
    Base class for memory allocation hooks.

    Memory hooks allow monitoring and customization of memory
    allocation and deallocation operations.
    """
    def alloc_preprocess(self, **kwargs): ...
    def alloc_postprocess(self, mem): ...
    def free_preprocess(self, mem): ...
    def free_postprocess(self, mem): ...
```

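In CuPy, memory hooks are activated as context managers; while one is entered, pool allocations invoke its callbacks. A minimal sketch against the base class above (exact callback signatures vary across CuPy versions):

```python
import cupy as cp
import cupy.cuda as cuda

class LoggingHook(cuda.MemoryHook):
    name = "logging-hook"

    def alloc_preprocess(self, **kwargs):
        # kwargs typically carry the device ID and the requested size
        print(f"about to allocate: {kwargs}")

# While the hook is active, pool allocations trigger its callbacks
with LoggingHook():
    a = cp.zeros(1024)
```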

### Profiling and Debugging

Tools for profiling and debugging CUDA applications.

```python { .api }
def profile():
    """
    Context manager for CUDA profiling (deprecated).

    Note: This is deprecated. Use cupyx.profiler.profile() instead.
    """
```

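The non-deprecated replacement lives in `cupyx.profiler`. A minimal sketch; run the script under Nsight Systems (or another CUDA profiler) so the marked region is captured:

```python
import cupy as cp
from cupyx.profiler import profile

# Only work inside the context manager is recorded by the profiler
with profile():
    x = cp.random.rand(1000, 1000)
    y = x @ x
    cp.cuda.synchronize()
```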

### Environment Information

Functions for querying CUDA runtime and environment information.

```python { .api }
def get_local_runtime_version():
    """
    Get the local CUDA runtime version.
    """

def get_cuda_path():
    """
    Get the CUDA installation path.
    """

def get_nvcc_path():
    """
    Get the path to the nvcc compiler.
    """

def get_rocm_path():
    """
    Get the ROCm installation path (for AMD GPUs).
    """

def get_hipcc_path():
    """
    Get the path to the hipcc compiler (for AMD GPUs).
    """
```

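A quick sketch that prints the environment details exposed by the API above; path lookups may return None when a component is not installed:

```python
import cupy.cuda as cuda

print("CUDA runtime version:", cuda.get_local_runtime_version())
print("CUDA path:", cuda.get_cuda_path())
print("nvcc path:", cuda.get_nvcc_path())
```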
471

472

### Low-level API Access

473

474

Access to low-level CUDA APIs for advanced users.

475

476

```python { .api }

477

# CUDA Driver API

478

driver = cupy.cuda.driver

479

480

# CUDA Runtime API

481

runtime = cupy.cuda.runtime

482

483

# NVRTC Compiler API

484

nvrtc = cupy.cuda.nvrtc

485

486

# Backend library wrappers (lazy-loaded)

487

cublas = cupy.cuda.cublas # cuBLAS operations

488

cusolver = cupy.cuda.cusolver # cuSOLVER linear algebra

489

cusparse = cupy.cuda.cusparse # cuSPARSE sparse operations

490

curand = cupy.cuda.curand # cuRAND random numbers

491

nvtx = cupy.cuda.nvtx # NVTX profiling markers

492

```

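A small sketch using the runtime wrapper to enumerate devices and query free memory; these wrappers mirror the corresponding CUDA runtime calls:

```python
import cupy as cp
from cupy.cuda import runtime

for dev_id in range(runtime.getDeviceCount()):
    with cp.cuda.Device(dev_id):
        free_b, total_b = runtime.memGetInfo()  # cudaMemGetInfo
        print(f"GPU {dev_id}: {free_b / 2**20:.0f} MiB free "
              f"of {total_b / 2**20:.0f} MiB")
```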

## Usage Examples

```python
import cupy as cp
import cupy.cuda as cuda
import numpy as np

# Device management
print(f"Current device: {cuda.get_device_id()}")
print(f"CUDA available: {cuda.is_available()}")

# Using specific devices
with cuda.Device(0):
    # Operations on device 0
    x = cp.array([1, 2, 3])

with cuda.Device(1):  # If multiple GPUs are available
    # Operations on device 1
    y = cp.array([4, 5, 6])

# Memory management
# Direct memory allocation
mem = cuda.alloc(1024)  # Allocate 1 KB
ptr = cuda.MemoryPointer(mem, 0, 1024)

# Using memory pools (recommended)
pool = cuda.MemoryPool()
with cuda.using_allocator(pool.malloc):
    # All allocations use the pool
    large_array = cp.zeros((10000, 10000))

# Memory pool statistics
print(f"Used memory: {pool.used_bytes()} bytes")
print(f"Total memory: {pool.total_bytes()} bytes")

# Stream management for asynchronous operations
stream1 = cuda.Stream()
stream2 = cuda.Stream()

with stream1:
    # Operations executed on stream1
    a = cp.random.rand(1000, 1000)
    b = cp.random.rand(1000, 1000)

with stream2:
    # Operations executed on stream2 (can overlap with stream1)
    c = cp.random.rand(1000, 1000)
    d = cp.random.rand(1000, 1000)

# Synchronization
stream1.synchronize()  # Wait for stream1 to complete
stream2.synchronize()  # Wait for stream2 to complete

# Event-based synchronization
event = cuda.Event()
with stream1:
    result1 = cp.dot(a, b)
    event.record()  # Record completion of operations

with stream2:
    stream2.wait_event(event)  # Wait for stream1 operations
    result2 = cp.dot(c, d) + result1  # Uses the result from stream1

# Measuring execution time with events
start_event = cuda.Event()
end_event = cuda.Event()

start_event.record()
# Some operations
large_computation = cp.dot(cp.random.rand(5000, 5000),
                           cp.random.rand(5000, 5000))
end_event.record()
end_event.synchronize()

elapsed_ms = cuda.get_elapsed_time(start_event, end_event)
print(f"Computation took {elapsed_ms} ms")

# Pinned memory for faster transfers
pinned_mem = cuda.alloc_pinned_memory(1000 * 8)  # 1000 float64s
# Pinned memory is host memory: expose it as a NumPy array through the
# buffer protocol rather than wrapping it in a device MemoryPointer
pinned_array = np.frombuffer(pinned_mem, dtype=np.float64, count=1000)

# Custom kernel example using RawKernel
kernel_code = r'''
extern "C" __global__
void vector_add(float* x, float* y, float* z, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        z[tid] = x[tid] + y[tid];
    }
}
'''

kernel = cp.RawKernel(kernel_code, 'vector_add')

# Launch custom kernel
n = 1000
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.zeros(n, dtype=cp.float32)

# Launch with an appropriate grid/block size; pass the scalar as a
# 32-bit integer to match the kernel's `int n` parameter
threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
kernel((blocks_per_grid,), (threads_per_block,), (x, y, z, cp.int32(n)))

# Memory hooks for monitoring
class MemoryTracker(cuda.MemoryHook):
    def __init__(self):
        self.allocated_bytes = 0
        self.freed_bytes = 0

    def alloc_postprocess(self, mem):
        self.allocated_bytes += mem.size
        print(f"Allocated {mem.size} bytes")

    def free_preprocess(self, mem):
        self.freed_bytes += mem.size
        print(f"Freed {mem.size} bytes")

tracker = MemoryTracker()
# Note: Memory hooks integration depends on CuPy version

# Working with CUDA graphs (CUDA 10.0+)
if hasattr(cuda, 'Graph'):
    graph = cuda.Graph()

    # Capture operations in a graph
    stream = cuda.Stream()
    with stream:
        graph.begin_capture(stream)

        # Operations to be captured
        x = cp.random.rand(1000, 1000)
        y = cp.random.rand(1000, 1000)
        z = x @ y

        graph.end_capture(stream)

    # Replay the graph multiple times efficiently
    for _ in range(10):
        graph.launch(stream)
    stream.synchronize()

# Multi-GPU computation example
def multi_gpu_computation(data_list):
    """Distribute computation across multiple GPUs."""
    n_gpus = cuda.runtime.getDeviceCount()
    streams = []
    results = []

    for i, data in enumerate(data_list[:n_gpus]):
        device_id = i % n_gpus
        with cuda.Device(device_id):
            stream = cuda.Stream()
            streams.append(stream)

            with stream:
                # Transfer data to this GPU
                gpu_data = cp.asarray(data)
                # Perform computation
                result = cp.sum(gpu_data ** 2)
                results.append(result)

    # Synchronize all streams
    for stream in streams:
        stream.synchronize()

    return results

# Memory bandwidth benchmark
def memory_bandwidth_test(size_mb=100):
    """Compare host-to-device bandwidth for pageable vs. pinned memory."""
    import time

    size_bytes = size_mb * 1024 * 1024
    n = size_bytes // 8

    # Regular (pageable) host memory
    host_data = np.random.rand(n)
    start = time.time()
    gpu_data1 = cp.asarray(host_data)
    cp.cuda.synchronize()
    regular_time = time.time() - start

    # Pinned host memory: stage the data in a page-locked buffer first
    pinned_mem = cuda.alloc_pinned_memory(size_bytes)
    pinned_data = np.frombuffer(pinned_mem, dtype=np.float64, count=n)
    pinned_data[...] = host_data
    start = time.time()
    gpu_data2 = cp.asarray(pinned_data)
    cp.cuda.synchronize()
    pinned_time = time.time() - start

    print(f"Pageable memory bandwidth: {size_mb / regular_time:.2f} MB/s")
    print(f"Pinned memory bandwidth:   {size_mb / pinned_time:.2f} MB/s")

# Advanced memory pool configuration
def configure_memory_pool():
    """Configure the memory pool for optimal performance."""
    # Get the default memory pool
    mempool = cp.get_default_memory_pool()

    # Optionally cap the pool size
    # mempool.set_limit(size=2**30)  # Limit to 1 GiB

    # Monitor memory usage
    print(f"Used bytes: {mempool.used_bytes()}")
    print(f"Total bytes: {mempool.total_bytes()}")

    # Release cached blocks that are no longer in use
    mempool.free_all_blocks()

    return mempool

# Context management for robust error handling
def safe_gpu_computation():
    """Example of robust GPU computation with proper cleanup."""
    stream = None
    temp_arrays = []

    try:
        stream = cuda.Stream()

        with stream:
            # Temporary arrays that need cleanup
            temp1 = cp.random.rand(10000, 10000)
            temp2 = cp.random.rand(10000, 10000)
            temp_arrays.extend([temp1, temp2])

            # Main computation
            result = temp1 @ temp2

            # Synchronize to ensure completion
            stream.synchronize()

            return result

    except Exception as e:
        print(f"GPU computation failed: {e}")
        return None

    finally:
        # Cleanup resources
        if stream:
            stream.synchronize()

        # Drop references and return cached blocks to the pool
        del temp_arrays
        cp.get_default_memory_pool().free_all_blocks()
```

## Performance Optimization Tips

### Memory Management

```python
# Use memory pools to reduce allocation overhead
with cuda.using_allocator(cp.get_default_memory_pool().malloc):
    # All allocations reuse memory from the pool
    data = cp.zeros((10000, 10000))

# Pre-allocate large arrays when possible
workspace = cp.zeros((10000, 10000))  # Reuse this array

# Use appropriate memory types
regular_mem = cuda.alloc(1024)           # Regular device memory
managed_mem = cuda.malloc_managed(1024)  # Unified memory
```

### Stream Optimization

```python
# Overlap computation and memory transfers
# (host_data, current_data, and process_current_data are placeholders)
compute_stream = cuda.Stream()
transfer_stream = cuda.Stream()

with transfer_stream:
    # Asynchronous memory transfer; true overlap requires pinned host memory
    next_data = cp.asarray(host_data)

with compute_stream:
    # Computation that can proceed in parallel with the transfer
    result = process_current_data(current_data)
```

### Kernel Launch Optimization

```python
# Choose optimal grid/block dimensions
def optimal_launch_config(n, max_threads_per_block=1024):
    """Calculate an optimal CUDA launch configuration."""
    if n <= max_threads_per_block:
        return (1, n)
    else:
        threads_per_block = max_threads_per_block
        blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
        return (blocks_per_grid, threads_per_block)

grid, block = optimal_launch_config(1000000)
```

CuPy's CUDA integration delivers comprehensive low-level GPU programming capabilities: advanced memory management, asynchronous execution, custom kernel development, and performance optimization for high-performance computing, all while remaining compatible with the broader CUDA ecosystem.