
# Performance Profiling

CuPy provides performance profiling and benchmarking tools through the `cupyx.profiler` module. They let developers measure execution times, analyze GPU utilization, identify performance bottlenecks, and optimize CUDA applications.

## Capabilities

### Timing and Benchmarking

Core timing utilities for measuring the execution performance of CuPy operations and custom kernels.

```python { .api }
def benchmark(func, args=(), kwargs=None, n_warmup=1, n_repeat=3, n_run=1):
    """
    Benchmark a function with comprehensive timing statistics.

    Executes the function multiple times and provides detailed
    timing statistics including mean, standard deviation, min/max
    execution times, and GPU/CPU timing analysis.

    Parameters:
        func: callable - Function to benchmark
        args: tuple, optional - Positional arguments for function
        kwargs: dict, optional - Keyword arguments for function
        n_warmup: int, optional - Number of warmup runs (default 1)
        n_repeat: int, optional - Number of timing repetitions (default 3)
        n_run: int, optional - Number of function calls per repetition (default 1)

    Returns:
        dict: Benchmark results with timing statistics
    """

def time_range():
    """
    Context manager for measuring execution time ranges.

    Returns a context manager that measures the time between
    entry and exit, accounting for GPU synchronization.

    Returns:
        TimeRangeContext: Context manager for timing
    """

class TimeRangeContext:
    """
    Context manager for timing code execution ranges.

    Provides precise timing measurements for GPU operations
    with proper synchronization handling.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def elapsed_time(self):
        """Get elapsed time in seconds."""

def profile():
    """
    Context manager for comprehensive profiling.

    Enables detailed profiling including NVTX markers,
    memory usage tracking, and kernel execution analysis.

    Returns:
        ProfileContext: Context manager for profiling
    """

class ProfileContext:
    """
    Context manager for comprehensive performance profiling.

    Collects detailed performance metrics including timing,
    memory usage, kernel launches, and GPU utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    def print_report(self):
        """Print detailed profiling report."""

    def save_report(self, filename):
        """Save profiling report to file."""
```
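
The roles of `n_warmup`, `n_repeat`, and `n_run` can be illustrated without a GPU. The following is a minimal sketch under stated assumptions, not CuPy's implementation: `simple_benchmark` and its `sync` parameter are hypothetical names introduced here. It uses host timers only. Because GPU kernels run asynchronously, timing device work would need a synchronizing callable (for example `cp.cuda.Stream.null.synchronize`) passed as `sync`.

```python
import statistics
import time


def simple_benchmark(func, args=(), kwargs=None, n_warmup=1, n_repeat=3,
                     n_run=1, sync=lambda: None):
    """Time func on the host; `sync` is a hypothetical hook for GPU work."""
    kwargs = kwargs or {}

    # Warmup runs absorb one-time costs such as compilation or caching.
    for _ in range(n_warmup):
        func(*args, **kwargs)
    sync()

    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        for _ in range(n_run):
            func(*args, **kwargs)
        sync()  # wait for pending work before reading the clock
        times.append((time.perf_counter() - start) / n_run)

    return {
        "mean": statistics.fmean(times),
        "std": statistics.pstdev(times),
        "min": min(times),
        "max": max(times),
        "times": times,
    }


stats = simple_benchmark(sum, args=(range(10_000),), n_warmup=2, n_repeat=5)
print(f"mean {stats['mean']:.6f}s, min {stats['min']:.6f}s")
```

Each recorded time is the average over `n_run` calls, which helps when a single call is too short to time reliably.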

### Memory Profiling

Tools for analyzing GPU memory usage patterns and identifying memory bottlenecks.

```python { .api }
def get_memory_info():
    """
    Get current GPU memory usage information.

    Returns:
        dict: Memory usage statistics including total, used, and free memory
    """

def memory_profile():
    """
    Context manager for memory usage profiling.

    Tracks memory allocations and deallocations during execution
    to identify memory usage patterns and potential leaks.

    Returns:
        MemoryProfileContext: Context manager for memory profiling
    """

class MemoryProfileContext:
    """
    Context manager for tracking memory usage patterns.

    Monitors GPU memory allocations, deallocations, and peak usage
    during code execution.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def peak_memory(self):
        """Peak memory usage during profiling."""

    @property
    def memory_allocations(self):
        """List of memory allocation events."""

    def print_memory_report(self):
        """Print detailed memory usage report."""

def trace_memory(enabled=True):
    """
    Enable or disable memory allocation tracing.

    Parameters:
        enabled: bool - Whether to enable memory tracing
    """

def get_memory_trace():
    """
    Get memory allocation trace information.

    Returns:
        list: Memory allocation trace events
    """
```
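
Peak usage can be derived from a sequence of allocation and free events, which is the kind of bookkeeping a memory profiler performs. The sketch below is illustrative only: the `(kind, nbytes)` event layout and the `peak_from_trace` helper are assumptions made here, not the format returned by `get_memory_trace`.

```python
def peak_from_trace(events):
    """Return (peak_bytes, final_bytes) for a list of (kind, nbytes) events.

    `kind` is "alloc" or "free"; this event layout is hypothetical.
    """
    current = 0
    peak = 0
    for kind, nbytes in events:
        if kind == "alloc":
            current += nbytes
        elif kind == "free":
            current -= nbytes
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        peak = max(peak, current)
    return peak, current


# Three 8 MiB buffers are allocated, two are released, then one more is added.
MIB = 1024 ** 2
trace = [
    ("alloc", 8 * MIB), ("alloc", 8 * MIB), ("alloc", 8 * MIB),
    ("free", 8 * MIB), ("free", 8 * MIB),
    ("alloc", 8 * MIB),
]
peak, final = peak_from_trace(trace)
print(f"peak: {peak // MIB} MiB, still allocated: {final // MIB} MiB")
```

A final value that stays above zero after all arrays should have been released is one possible sign of a leak. With a pooling allocator, freed blocks may be cached rather than returned to the device.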

### NVTX Integration

NVIDIA Tools Extension (NVTX) integration for advanced profiling with external tools.

```python { .api }
def nvtx_push(message, color=None):
    """
    Push an NVTX range marker.

    Creates a named range marker for profiling tools like Nsight
    to identify code sections and their performance characteristics.

    Parameters:
        message: str - Range description
        color: int, optional - Color code for the range
    """

def nvtx_pop():
    """Pop the most recent NVTX range marker."""

def nvtx_mark(message, color=None):
    """
    Create an NVTX point marker.

    Parameters:
        message: str - Marker description
        color: int, optional - Color code for the marker
    """

def nvtx_range_push(message, color=None):
    """
    Push a named NVTX range (alias for nvtx_push).

    Parameters:
        message: str - Range name
        color: int, optional - Color code
    """

def nvtx_range_pop():
    """Pop the current NVTX range (alias for nvtx_pop)."""

class NVTXRange:
    """
    Context manager for NVTX range markers.

    Automatically pushes and pops NVTX range markers for
    convenient profiling of code blocks.
    """
    def __init__(self, message, color=None):
        """
        Parameters:
            message: str - Range description
            color: int, optional - Color code
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

def nvtx(message=None, color=None):
    """
    Decorator or context manager for NVTX range marking.

    Can be used as a decorator for functions or as a context manager
    for code blocks to automatically add NVTX markers.

    Parameters:
        message: str, optional - Range description
        color: int, optional - Color code
    """
```
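
One way a single object can serve as both a decorator and a context manager is `contextlib.contextmanager`, whose return values also work as decorators. The sketch below only demonstrates that push/pop pairing. `marked_range` and the `events` list are hypothetical stand-ins: a real implementation would issue NVTX calls where this one appends to a list.

```python
from contextlib import contextmanager

events = []  # stands in for the profiler timeline


@contextmanager
def marked_range(message):
    """Record a push on entry and a matching pop on exit."""
    events.append(("push", message))
    try:
        yield
    finally:
        # The pop runs even if the wrapped code raises or returns early.
        events.append(("pop", message))


@marked_range("outer")          # decorator form
def work():
    with marked_range("inner"):  # context-manager form
        return 42


result = work()
for kind, message in events:
    print(kind, message)
```

The `try`/`finally` keeps ranges balanced, which matters because an unmatched push leaves every later range nested under it in the profiler view.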

### Kernel Performance Analysis

Tools for analyzing individual kernel performance and optimization opportunities.

```python { .api }
def kernel_profile():
    """
    Context manager for kernel-specific profiling.

    Tracks individual kernel launches, execution times,
    and performance characteristics.

    Returns:
        KernelProfileContext: Context manager for kernel profiling
    """

class KernelProfileContext:
    """
    Context manager for detailed kernel performance analysis.

    Collects metrics for individual kernel launches including
    execution time, occupancy, memory throughput, and compute utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def kernel_stats(self):
        """Statistics for executed kernels."""

    def print_kernel_report(self):
        """Print detailed kernel analysis report."""

def get_kernel_info(kernel):
    """
    Get information about a compiled kernel.

    Parameters:
        kernel: RawKernel or similar - Kernel object

    Returns:
        dict: Kernel information including occupancy and resource usage
    """

def analyze_occupancy(kernel, block_size, shared_mem=0):
    """
    Analyze theoretical occupancy for a kernel configuration.

    Parameters:
        kernel: kernel object - Kernel to analyze
        block_size: int - Block size (threads per block)
        shared_mem: int, optional - Shared memory usage per block

    Returns:
        dict: Occupancy analysis results
    """
```
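
Theoretical occupancy is the ratio of warps that can be resident on a streaming multiprocessor (SM) to the maximum the SM supports. The sketch below shows that arithmetic for two of the limiting resources, threads and shared memory. It is a simplified illustration, not the implementation of `analyze_occupancy`: `estimate_occupancy` is a hypothetical helper, the default hardware limits are assumed example values rather than values queried from a device, and register usage and allocation granularity are ignored.

```python
import math


def estimate_occupancy(block_size, shared_mem=0, *, warp_size=32,
                       max_threads_per_sm=2048, max_blocks_per_sm=32,
                       shared_mem_per_sm=64 * 1024):
    """Estimate theoretical occupancy from thread and shared-memory limits.

    The default limits are assumed example values, not read from a device,
    and register usage is ignored.
    """
    warps_per_block = math.ceil(block_size / warp_size)
    max_warps_per_sm = max_threads_per_sm // warp_size

    # Each resource caps how many blocks can be resident on one SM.
    limit_by_warps = max_warps_per_sm // warps_per_block
    limit_by_smem = (shared_mem_per_sm // shared_mem
                     if shared_mem > 0 else max_blocks_per_sm)
    blocks_per_sm = min(max_blocks_per_sm, limit_by_warps, limit_by_smem)

    active_warps = blocks_per_sm * warps_per_block
    return {
        "theoretical_occupancy": active_warps / max_warps_per_sm,
        "blocks_per_sm": blocks_per_sm,
        "threads_per_sm": blocks_per_sm * block_size,
    }


info = estimate_occupancy(block_size=256, shared_mem=256 * 4)
print(f"{info['theoretical_occupancy']:.2%} with {info['blocks_per_sm']} blocks per SM")
```

Under these assumed limits, raising shared memory to 32 KiB per block would allow only two resident blocks per SM, so shared memory rather than thread count would become the limiting resource.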

### Comparative Benchmarking

Tools for comparing performance between different implementations and configurations.

```python { .api }
def compare_implementations(*funcs, args=(), kwargs=None, names=None):
    """
    Compare performance of multiple function implementations.

    Benchmarks multiple functions with identical inputs and provides
    comparative analysis of their performance characteristics.

    Parameters:
        *funcs: callable - Functions to compare
        args: tuple, optional - Arguments for all functions
        kwargs: dict, optional - Keyword arguments for all functions
        names: list, optional - Names for each function

    Returns:
        dict: Comparative benchmark results
    """

def parameter_sweep(func, param_ranges, fixed_args=(), fixed_kwargs=None):
    """
    Perform parameter sweep benchmarking.

    Tests function performance across different parameter values
    to identify optimal configurations.

    Parameters:
        func: callable - Function to benchmark
        param_ranges: dict - Parameter names and value ranges
        fixed_args: tuple, optional - Fixed positional arguments
        fixed_kwargs: dict, optional - Fixed keyword arguments

    Returns:
        dict: Parameter sweep results
    """

def scaling_analysis(func, data_sizes, *args, **kwargs):
    """
    Analyze performance scaling with different data sizes.

    Parameters:
        func: callable - Function to analyze
        data_sizes: list - Different input sizes to test
        *args: Additional function arguments
        **kwargs: Additional function keyword arguments

    Returns:
        dict: Scaling analysis results
    """
```
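
A parameter sweep amounts to timing a function over the Cartesian product of the candidate values. The following is a minimal host-side sketch of that idea, assuming nothing about how `parameter_sweep` is implemented: `sweep` and `chunked_sum` are hypothetical names, and the timing uses `time.perf_counter` without any GPU synchronization.

```python
import itertools
import time


def sweep(func, param_ranges, fixed_args=(), n_repeat=3):
    """Time func for every combination of the values in param_ranges."""
    names = list(param_ranges)
    results = []
    for values in itertools.product(*(param_ranges[n] for n in names)):
        params = dict(zip(names, values))
        times = []
        for _ in range(n_repeat):
            start = time.perf_counter()
            func(*fixed_args, **params)
            times.append(time.perf_counter() - start)
        # Keep the best of n_repeat runs for this combination.
        results.append((params, min(times)))
    return results


def chunked_sum(data, chunk, scale):
    # Toy workload: sum the data in chunks of the given size.
    return scale * sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))


results = sweep(chunked_sum, {"chunk": [16, 256], "scale": [1, 2]},
                fixed_args=(list(range(4096)),))
best_params, best_time = min(results, key=lambda item: item[1])
print(f"fastest: {best_params} in {best_time:.6f}s")
```

This sketch returns a list of `(params, time)` pairs rather than a dictionary keyed by parameters, because Python dictionaries are not hashable and so cannot serve as keys.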

## Usage Examples

### Basic Benchmarking

```python
import cupy as cp
from cupyx import profiler

# Simple function benchmarking
def matrix_multiply(a, b):
    return cp.dot(a, b)

# Create test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Benchmark the function
results = profiler.benchmark(
    matrix_multiply,
    args=(a, b),
    n_warmup=3,
    n_repeat=10,
    n_run=1
)

print(f"Mean execution time: {results['mean']:.4f} seconds")
print(f"Standard deviation: {results['std']:.4f} seconds")
print(f"Min time: {results['min']:.4f} seconds")
print(f"Max time: {results['max']:.4f} seconds")
print(f"Throughput: {results['throughput']:.2f} GFLOPS")
```
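
A GFLOPS figure for dense matrix multiplication is commonly derived from the operation count: multiplying two n × n matrices takes about 2n³ floating-point operations, one multiply and one add per inner-product term. The helper below shows only that arithmetic. `matmul_gflops` is a hypothetical name used for illustration, and the 10 ms timing is a made-up input, not a measurement.

```python
def matmul_gflops(n, seconds):
    """Approximate GFLOPS for the product of two n x n matrices."""
    flops = 2 * n ** 3  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


# Example: a 2048 x 2048 product that is assumed to take 10 ms.
print(f"{matmul_gflops(2048, 0.010):.1f} GFLOPS")
```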

### Time Range Profiling

```python
# Using time_range for custom timing
with profiler.time_range() as timer:
    # Complex computation sequence
    x = cp.random.rand(10000, 10000)
    y = cp.fft.fft2(x)
    z = cp.abs(y) ** 2
    result = cp.sum(z)

    # Ensure all operations complete before the timer stops
    cp.cuda.Stream.null.synchronize()

print(f"Total execution time: {timer.elapsed_time:.4f} seconds")

# Multiple timing ranges
operations = {}

with profiler.time_range() as timer:
    data = cp.random.rand(5000, 5000)
operations['data_generation'] = timer.elapsed_time

with profiler.time_range() as timer:
    processed = cp.sin(data) * cp.cos(data)
operations['trigonometric'] = timer.elapsed_time

with profiler.time_range() as timer:
    result = cp.linalg.svd(processed[:1000, :1000])
operations['svd'] = timer.elapsed_time

for op, elapsed in operations.items():
    print(f"{op}: {elapsed:.4f} seconds")
```

### Memory Profiling

```python
# Memory usage analysis
with profiler.memory_profile() as mem_prof:
    # Allocate large arrays
    arrays = []
    for i in range(10):
        arr = cp.random.rand(1000, 1000)
        arrays.append(arr)

    # Perform operations that may fragment memory
    results = []
    for arr in arrays:
        processed = cp.fft.fft2(arr)
        filtered = cp.abs(processed) > 0.5
        results.append(cp.sum(filtered))

    # Cleanup some arrays
    del arrays[:5]

print(f"Peak memory usage: {mem_prof.peak_memory / 1024**3:.2f} GB")
mem_prof.print_memory_report()

# Memory trace analysis
profiler.trace_memory(True)

# Operations to trace
large_array = cp.zeros((10000, 10000))
temp_arrays = [cp.random.rand(1000, 1000) for _ in range(50)]
del temp_arrays  # Free memory

# Get memory trace
trace = profiler.get_memory_trace()
print(f"Number of memory operations: {len(trace)}")

profiler.trace_memory(False)
```

### NVTX Profiling Integration

```python
# Using NVTX markers for external profiling tools
@profiler.nvtx("matrix_operations", color=0xFF0000)
def complex_matrix_operations(data):
    """Function with NVTX profiling markers."""

    with profiler.NVTXRange("preprocessing", color=0x00FF00):
        # Data preprocessing
        normalized = (data - cp.mean(data)) / cp.std(data)
        scaled = normalized * 2.0

    with profiler.NVTXRange("computation", color=0x0000FF):
        # Main computation
        result = cp.linalg.matrix_power(scaled, 3)
        eigenvals = cp.linalg.eigvals(result)

    with profiler.NVTXRange("postprocessing", color=0xFFFF00):
        # Postprocessing
        sorted_vals = cp.sort(eigenvals)
        final_result = cp.real(sorted_vals)

    return final_result

# Use the profiled function
# cp.random.rand generates real values only, so cast to complex afterwards
test_matrix = cp.random.rand(500, 500, dtype=cp.float32).astype(cp.complex64)
result = complex_matrix_operations(test_matrix)

# Manual NVTX markers
profiler.nvtx_mark("Starting algorithm", color=0xFF00FF)

profiler.nvtx_push("Algorithm Phase 1", color=0x00FFFF)
# Phase 1 operations
phase1_data = cp.random.rand(1000, 1000)
phase1_result = cp.sum(phase1_data, axis=0)
profiler.nvtx_pop()

profiler.nvtx_push("Algorithm Phase 2", color=0xFF8000)
# Phase 2 operations
phase2_result = cp.cumsum(phase1_result)
profiler.nvtx_pop()

profiler.nvtx_mark("Algorithm completed", color=0x8000FF)
```

### Comprehensive Profiling

```python
# Full profiling session
with profiler.profile() as prof:
    # Data preparation
    print("Preparing data...")
    data_size = 8192
    matrix_a = cp.random.rand(data_size, data_size, dtype=cp.float32)
    matrix_b = cp.random.rand(data_size, data_size, dtype=cp.float32)

    # Matrix multiplication
    print("Performing matrix multiplication...")
    result_mm = cp.dot(matrix_a, matrix_b)

    # FFT operations
    print("Performing FFT...")
    # cp.random.rand generates real values only, so cast to complex afterwards
    fft_data = cp.random.rand(data_size, data_size, dtype=cp.float32).astype(cp.complex64)
    fft_result = cp.fft.fft2(fft_data)

    # Reduction operations
    print("Performing reductions...")
    sum_result = cp.sum(result_mm)
    mean_result = cp.mean(fft_result)

    # Linear algebra
    print("Performing linear algebra...")
    smaller_matrix = matrix_a[:1000, :1000]
    eigenvals = cp.linalg.eigvals(smaller_matrix)

    # Custom kernel
    print("Running custom kernel...")
    # ElementwiseKernel is a class, so it is instantiated rather than used as a decorator
    magnitude_kernel = cp.ElementwiseKernel(
        'T x, T y', 'T z', 'z = sqrt(x*x + y*y)', 'magnitude_kernel'
    )

    mag_result = magnitude_kernel(matrix_a, matrix_b)

# Print comprehensive report
prof.print_report()

# Save report to file
prof.save_report("profiling_report.txt")
```

### Kernel Performance Analysis

```python
# Analyze custom kernel performance
kernel_code = r'''
extern "C" __global__
void optimized_reduction(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}
'''

reduction_kernel = cp.RawKernel(kernel_code, 'optimized_reduction')

# Analyze kernel performance
with profiler.kernel_profile() as kernel_prof:
    input_data = cp.random.rand(1000000, dtype=cp.float32)
    block_size = 256
    grid_size = (input_data.size + block_size - 1) // block_size
    output = cp.zeros(grid_size, dtype=cp.float32)

    # Launch kernel multiple times
    for _ in range(100):
        reduction_kernel(
            (grid_size,),
            (block_size,),
            # Pass n as a 32-bit integer to match the kernel's `int n` parameter
            (input_data, output, cp.int32(input_data.size)),
            shared_mem=block_size * 4
        )

kernel_prof.print_kernel_report()

# Occupancy analysis
occupancy_info = profiler.analyze_occupancy(
    reduction_kernel,
    block_size=256,
    shared_mem=256 * 4
)

print("Occupancy Analysis:")
print(f"Theoretical occupancy: {occupancy_info['theoretical_occupancy']:.2%}")
print(f"Blocks per SM: {occupancy_info['blocks_per_sm']}")
print(f"Threads per SM: {occupancy_info['threads_per_sm']}")
```

### Comparative Benchmarking

```python
# Compare different matrix multiplication implementations
def cupy_dot(a, b):
    return cp.dot(a, b)

def cupy_matmul(a, b):
    return cp.matmul(a, b)

def cupy_einsum(a, b):
    return cp.einsum('ij,jk->ik', a, b)

# Prepare test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Compare implementations
comparison = profiler.compare_implementations(
    cupy_dot, cupy_matmul, cupy_einsum,
    args=(a, b),
    names=['cp.dot', 'cp.matmul', 'cp.einsum']
)

print("Performance Comparison:")
for name, stats in comparison.items():
    print(f"{name:12}: {stats['mean']:.4f}s ± {stats['std']:.4f}s")

# Parameter sweep for optimal block size
def custom_kernel_test(data, block_size):
    # Custom kernel with configurable block size
    threads_per_block = block_size
    blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

    result = cp.zeros(blocks_per_grid)
    # Kernel launch would go here
    return result

data = cp.random.rand(1000000)
param_ranges = {'block_size': [64, 128, 256, 512, 1024]}

sweep_results = profiler.parameter_sweep(
    custom_kernel_test,
    param_ranges,
    fixed_args=(data,)
)

print("Parameter Sweep Results:")
for params, timing in sweep_results.items():
    print(f"Block size {params['block_size']}: {timing['mean']:.4f}s")
```

### Scaling Analysis

```python
# Analyze how performance scales with data size
def scaling_test_function(data):
    # Test function that should scale with data size
    result = cp.fft.fft(data)
    magnitude = cp.abs(result)
    return cp.sum(magnitude)

# Test with different data sizes
data_sizes = [1000, 5000, 10000, 50000, 100000, 500000, 1000000]

scaling_results = profiler.scaling_analysis(
    scaling_test_function,
    data_sizes,
    dtype=cp.complex64
)

print("Scaling Analysis:")
print("Size\t\tTime (s)\tThroughput (MB/s)")
for size, stats in scaling_results.items():
    throughput = (size * 8) / (stats['mean'] * 1024**2)  # Complex64 = 8 bytes
    print(f"{size:8}\t{stats['mean']:.4f}\t\t{throughput:.2f}")

# Memory bandwidth test
def memory_bandwidth_test(size):
    """Test memory bandwidth with different array sizes."""
    data = cp.random.rand(size, dtype=cp.float32)
    return cp.sum(data)

memory_sizes = [10**i for i in range(4, 8)]  # 10K to 10M elements
bandwidth_results = profiler.scaling_analysis(
    memory_bandwidth_test,
    memory_sizes
)

print("\nMemory Bandwidth Analysis:")
for size, stats in bandwidth_results.items():
    bandwidth_gbps = (size * 4) / (stats['mean'] * 1024**3)  # Float32 = 4 bytes
    print(f"Size: {size:8} elements, Bandwidth: {bandwidth_gbps:.2f} GB/s")
```
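
One way to summarize a table of sizes and timings like the one printed above is to estimate a scaling exponent: the slope of log(time) against log(size). A slope near 1 suggests roughly linear scaling, and a slope near 2 suggests quadratic. The sketch below is an illustration under stated assumptions: `scaling_exponent` is a hypothetical helper, and the timings are synthetic values generated from an n log n model, not measurements.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Synthetic timings that grow like n * log2(n), as an FFT would in theory.
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [1e-9 * n * math.log2(n) for n in sizes]
print(f"estimated exponent: {scaling_exponent(sizes, times):.2f}")
```

On real GPU measurements the smallest sizes are often dominated by launch overhead, so the fitted slope there can be well below the asymptotic value.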

### Advanced Profiling Workflows

```python
# Production profiling workflow
class ProductionProfiler:
    def __init__(self, enable_profiling=True):
        self.enable_profiling = enable_profiling
        self.profiles = {}

    def profile_section(self, name):
        """Context manager for profiling code sections."""
        # A timer is returned even when profiling is disabled, so callers
        # can always read timer.elapsed_time.
        return profiler.time_range()

    def benchmark_operation(self, name, func, *args, **kwargs):
        """Benchmark a specific operation."""
        if not self.enable_profiling:
            return func(*args, **kwargs)

        with profiler.time_range() as timer:
            result = func(*args, **kwargs)

        self.profiles[name] = timer.elapsed_time
        return result

    def print_summary(self):
        """Print profiling summary."""
        if not self.profiles:
            print("No profiling data collected")
            return

        print("Performance Summary:")
        print("-" * 40)
        total_time = sum(self.profiles.values())

        for name, elapsed in sorted(self.profiles.items(), key=lambda x: x[1], reverse=True):
            percentage = (elapsed / total_time) * 100
            print(f"{name:25}: {elapsed:.4f}s ({percentage:.1f}%)")

        print("-" * 40)
        print(f"Total time: {total_time:.4f}s")

# Use production profiler
profiler_instance = ProductionProfiler(enable_profiling=True)

# Profile different operations
data = profiler_instance.benchmark_operation(
    "data_generation",
    cp.random.rand,
    5000, 5000
)

fft_result = profiler_instance.benchmark_operation(
    "fft_computation",
    cp.fft.fft2,
    data
)

with profiler_instance.profile_section("postprocessing") as timer:
    magnitude = cp.abs(fft_result)
    result = cp.sum(magnitude)
profiler_instance.profiles["postprocessing"] = timer.elapsed_time

# Print comprehensive summary
profiler_instance.print_summary()
```

Performance profiling in CuPy provides tools for optimizing GPU applications: identifying bottlenecks, measuring execution characteristics, and checking how well GPU resources are used across different workloads and hardware configurations.