# Performance Profiling

CuPy provides performance profiling and benchmarking tools through the `cupyx.profiler` module. These let developers measure execution times, analyze GPU utilization, identify performance bottlenecks, and tune CUDA applications for throughput and efficiency.

## Capabilities

### Timing and Benchmarking

Core timing utilities for measuring the execution performance of CuPy operations and custom kernels.

```python { .api }
def benchmark(func, args=(), kwargs=None, n_warmup=1, n_repeat=3, n_run=1):
    """
    Benchmark a function with comprehensive timing statistics.

    Executes the function multiple times and provides detailed
    timing statistics including mean, standard deviation, min/max
    execution times, and GPU/CPU timing analysis.

    Parameters:
        func: callable - Function to benchmark
        args: tuple, optional - Positional arguments for function
        kwargs: dict, optional - Keyword arguments for function
        n_warmup: int, optional - Number of warmup runs (default 1)
        n_repeat: int, optional - Number of timing repetitions (default 3)
        n_run: int, optional - Number of function calls per repetition (default 1)

    Returns:
        dict: Benchmark results with timing statistics
    """

def time_range():
    """
    Context manager for measuring execution time ranges.

    Returns a context manager that measures the time between
    entry and exit, accounting for GPU synchronization.

    Returns:
        TimeRangeContext: Context manager for timing
    """

class TimeRangeContext:
    """
    Context manager for timing code execution ranges.

    Provides precise timing measurements for GPU operations
    with proper synchronization handling.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def elapsed_time(self):
        """Get elapsed time in seconds."""

def profile():
    """
    Context manager for comprehensive profiling.

    Enables detailed profiling including NVTX markers,
    memory usage tracking, and kernel execution analysis.

    Returns:
        ProfileContext: Context manager for profiling
    """

class ProfileContext:
    """
    Context manager for comprehensive performance profiling.

    Collects detailed performance metrics including timing,
    memory usage, kernel launches, and GPU utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    def print_report(self):
        """Print detailed profiling report."""

    def save_report(self, filename):
        """Save profiling report to file."""
```

### Memory Profiling

Tools for analyzing GPU memory usage patterns and identifying memory bottlenecks.

```python { .api }
def get_memory_info():
    """
    Get current GPU memory usage information.

    Returns:
        dict: Memory usage statistics including total, used, and free memory
    """

def memory_profile():
    """
    Context manager for memory usage profiling.

    Tracks memory allocations and deallocations during execution
    to identify memory usage patterns and potential leaks.

    Returns:
        MemoryProfileContext: Context manager for memory profiling
    """

class MemoryProfileContext:
    """
    Context manager for tracking memory usage patterns.

    Monitors GPU memory allocations, deallocations, and peak usage
    during code execution.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def peak_memory(self):
        """Peak memory usage during profiling."""

    @property
    def memory_allocations(self):
        """List of memory allocation events."""

    def print_memory_report(self):
        """Print detailed memory usage report."""

def trace_memory(enabled=True):
    """
    Enable or disable memory allocation tracing.

    Parameters:
        enabled: bool - Whether to enable memory tracing
    """

def get_memory_trace():
    """
    Get memory allocation trace information.

    Returns:
        list: Memory allocation trace events
    """
```

### NVTX Integration

NVIDIA Tools Extension (NVTX) integration for advanced profiling with external tools.

```python { .api }
def nvtx_push(message, color=None):
    """
    Push an NVTX range marker.

    Creates a named range marker for profiling tools like Nsight
    to identify code sections and their performance characteristics.

    Parameters:
        message: str - Range description
        color: int, optional - Color code for the range
    """

def nvtx_pop():
    """Pop the most recent NVTX range marker."""

def nvtx_mark(message, color=None):
    """
    Create an NVTX point marker.

    Parameters:
        message: str - Marker description
        color: int, optional - Color code for the marker
    """

def nvtx_range_push(message, color=None):
    """
    Push a named NVTX range (alias for nvtx_push).

    Parameters:
        message: str - Range name
        color: int, optional - Color code
    """

def nvtx_range_pop():
    """Pop the current NVTX range (alias for nvtx_pop)."""

class NVTXRange:
    """
    Context manager for NVTX range markers.

    Automatically pushes and pops NVTX range markers for
    convenient profiling of code blocks.
    """
    def __init__(self, message, color=None):
        """
        Parameters:
            message: str - Range description
            color: int, optional - Color code
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

def nvtx(message=None, color=None):
    """
    Decorator or context manager for NVTX range marking.

    Can be used as a decorator for functions or as a context manager
    for code blocks to automatically add NVTX markers.

    Parameters:
        message: str, optional - Range description
        color: int, optional - Color code
    """
```

### Kernel Performance Analysis

Tools for analyzing individual kernel performance and optimization opportunities.

```python { .api }
def kernel_profile():
    """
    Context manager for kernel-specific profiling.

    Tracks individual kernel launches, execution times,
    and performance characteristics.

    Returns:
        KernelProfileContext: Context manager for kernel profiling
    """

class KernelProfileContext:
    """
    Context manager for detailed kernel performance analysis.

    Collects metrics for individual kernel launches including
    execution time, occupancy, memory throughput, and compute utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def kernel_stats(self):
        """Statistics for executed kernels."""

    def print_kernel_report(self):
        """Print detailed kernel analysis report."""

def get_kernel_info(kernel):
    """
    Get information about a compiled kernel.

    Parameters:
        kernel: RawKernel or similar - Kernel object

    Returns:
        dict: Kernel information including occupancy and resource usage
    """

def analyze_occupancy(kernel, block_size, shared_mem=0):
    """
    Analyze theoretical occupancy for a kernel configuration.

    Parameters:
        kernel: kernel object - Kernel to analyze
        block_size: int - Block size (threads per block)
        shared_mem: int, optional - Shared memory usage per block

    Returns:
        dict: Occupancy analysis results
    """
```

### Comparative Benchmarking

Tools for comparing performance between different implementations and configurations.

```python { .api }
def compare_implementations(*funcs, args=(), kwargs=None, names=None):
    """
    Compare performance of multiple function implementations.

    Benchmarks multiple functions with identical inputs and provides
    comparative analysis of their performance characteristics.

    Parameters:
        *funcs: callable - Functions to compare
        args: tuple, optional - Arguments for all functions
        kwargs: dict, optional - Keyword arguments for all functions
        names: list, optional - Names for each function

    Returns:
        dict: Comparative benchmark results
    """

def parameter_sweep(func, param_ranges, fixed_args=(), fixed_kwargs=None):
    """
    Perform parameter sweep benchmarking.

    Tests function performance across different parameter values
    to identify optimal configurations.

    Parameters:
        func: callable - Function to benchmark
        param_ranges: dict - Parameter names and value ranges
        fixed_args: tuple, optional - Fixed positional arguments
        fixed_kwargs: dict, optional - Fixed keyword arguments

    Returns:
        dict: Parameter sweep results
    """

def scaling_analysis(func, data_sizes, *args, **kwargs):
    """
    Analyze performance scaling with different data sizes.

    Parameters:
        func: callable - Function to analyze
        data_sizes: list - Different input sizes to test
        *args: Additional function arguments
        **kwargs: Additional function keyword arguments

    Returns:
        dict: Scaling analysis results
    """
```

## Usage Examples

### Basic Benchmarking

```python
import cupy as cp
from cupyx import profiler

# Simple function benchmarking
def matrix_multiply(a, b):
    return cp.dot(a, b)

# Create test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Benchmark the function
results = profiler.benchmark(
    matrix_multiply,
    args=(a, b),
    n_warmup=3,
    n_repeat=10,
    n_run=1
)

print(f"Mean execution time: {results['mean']:.4f} seconds")
print(f"Standard deviation: {results['std']:.4f} seconds")
print(f"Min time: {results['min']:.4f} seconds")
print(f"Max time: {results['max']:.4f} seconds")

# Throughput: an (n x n) @ (n x n) matrix multiply costs roughly 2 * n^3 FLOPs
flops = 2 * size ** 3
print(f"Throughput: {flops / (results['mean'] * 1e9):.2f} GFLOPS")
```
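
Warm-up runs matter because the first call to a CuPy operation can include kernel compilation and memory-pool allocation. A small sketch using only the `benchmark` helper described above makes the difference visible (exact numbers depend on the GPU and driver):

```python
# Compare a cold measurement (no warm-up, single repetition) with a warmed-up one
cold = profiler.benchmark(matrix_multiply, args=(a, b), n_warmup=0, n_repeat=1)
warm = profiler.benchmark(matrix_multiply, args=(a, b), n_warmup=3, n_repeat=10)
print(f"Cold mean: {cold['mean']:.4f}s, warm mean: {warm['mean']:.4f}s")
```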

### Time Range Profiling

```python
# Using time_range for custom timing
with profiler.time_range() as timer:
    # Complex computation sequence
    x = cp.random.rand(10000, 10000)
    y = cp.fft.fft2(x)
    z = cp.abs(y) ** 2
    result = cp.sum(z)

    # Ensure all operations complete before the timer stops
    cp.cuda.Device().synchronize()

print(f"Total execution time: {timer.elapsed_time:.4f} seconds")

# Multiple timing ranges
operations = {}

with profiler.time_range() as timer:
    data = cp.random.rand(5000, 5000)
operations['data_generation'] = timer.elapsed_time

with profiler.time_range() as timer:
    processed = cp.sin(data) * cp.cos(data)
operations['trigonometric'] = timer.elapsed_time

with profiler.time_range() as timer:
    result = cp.linalg.svd(processed[:1000, :1000])
operations['svd'] = timer.elapsed_time

for op, time in operations.items():
    print(f"{op}: {time:.4f} seconds")
```
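
Timings like these are typically implemented with CUDA events. For reference, the same kind of measurement can be taken directly with CuPy's `cupy.cuda.Event` API; this is a minimal sketch that is independent of the profiler helpers above:

```python
# Event-based GPU timing with the core CuPy API (get_elapsed_time returns milliseconds)
start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()
data = cp.random.rand(5000, 5000)
processed = cp.sin(data) * cp.cos(data)
end.record()
end.synchronize()  # wait until all work recorded before `end` has finished

elapsed_ms = cp.cuda.get_elapsed_time(start, end)
print(f"Elapsed GPU time: {elapsed_ms / 1000:.4f} seconds")
```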

### Memory Profiling

```python
# Memory usage analysis
with profiler.memory_profile() as mem_prof:
    # Allocate large arrays
    arrays = []
    for i in range(10):
        arr = cp.random.rand(1000, 1000)
        arrays.append(arr)

    # Perform operations that may fragment memory
    results = []
    for arr in arrays:
        processed = cp.fft.fft2(arr)
        filtered = cp.abs(processed) > 0.5
        results.append(cp.sum(filtered))

    # Clean up some of the arrays
    del arrays[:5]

print(f"Peak memory usage: {mem_prof.peak_memory / 1024**3:.2f} GB")
mem_prof.print_memory_report()

# Memory trace analysis
profiler.trace_memory(True)

# Operations to trace
large_array = cp.zeros((10000, 10000))
temp_arrays = [cp.random.rand(1000, 1000) for _ in range(50)]
del temp_arrays  # Free memory

# Get memory trace
trace = profiler.get_memory_trace()
print(f"Number of memory operations: {len(trace)}")

profiler.trace_memory(False)
```
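
Independent of the helpers above, current memory figures can also be read from CuPy's default memory pool and from the device itself; a short sketch using the core CuPy API:

```python
# Query CuPy's default memory pool and the device for current memory usage
pool = cp.get_default_memory_pool()
print(f"Pool used:  {pool.used_bytes() / 1024**2:.1f} MiB")
print(f"Pool total: {pool.total_bytes() / 1024**2:.1f} MiB")

free_bytes, total_bytes = cp.cuda.Device().mem_info
print(f"Device free/total: {free_bytes / 1024**3:.2f} / {total_bytes / 1024**3:.2f} GiB")

# Return cached blocks to the driver if needed
pool.free_all_blocks()
```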

### NVTX Profiling Integration

```python
# Using NVTX markers for external profiling tools
@profiler.nvtx("matrix_operations", color=0xFF0000)
def complex_matrix_operations(data):
    """Function with NVTX profiling markers."""
    with profiler.NVTXRange("preprocessing", color=0x00FF00):
        # Data preprocessing
        normalized = (data - cp.mean(data)) / cp.std(data)
        scaled = normalized * 2.0

    with profiler.NVTXRange("computation", color=0x0000FF):
        # Main computation (CuPy provides eigvalsh for symmetric matrices)
        result = cp.linalg.matrix_power(scaled, 3)
        eigenvals = cp.linalg.eigvalsh(result + result.T)

    with profiler.NVTXRange("postprocessing", color=0xFFFF00):
        # Postprocessing
        sorted_vals = cp.sort(eigenvals)
        final_result = cp.real(sorted_vals)

    return final_result

# Use the profiled function
test_matrix = cp.random.rand(500, 500, dtype=cp.float32)
result = complex_matrix_operations(test_matrix)

# Manual NVTX markers
profiler.nvtx_mark("Starting algorithm", color=0xFF00FF)

profiler.nvtx_push("Algorithm Phase 1", color=0x00FFFF)
# Phase 1 operations
phase1_data = cp.random.rand(1000, 1000)
phase1_result = cp.sum(phase1_data, axis=0)
profiler.nvtx_pop()

profiler.nvtx_push("Algorithm Phase 2", color=0xFF8000)
# Phase 2 operations
phase2_result = cp.cumsum(phase1_result)
profiler.nvtx_pop()

profiler.nvtx_mark("Algorithm completed", color=0x8000FF)
```
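
CuPy also exposes NVTX markers at a lower level through `cupy.cuda.nvtx` (available when NVTX support is built in). A minimal sketch whose ranges show up when the script runs under an external profiler such as Nsight Systems:

```python
from cupy.cuda import nvtx

# Low-level NVTX markers; visible e.g. under `nsys profile python script.py`
nvtx.RangePush("phase-1")
phase1 = cp.sum(cp.random.rand(1000, 1000), axis=0)
nvtx.RangePop()

nvtx.Mark("phase-1 done")
```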

### Comprehensive Profiling

```python
# Full profiling session
with profiler.profile() as prof:
    # Data preparation
    print("Preparing data...")
    data_size = 8192
    matrix_a = cp.random.rand(data_size, data_size, dtype=cp.float32)
    matrix_b = cp.random.rand(data_size, data_size, dtype=cp.float32)

    # Matrix multiplication
    print("Performing matrix multiplication...")
    result_mm = cp.dot(matrix_a, matrix_b)

    # FFT operations
    print("Performing FFT...")
    fft_data = (cp.random.rand(data_size, data_size)
                + 1j * cp.random.rand(data_size, data_size)).astype(cp.complex64)
    fft_result = cp.fft.fft2(fft_data)

    # Reduction operations
    print("Performing reductions...")
    sum_result = cp.sum(result_mm)
    mean_result = cp.mean(fft_result)

    # Linear algebra (eigvalsh expects a symmetric matrix)
    print("Performing linear algebra...")
    smaller_matrix = matrix_a[:1000, :1000]
    eigenvals = cp.linalg.eigvalsh(smaller_matrix + smaller_matrix.T)

    # Custom elementwise kernel
    print("Running custom kernel...")
    magnitude_kernel = cp.ElementwiseKernel(
        'T x, T y', 'T z', 'z = sqrt(x * x + y * y)', 'magnitude'
    )
    mag_result = magnitude_kernel(matrix_a, matrix_b)

# Print comprehensive report
prof.print_report()

# Save report to file
prof.save_report("profiling_report.txt")
```
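
In practice, a `profile()`-style context manager is most useful together with an external profiler, so that only the region of interest is captured. A sketch of that pattern (the `nsys` command line is one example of such a profiler invocation):

```python
# Run the script under an external profiler and capture only this region, e.g.:
#   nsys profile --capture-range=cudaProfilerApi python script.py
with profiler.profile():
    region_result = cp.dot(matrix_a, matrix_b)
```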

### Kernel Performance Analysis

```python
# Analyze custom kernel performance
kernel_code = r'''
extern "C" __global__
void optimized_reduction(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}
'''

reduction_kernel = cp.RawKernel(kernel_code, 'optimized_reduction')

# Analyze kernel performance
with profiler.kernel_profile() as kernel_prof:
    input_data = cp.random.rand(1000000, dtype=cp.float32)
    block_size = 256
    grid_size = (input_data.size + block_size - 1) // block_size
    output = cp.zeros(grid_size, dtype=cp.float32)

    # Launch the kernel multiple times (pass n as a 32-bit integer to match `int n`)
    for _ in range(100):
        reduction_kernel(
            (grid_size,),
            (block_size,),
            (input_data, output, cp.int32(input_data.size)),
            shared_mem=block_size * 4
        )

kernel_prof.print_kernel_report()

# Occupancy analysis
occupancy_info = profiler.analyze_occupancy(
    reduction_kernel,
    block_size=256,
    shared_mem=256 * 4
)

print("Occupancy Analysis:")
print(f"Theoretical occupancy: {occupancy_info['theoretical_occupancy']:.2%}")
print(f"Blocks per SM: {occupancy_info['blocks_per_sm']}")
print(f"Threads per SM: {occupancy_info['threads_per_sm']}")
```
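
On recent CuPy versions, some per-kernel resource information is also available directly from the compiled `RawKernel` object, which complements an occupancy analysis; a short sketch:

```python
# Inspect the compiled kernel's resource usage as reported by the CUDA driver
print(f"Max threads per block: {reduction_kernel.max_threads_per_block}")
print(f"Registers per thread:  {reduction_kernel.num_regs}")
print(f"Static shared memory:  {reduction_kernel.shared_size_bytes} bytes")

# Full attribute dictionary
for attr_name, value in reduction_kernel.attributes.items():
    print(f"{attr_name}: {value}")
```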

### Comparative Benchmarking

```python
# Compare different matrix multiplication implementations
def cupy_dot(a, b):
    return cp.dot(a, b)

def cupy_matmul(a, b):
    return cp.matmul(a, b)

def cupy_einsum(a, b):
    return cp.einsum('ij,jk->ik', a, b)

# Prepare test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Compare implementations
comparison = profiler.compare_implementations(
    cupy_dot, cupy_matmul, cupy_einsum,
    args=(a, b),
    names=['cp.dot', 'cp.matmul', 'cp.einsum']
)

print("Performance Comparison:")
for name, stats in comparison.items():
    print(f"{name:12}: {stats['mean']:.4f}s ± {stats['std']:.4f}s")

# Parameter sweep for the optimal block size
def custom_kernel_test(data, block_size):
    # Placeholder for a custom kernel with a configurable block size
    threads_per_block = block_size
    blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

    result = cp.zeros(blocks_per_grid)
    # A kernel launch with (blocks_per_grid,), (threads_per_block,) would go here
    return result

data = cp.random.rand(1000000)
param_ranges = {'block_size': [64, 128, 256, 512, 1024]}

sweep_results = profiler.parameter_sweep(
    custom_kernel_test,
    param_ranges,
    fixed_args=(data,)
)

print("Parameter Sweep Results:")
for block_size, timing in sweep_results.items():
    print(f"Block size {block_size}: {timing['mean']:.4f}s")
```
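
The same comparison can be expressed as a plain loop over the `benchmark` helper, which is convenient when only that entry point is needed; a sketch reusing the functions and matrices defined above:

```python
# Manual comparison using benchmark() directly
candidates = {'cp.dot': cupy_dot, 'cp.matmul': cupy_matmul, 'cp.einsum': cupy_einsum}
for name, func in candidates.items():
    stats = profiler.benchmark(func, args=(a, b), n_warmup=3, n_repeat=10)
    print(f"{name:12}: {stats['mean']:.4f}s ± {stats['std']:.4f}s")
```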

### Scaling Analysis

```python
# Analyze how performance scales with data size
def scaling_test_function(size):
    # Build a complex64 input of the requested size, then FFT and reduce
    data = (cp.random.rand(size) + 1j * cp.random.rand(size)).astype(cp.complex64)
    result = cp.fft.fft(data)
    magnitude = cp.abs(result)
    return cp.sum(magnitude)

# Test with different data sizes
data_sizes = [1000, 5000, 10000, 50000, 100000, 500000, 1000000]

scaling_results = profiler.scaling_analysis(
    scaling_test_function,
    data_sizes
)

print("Scaling Analysis:")
print("Size\t\tTime (s)\tThroughput (MB/s)")
for size, stats in scaling_results.items():
    throughput = (size * 8) / (stats['mean'] * 1024**2)  # complex64 = 8 bytes per element
    print(f"{size:8}\t{stats['mean']:.4f}\t\t{throughput:.2f}")

# Memory bandwidth test
def memory_bandwidth_test(size):
    """Test memory bandwidth with different array sizes."""
    data = cp.random.rand(size, dtype=cp.float32)
    return cp.sum(data)

memory_sizes = [10**i for i in range(4, 8)]  # 10K to 10M elements

bandwidth_results = profiler.scaling_analysis(
    memory_bandwidth_test,
    memory_sizes
)

print("\nMemory Bandwidth Analysis:")
for size, stats in bandwidth_results.items():
    bandwidth_gbps = (size * 4) / (stats['mean'] * 1024**3)  # float32 = 4 bytes per element
    print(f"Size: {size:8} elements, Bandwidth: {bandwidth_gbps:.2f} GB/s")
```
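
A least-squares fit of log-time against log-size turns these measurements into an empirical scaling exponent, which is easier to compare against the expected complexity. A sketch assuming `scaling_results` maps each size to its timing statistics, as in the loop above:

```python
import numpy as np

# Fit log(time) = slope * log(size) + intercept; slope near 1 is linear, near 2 quadratic
sizes = np.array(sorted(scaling_results))
times = np.array([scaling_results[s]['mean'] for s in sizes])
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"Empirical scaling exponent: {slope:.2f}")
```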

### Advanced Profiling Workflows

```python
import contextlib

# Production profiling workflow
class ProductionProfiler:
    def __init__(self, enable_profiling=True):
        self.enable_profiling = enable_profiling
        self.profiles = {}

    def profile_section(self, name):
        """Context manager for profiling code sections."""
        if not self.enable_profiling:
            # No-op context manager when profiling is disabled
            return contextlib.nullcontext()
        return profiler.time_range()

    def benchmark_operation(self, name, func, *args, **kwargs):
        """Benchmark a specific operation."""
        if not self.enable_profiling:
            return func(*args, **kwargs)

        with profiler.time_range() as timer:
            result = func(*args, **kwargs)

        self.profiles[name] = timer.elapsed_time
        return result

    def print_summary(self):
        """Print profiling summary."""
        if not self.profiles:
            print("No profiling data collected")
            return

        print("Performance Summary:")
        print("-" * 40)
        total_time = sum(self.profiles.values())

        for name, time in sorted(self.profiles.items(), key=lambda x: x[1], reverse=True):
            percentage = (time / total_time) * 100
            print(f"{name:25}: {time:.4f}s ({percentage:.1f}%)")

        print("-" * 40)
        print(f"Total time: {total_time:.4f}s")

# Use the production profiler
profiler_instance = ProductionProfiler(enable_profiling=True)

# Profile different operations
data = profiler_instance.benchmark_operation(
    "data_generation",
    cp.random.rand,
    5000, 5000
)

fft_result = profiler_instance.benchmark_operation(
    "fft_computation",
    cp.fft.fft2,
    data
)

with profiler_instance.profile_section("postprocessing") as timer:
    magnitude = cp.abs(fft_result)
    result = cp.sum(magnitude)
profiler_instance.profiles["postprocessing"] = timer.elapsed_time

# Print comprehensive summary
profiler_instance.print_summary()
```
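
The same bookkeeping can be packaged as a decorator so that individual functions record their own timings. A small sketch built on `time_range` and the `profiler_instance` defined above; the `profiled` helper is a hypothetical convenience wrapper, not part of the profiler API:

```python
import functools

def profiled(name, prof=profiler_instance):
    """Record the wrapped function's elapsed time under `name` (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with profiler.time_range() as timer:
                result = func(*args, **kwargs)
            prof.profiles[name] = timer.elapsed_time
            return result
        return wrapper
    return decorator

@profiled("normalization")
def normalize(x):
    return (x - cp.mean(x)) / cp.std(x)

normalized = normalize(data)
profiler_instance.print_summary()
```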

Performance profiling in CuPy provides the tools needed to optimize GPU applications: identifying bottlenecks, measuring execution characteristics, and keeping GPU resources fully utilized across different computational workloads and hardware configurations.