# Performance Profiling

CuPy provides performance profiling and benchmarking tools through the `cupyx.profiler` module. These let developers measure execution times, analyze GPU utilization, identify performance bottlenecks, and tune CUDA applications for throughput and efficiency.

## Capabilities

### Timing and Benchmarking

Core timing utilities for measuring the execution performance of CuPy operations and custom kernels.

```python { .api }
def benchmark(func, args=(), kwargs=None, n_warmup=1, n_repeat=3, n_run=1):
    """
    Benchmark a function with comprehensive timing statistics.

    Executes the function multiple times and provides detailed
    timing statistics including mean, standard deviation, min/max
    execution times, and GPU/CPU timing analysis.

    Parameters:
        func: callable - Function to benchmark
        args: tuple, optional - Positional arguments for function
        kwargs: dict, optional - Keyword arguments for function
        n_warmup: int, optional - Number of warmup runs (default 1)
        n_repeat: int, optional - Number of timing repetitions (default 3)
        n_run: int, optional - Number of function calls per repetition (default 1)

    Returns:
        dict: Benchmark results with timing statistics
    """

def time_range():
    """
    Context manager for measuring execution time ranges.

    Returns a context manager that measures the time between
    entry and exit, accounting for GPU synchronization.

    Returns:
        TimeRangeContext: Context manager for timing
    """

class TimeRangeContext:
    """
    Context manager for timing code execution ranges.

    Provides precise timing measurements for GPU operations
    with proper synchronization handling.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def elapsed_time(self):
        """Get elapsed time in seconds."""

def profile():
    """
    Context manager for comprehensive profiling.

    Enables detailed profiling including NVTX markers,
    memory usage tracking, and kernel execution analysis.

    Returns:
        ProfileContext: Context manager for profiling
    """

class ProfileContext:
    """
    Context manager for comprehensive performance profiling.

    Collects detailed performance metrics including timing,
    memory usage, kernel launches, and GPU utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    def print_report(self):
        """Print detailed profiling report."""

    def save_report(self, filename):
        """Save profiling report to file."""
```

### Memory Profiling

Tools for analyzing GPU memory usage patterns and identifying memory bottlenecks.

```python { .api }
def get_memory_info():
    """
    Get current GPU memory usage information.

    Returns:
        dict: Memory usage statistics including total, used, and free memory
    """

def memory_profile():
    """
    Context manager for memory usage profiling.

    Tracks memory allocations and deallocations during execution
    to identify memory usage patterns and potential leaks.

    Returns:
        MemoryProfileContext: Context manager for memory profiling
    """

class MemoryProfileContext:
    """
    Context manager for tracking memory usage patterns.

    Monitors GPU memory allocations, deallocations, and peak usage
    during code execution.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def peak_memory(self):
        """Peak memory usage during profiling."""

    @property
    def memory_allocations(self):
        """List of memory allocation events."""

    def print_memory_report(self):
        """Print detailed memory usage report."""

def trace_memory(enabled=True):
    """
    Enable or disable memory allocation tracing.

    Parameters:
        enabled: bool - Whether to enable memory tracing
    """

def get_memory_trace():
    """
    Get memory allocation trace information.

    Returns:
        list: Memory allocation trace events
    """
```

### NVTX Integration

NVIDIA Tools Extension (NVTX) integration for advanced profiling with external tools.

```python { .api }
def nvtx_push(message, color=None):
    """
    Push an NVTX range marker.

    Creates a named range marker for profiling tools like Nsight
    to identify code sections and their performance characteristics.

    Parameters:
        message: str - Range description
        color: int, optional - Color code for the range
    """

def nvtx_pop():
    """Pop the most recent NVTX range marker."""

def nvtx_mark(message, color=None):
    """
    Create an NVTX point marker.

    Parameters:
        message: str - Marker description
        color: int, optional - Color code for the marker
    """

def nvtx_range_push(message, color=None):
    """
    Push a named NVTX range (alias for nvtx_push).

    Parameters:
        message: str - Range name
        color: int, optional - Color code
    """

def nvtx_range_pop():
    """Pop the current NVTX range (alias for nvtx_pop)."""

class NVTXRange:
    """
    Context manager for NVTX range markers.

    Automatically pushes and pops NVTX range markers for
    convenient profiling of code blocks.
    """
    def __init__(self, message, color=None):
        """
        Parameters:
            message: str - Range description
            color: int, optional - Color code
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

def nvtx(message=None, color=None):
    """
    Decorator or context manager for NVTX range marking.

    Can be used as a decorator for functions or as a context manager
    for code blocks to automatically add NVTX markers.

    Parameters:
        message: str, optional - Range description
        color: int, optional - Color code
    """
```

### Kernel Performance Analysis

Tools for analyzing individual kernel performance and optimization opportunities.

```python { .api }
def kernel_profile():
    """
    Context manager for kernel-specific profiling.

    Tracks individual kernel launches, execution times,
    and performance characteristics.

    Returns:
        KernelProfileContext: Context manager for kernel profiling
    """

class KernelProfileContext:
    """
    Context manager for detailed kernel performance analysis.

    Collects metrics for individual kernel launches including
    execution time, occupancy, memory throughput, and compute utilization.
    """
    def __enter__(self): ...
    def __exit__(self, *args): ...

    @property
    def kernel_stats(self):
        """Statistics for executed kernels."""

    def print_kernel_report(self):
        """Print detailed kernel analysis report."""

def get_kernel_info(kernel):
    """
    Get information about a compiled kernel.

    Parameters:
        kernel: RawKernel or similar - Kernel object

    Returns:
        dict: Kernel information including occupancy and resource usage
    """

def analyze_occupancy(kernel, block_size, shared_mem=0):
    """
    Analyze theoretical occupancy for a kernel configuration.

    Parameters:
        kernel: kernel object - Kernel to analyze
        block_size: int - Block size (threads per block)
        shared_mem: int, optional - Shared memory usage per block

    Returns:
        dict: Occupancy analysis results
    """
```

### Comparative Benchmarking

Tools for comparing performance between different implementations and configurations.

```python { .api }
def compare_implementations(*funcs, args=(), kwargs=None, names=None):
    """
    Compare performance of multiple function implementations.

    Benchmarks multiple functions with identical inputs and provides
    comparative analysis of their performance characteristics.

    Parameters:
        *funcs: callable - Functions to compare
        args: tuple, optional - Arguments for all functions
        kwargs: dict, optional - Keyword arguments for all functions
        names: list, optional - Names for each function

    Returns:
        dict: Comparative benchmark results
    """

def parameter_sweep(func, param_ranges, fixed_args=(), fixed_kwargs=None):
    """
    Perform parameter sweep benchmarking.

    Tests function performance across different parameter values
    to identify optimal configurations.

    Parameters:
        func: callable - Function to benchmark
        param_ranges: dict - Parameter names and value ranges
        fixed_args: tuple, optional - Fixed positional arguments
        fixed_kwargs: dict, optional - Fixed keyword arguments

    Returns:
        dict: Parameter sweep results
    """

def scaling_analysis(func, data_sizes, *args, **kwargs):
    """
    Analyze performance scaling with different data sizes.

    Parameters:
        func: callable - Function to analyze
        data_sizes: list - Different input sizes to test
        *args: Additional function arguments
        **kwargs: Additional function keyword arguments

    Returns:
        dict: Scaling analysis results
    """
```

## Usage Examples

### Basic Benchmarking

```python
import cupy as cp
from cupyx import profiler

# Simple function benchmarking
def matrix_multiply(a, b):
    return cp.dot(a, b)

# Create test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Benchmark the function
results = profiler.benchmark(
    matrix_multiply,
    args=(a, b),
    n_warmup=3,
    n_repeat=10,
    n_run=1
)

print(f"Mean execution time: {results['mean']:.4f} seconds")
print(f"Standard deviation: {results['std']:.4f} seconds")
print(f"Min time: {results['min']:.4f} seconds")
print(f"Max time: {results['max']:.4f} seconds")

# Throughput: an (n x n) @ (n x n) matrix multiply costs roughly 2 * n^3 FLOPs
flops = 2 * size ** 3
print(f"Throughput: {flops / (results['mean'] * 1e9):.2f} GFLOPS")
```
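
Warm-up runs matter because the first call to a CuPy operation can include kernel compilation and memory-pool allocation. A small sketch using only the `benchmark` helper described above makes the difference visible (exact numbers depend on the GPU and driver):

```python
# Compare a cold measurement (no warm-up, single repetition) with a warmed-up one
cold = profiler.benchmark(matrix_multiply, args=(a, b), n_warmup=0, n_repeat=1)
warm = profiler.benchmark(matrix_multiply, args=(a, b), n_warmup=3, n_repeat=10)
print(f"Cold mean: {cold['mean']:.4f}s, warm mean: {warm['mean']:.4f}s")
```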

### Time Range Profiling

```python
# Using time_range for custom timing
with profiler.time_range() as timer:
    # Complex computation sequence
    x = cp.random.rand(10000, 10000)
    y = cp.fft.fft2(x)
    z = cp.abs(y) ** 2
    result = cp.sum(z)

    # Ensure all operations complete before the timer stops
    cp.cuda.Device().synchronize()

print(f"Total execution time: {timer.elapsed_time:.4f} seconds")

# Multiple timing ranges
operations = {}

with profiler.time_range() as timer:
    data = cp.random.rand(5000, 5000)
operations['data_generation'] = timer.elapsed_time

with profiler.time_range() as timer:
    processed = cp.sin(data) * cp.cos(data)
operations['trigonometric'] = timer.elapsed_time

with profiler.time_range() as timer:
    result = cp.linalg.svd(processed[:1000, :1000])
operations['svd'] = timer.elapsed_time

for op, time in operations.items():
    print(f"{op}: {time:.4f} seconds")
```
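
Timings like these are typically implemented with CUDA events. For reference, the same kind of measurement can be taken directly with CuPy's `cupy.cuda.Event` API; this is a minimal sketch that is independent of the profiler helpers above:

```python
# Event-based GPU timing with the core CuPy API (get_elapsed_time returns milliseconds)
start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()
data = cp.random.rand(5000, 5000)
processed = cp.sin(data) * cp.cos(data)
end.record()
end.synchronize()  # wait until all work recorded before `end` has finished

elapsed_ms = cp.cuda.get_elapsed_time(start, end)
print(f"Elapsed GPU time: {elapsed_ms / 1000:.4f} seconds")
```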

### Memory Profiling

```python
# Memory usage analysis
with profiler.memory_profile() as mem_prof:
    # Allocate large arrays
    arrays = []
    for i in range(10):
        arr = cp.random.rand(1000, 1000)
        arrays.append(arr)

    # Perform operations that may fragment memory
    results = []
    for arr in arrays:
        processed = cp.fft.fft2(arr)
        filtered = cp.abs(processed) > 0.5
        results.append(cp.sum(filtered))

    # Clean up some of the arrays
    del arrays[:5]

print(f"Peak memory usage: {mem_prof.peak_memory / 1024**3:.2f} GB")
mem_prof.print_memory_report()

# Memory trace analysis
profiler.trace_memory(True)

# Operations to trace
large_array = cp.zeros((10000, 10000))
temp_arrays = [cp.random.rand(1000, 1000) for _ in range(50)]
del temp_arrays  # Free memory

# Get memory trace
trace = profiler.get_memory_trace()
print(f"Number of memory operations: {len(trace)}")

profiler.trace_memory(False)
```
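
Independent of the helpers above, current memory figures can also be read from CuPy's default memory pool and from the device itself; a short sketch using the core CuPy API:

```python
# Query CuPy's default memory pool and the device for current memory usage
pool = cp.get_default_memory_pool()
print(f"Pool used:  {pool.used_bytes() / 1024**2:.1f} MiB")
print(f"Pool total: {pool.total_bytes() / 1024**2:.1f} MiB")

free_bytes, total_bytes = cp.cuda.Device().mem_info
print(f"Device free/total: {free_bytes / 1024**3:.2f} / {total_bytes / 1024**3:.2f} GiB")

# Return cached blocks to the driver if needed
pool.free_all_blocks()
```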

### NVTX Profiling Integration

```python
# Using NVTX markers for external profiling tools
@profiler.nvtx("matrix_operations", color=0xFF0000)
def complex_matrix_operations(data):
    """Function with NVTX profiling markers."""
    with profiler.NVTXRange("preprocessing", color=0x00FF00):
        # Data preprocessing
        normalized = (data - cp.mean(data)) / cp.std(data)
        scaled = normalized * 2.0

    with profiler.NVTXRange("computation", color=0x0000FF):
        # Main computation (CuPy provides eigvalsh for symmetric matrices)
        result = cp.linalg.matrix_power(scaled, 3)
        eigenvals = cp.linalg.eigvalsh(result + result.T)

    with profiler.NVTXRange("postprocessing", color=0xFFFF00):
        # Postprocessing
        sorted_vals = cp.sort(eigenvals)
        final_result = cp.real(sorted_vals)

    return final_result

# Use the profiled function
test_matrix = cp.random.rand(500, 500, dtype=cp.float32)
result = complex_matrix_operations(test_matrix)

# Manual NVTX markers
profiler.nvtx_mark("Starting algorithm", color=0xFF00FF)

profiler.nvtx_push("Algorithm Phase 1", color=0x00FFFF)
# Phase 1 operations
phase1_data = cp.random.rand(1000, 1000)
phase1_result = cp.sum(phase1_data, axis=0)
profiler.nvtx_pop()

profiler.nvtx_push("Algorithm Phase 2", color=0xFF8000)
# Phase 2 operations
phase2_result = cp.cumsum(phase1_result)
profiler.nvtx_pop()

profiler.nvtx_mark("Algorithm completed", color=0x8000FF)
```
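
CuPy also exposes NVTX markers at a lower level through `cupy.cuda.nvtx` (available when NVTX support is built in). A minimal sketch whose ranges show up when the script runs under an external profiler such as Nsight Systems:

```python
from cupy.cuda import nvtx

# Low-level NVTX markers; visible e.g. under `nsys profile python script.py`
nvtx.RangePush("phase-1")
phase1 = cp.sum(cp.random.rand(1000, 1000), axis=0)
nvtx.RangePop()

nvtx.Mark("phase-1 done")
```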

### Comprehensive Profiling

```python
# Full profiling session
with profiler.profile() as prof:
    # Data preparation
    print("Preparing data...")
    data_size = 8192
    matrix_a = cp.random.rand(data_size, data_size, dtype=cp.float32)
    matrix_b = cp.random.rand(data_size, data_size, dtype=cp.float32)

    # Matrix multiplication
    print("Performing matrix multiplication...")
    result_mm = cp.dot(matrix_a, matrix_b)

    # FFT operations
    print("Performing FFT...")
    fft_data = (cp.random.rand(data_size, data_size)
                + 1j * cp.random.rand(data_size, data_size)).astype(cp.complex64)
    fft_result = cp.fft.fft2(fft_data)

    # Reduction operations
    print("Performing reductions...")
    sum_result = cp.sum(result_mm)
    mean_result = cp.mean(fft_result)

    # Linear algebra (eigvalsh expects a symmetric matrix)
    print("Performing linear algebra...")
    smaller_matrix = matrix_a[:1000, :1000]
    eigenvals = cp.linalg.eigvalsh(smaller_matrix + smaller_matrix.T)

    # Custom elementwise kernel
    print("Running custom kernel...")
    magnitude_kernel = cp.ElementwiseKernel(
        'T x, T y', 'T z', 'z = sqrt(x * x + y * y)', 'magnitude'
    )
    mag_result = magnitude_kernel(matrix_a, matrix_b)

# Print comprehensive report
prof.print_report()

# Save report to file
prof.save_report("profiling_report.txt")
```
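
In practice, a `profile()`-style context manager is most useful together with an external profiler, so that only the region of interest is captured. A sketch of that pattern (the `nsys` command line is one example of such a profiler invocation):

```python
# Run the script under an external profiler and capture only this region, e.g.:
#   nsys profile --capture-range=cudaProfilerApi python script.py
with profiler.profile():
    region_result = cp.dot(matrix_a, matrix_b)
```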

### Kernel Performance Analysis

```python
# Analyze custom kernel performance
kernel_code = r'''
extern "C" __global__
void optimized_reduction(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}
'''

reduction_kernel = cp.RawKernel(kernel_code, 'optimized_reduction')

# Analyze kernel performance
with profiler.kernel_profile() as kernel_prof:
    input_data = cp.random.rand(1000000, dtype=cp.float32)
    block_size = 256
    grid_size = (input_data.size + block_size - 1) // block_size
    output = cp.zeros(grid_size, dtype=cp.float32)

    # Launch the kernel multiple times (pass n as a 32-bit integer to match `int n`)
    for _ in range(100):
        reduction_kernel(
            (grid_size,),
            (block_size,),
            (input_data, output, cp.int32(input_data.size)),
            shared_mem=block_size * 4
        )

kernel_prof.print_kernel_report()

# Occupancy analysis
occupancy_info = profiler.analyze_occupancy(
    reduction_kernel,
    block_size=256,
    shared_mem=256 * 4
)

print("Occupancy Analysis:")
print(f"Theoretical occupancy: {occupancy_info['theoretical_occupancy']:.2%}")
print(f"Blocks per SM: {occupancy_info['blocks_per_sm']}")
print(f"Threads per SM: {occupancy_info['threads_per_sm']}")
```
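
On recent CuPy versions, some per-kernel resource information is also available directly from the compiled `RawKernel` object, which complements an occupancy analysis; a short sketch:

```python
# Inspect the compiled kernel's resource usage as reported by the CUDA driver
print(f"Max threads per block: {reduction_kernel.max_threads_per_block}")
print(f"Registers per thread:  {reduction_kernel.num_regs}")
print(f"Static shared memory:  {reduction_kernel.shared_size_bytes} bytes")

# Full attribute dictionary
for attr_name, value in reduction_kernel.attributes.items():
    print(f"{attr_name}: {value}")
```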

### Comparative Benchmarking

```python
# Compare different matrix multiplication implementations
def cupy_dot(a, b):
    return cp.dot(a, b)

def cupy_matmul(a, b):
    return cp.matmul(a, b)

def cupy_einsum(a, b):
    return cp.einsum('ij,jk->ik', a, b)

# Prepare test matrices
size = 2048
a = cp.random.rand(size, size, dtype=cp.float32)
b = cp.random.rand(size, size, dtype=cp.float32)

# Compare implementations
comparison = profiler.compare_implementations(
    cupy_dot, cupy_matmul, cupy_einsum,
    args=(a, b),
    names=['cp.dot', 'cp.matmul', 'cp.einsum']
)

print("Performance Comparison:")
for name, stats in comparison.items():
    print(f"{name:12}: {stats['mean']:.4f}s ± {stats['std']:.4f}s")

# Parameter sweep for the optimal block size
def custom_kernel_test(data, block_size):
    # Placeholder for a custom kernel with a configurable block size
    threads_per_block = block_size
    blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

    result = cp.zeros(blocks_per_grid)
    # A kernel launch with (blocks_per_grid,), (threads_per_block,) would go here
    return result

data = cp.random.rand(1000000)
param_ranges = {'block_size': [64, 128, 256, 512, 1024]}

sweep_results = profiler.parameter_sweep(
    custom_kernel_test,
    param_ranges,
    fixed_args=(data,)
)

print("Parameter Sweep Results:")
for block_size, timing in sweep_results.items():
    print(f"Block size {block_size}: {timing['mean']:.4f}s")
```
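
The same comparison can be expressed as a plain loop over the `benchmark` helper, which is convenient when only that entry point is needed; a sketch reusing the functions and matrices defined above:

```python
# Manual comparison using benchmark() directly
candidates = {'cp.dot': cupy_dot, 'cp.matmul': cupy_matmul, 'cp.einsum': cupy_einsum}
for name, func in candidates.items():
    stats = profiler.benchmark(func, args=(a, b), n_warmup=3, n_repeat=10)
    print(f"{name:12}: {stats['mean']:.4f}s ± {stats['std']:.4f}s")
```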

### Scaling Analysis

```python
# Analyze how performance scales with data size
def scaling_test_function(size):
    # Build a complex64 input of the requested size, then FFT and reduce
    data = (cp.random.rand(size) + 1j * cp.random.rand(size)).astype(cp.complex64)
    result = cp.fft.fft(data)
    magnitude = cp.abs(result)
    return cp.sum(magnitude)

# Test with different data sizes
data_sizes = [1000, 5000, 10000, 50000, 100000, 500000, 1000000]

scaling_results = profiler.scaling_analysis(
    scaling_test_function,
    data_sizes
)

print("Scaling Analysis:")
print("Size\t\tTime (s)\tThroughput (MB/s)")
for size, stats in scaling_results.items():
    throughput = (size * 8) / (stats['mean'] * 1024**2)  # complex64 = 8 bytes per element
    print(f"{size:8}\t{stats['mean']:.4f}\t\t{throughput:.2f}")

# Memory bandwidth test
def memory_bandwidth_test(size):
    """Test memory bandwidth with different array sizes."""
    data = cp.random.rand(size, dtype=cp.float32)
    return cp.sum(data)

memory_sizes = [10**i for i in range(4, 8)]  # 10K to 10M elements

bandwidth_results = profiler.scaling_analysis(
    memory_bandwidth_test,
    memory_sizes
)

print("\nMemory Bandwidth Analysis:")
for size, stats in bandwidth_results.items():
    bandwidth_gbps = (size * 4) / (stats['mean'] * 1024**3)  # float32 = 4 bytes per element
    print(f"Size: {size:8} elements, Bandwidth: {bandwidth_gbps:.2f} GB/s")
```
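
A least-squares fit of log-time against log-size turns these measurements into an empirical scaling exponent, which is easier to compare against the expected complexity. A sketch assuming `scaling_results` maps each size to its timing statistics, as in the loop above:

```python
import numpy as np

# Fit log(time) = slope * log(size) + intercept; slope near 1 is linear, near 2 quadratic
sizes = np.array(sorted(scaling_results))
times = np.array([scaling_results[s]['mean'] for s in sizes])
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"Empirical scaling exponent: {slope:.2f}")
```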

### Advanced Profiling Workflows

```python
import contextlib

# Production profiling workflow
class ProductionProfiler:
    def __init__(self, enable_profiling=True):
        self.enable_profiling = enable_profiling
        self.profiles = {}

    def profile_section(self, name):
        """Context manager for profiling code sections."""
        if not self.enable_profiling:
            # No-op context manager when profiling is disabled
            return contextlib.nullcontext()
        return profiler.time_range()

    def benchmark_operation(self, name, func, *args, **kwargs):
        """Benchmark a specific operation."""
        if not self.enable_profiling:
            return func(*args, **kwargs)

        with profiler.time_range() as timer:
            result = func(*args, **kwargs)

        self.profiles[name] = timer.elapsed_time
        return result

    def print_summary(self):
        """Print profiling summary."""
        if not self.profiles:
            print("No profiling data collected")
            return

        print("Performance Summary:")
        print("-" * 40)
        total_time = sum(self.profiles.values())

        for name, time in sorted(self.profiles.items(), key=lambda x: x[1], reverse=True):
            percentage = (time / total_time) * 100
            print(f"{name:25}: {time:.4f}s ({percentage:.1f}%)")

        print("-" * 40)
        print(f"Total time: {total_time:.4f}s")

# Use the production profiler
profiler_instance = ProductionProfiler(enable_profiling=True)

# Profile different operations
data = profiler_instance.benchmark_operation(
    "data_generation",
    cp.random.rand,
    5000, 5000
)

fft_result = profiler_instance.benchmark_operation(
    "fft_computation",
    cp.fft.fft2,
    data
)

with profiler_instance.profile_section("postprocessing") as timer:
    magnitude = cp.abs(fft_result)
    result = cp.sum(magnitude)
profiler_instance.profiles["postprocessing"] = timer.elapsed_time

# Print comprehensive summary
profiler_instance.print_summary()
```
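
The same bookkeeping can be packaged as a decorator so that individual functions record their own timings. A small sketch built on `time_range` and the `profiler_instance` defined above; the `profiled` helper is a hypothetical convenience wrapper, not part of the profiler API:

```python
import functools

def profiled(name, prof=profiler_instance):
    """Record the wrapped function's elapsed time under `name` (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with profiler.time_range() as timer:
                result = func(*args, **kwargs)
            prof.profiles[name] = timer.elapsed_time
            return result
        return wrapper
    return decorator

@profiled("normalization")
def normalize(x):
    return (x - cp.mean(x)) / cp.std(x)

normalized = normalize(data)
profiler_instance.print_summary()
```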

Performance profiling in CuPy provides the tools needed to optimize GPU applications: identifying bottlenecks, measuring execution characteristics, and keeping GPU resources fully utilized across different computational workloads and hardware configurations.