# CUDA Integration

CuPy exposes CUDA directly for advanced GPU programming: device management, memory operations, kernel compilation and launch, stream processing, and low-level access to the CUDA driver and runtime APIs.

## Capabilities

### Device Management

Core CUDA device management for controlling GPU devices and execution contexts.

```python { .api }
class Device:
    """
    CUDA device context manager.

    This class provides a convenient interface for managing CUDA device
    contexts and switching between multiple GPUs.
    """
    def __init__(self, device=None):
        """
        Parameters:
            device: int or Device, optional - CUDA device ID or Device object
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def use(self):
        """Use this device for subsequent operations."""

    @property
    def id(self):
        """Get the device ID."""

def get_device_id():
    """
    Get the current CUDA device ID.
    """

def get_cublas_handle():
    """
    Get the cuBLAS handle for the current device.
    """

def synchronize():
    """
    Synchronize the current device.
    """

def is_available():
    """
    Check if CUDA is available.
    """
```
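
For instance, a minimal sketch of device selection and synchronization (assuming at least one visible GPU):

```python
import cupy as cp

with cp.cuda.Device(0):             # make device 0 current inside the block
    x = cp.arange(10) ** 2          # allocated and computed on device 0
    cp.cuda.Device().synchronize()  # wait until device 0 is idle

print(cp.cuda.runtime.getDeviceCount())  # number of visible devices
```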

### Memory Management

Comprehensive memory management for GPU device memory, including allocators and memory pools.

```python { .api }
def alloc(size):
    """
    Allocate device memory and return a MemoryPointer to it.

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_managed(size):
    """
    Allocate managed memory (Unified Memory).

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_async(size):
    """
    Allocate memory asynchronously.

    Parameters:
        size: int - Size in bytes to allocate
    """

class BaseMemory:
    """
    Base class for memory objects.

    This is the base class for all memory types in CuPy,
    providing a common interface for memory management.
    """
    def __init__(self, size): ...

    @property
    def ptr(self):
        """Get memory pointer."""

    @property
    def size(self):
        """Get memory size in bytes."""

class Memory(BaseMemory):
    """
    Device memory object.

    Represents a chunk of device memory allocated on the GPU.
    """

class ManagedMemory(BaseMemory):
    """
    Managed memory object.

    Represents unified memory accessible from both CPU and GPU.
    """

class MemoryAsync(BaseMemory):
    """
    Asynchronous memory object.

    Represents memory allocated asynchronously using memory pools.
    """

class MemoryPointer:
    """
    Pointer to a device memory region.

    This class represents a pointer to device memory and provides
    methods for accessing and manipulating memory contents.
    """
    def __init__(self, mem, offset): ...

    def copy_from_device(self, src, size): ...
    def copy_from_device_async(self, src, size, stream=None): ...
    def copy_from_host(self, mem, size): ...
    def copy_from_host_async(self, mem, size, stream=None): ...
    def copy_to_host(self, mem, size): ...
    def copy_to_host_async(self, mem, size, stream=None): ...
    def memset(self, value, size): ...
    def memset_async(self, value, size, stream=None): ...

class UnownedMemory:
    """
    Unowned memory reference.

    Represents a reference to memory that is not owned by this object,
    useful for wrapping external memory allocations.
    """
```
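
A minimal sketch of a raw round-trip through `MemoryPointer`; the `ctypes` wrapping of the NumPy buffer addresses is this example's choice, not a requirement of the API:

```python
import ctypes
import numpy as np
import cupy as cp

host = np.arange(16, dtype=np.float32)
nbytes = host.nbytes

dev = cp.cuda.alloc(nbytes)  # MemoryPointer to fresh device memory
dev.copy_from_host(ctypes.c_void_p(host.ctypes.data), nbytes)  # H2D copy
dev.memset(0, 4)             # zero the first four bytes (one float)

out = np.empty_like(host)
dev.copy_to_host(ctypes.c_void_p(out.ctypes.data), nbytes)     # D2H copy
assert out[0] == 0.0 and (out[1:] == host[1:]).all()
```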

### Memory Pools

Memory pooling systems for efficient memory allocation and reuse.

```python { .api }
class MemoryPool:
    """
    Memory pool for device memory.

    Memory pools reduce allocation overhead by reusing previously
    allocated memory blocks.
    """
    def __init__(self, allocator=None):
        """
        Parameters:
            allocator: function, optional - Custom allocator function
        """

    def malloc(self, size): ...
    def free(self, mem): ...
    def free_all_blocks(self): ...
    def free_all_free(self): ...  # deprecated alias of free_all_blocks()
    def n_free_blocks(self): ...
    def used_bytes(self): ...
    def total_bytes(self): ...

class MemoryAsyncPool:
    """
    Asynchronous memory pool.

    Provides asynchronous memory allocation with stream ordering.
    """
    def __init__(self, allocator=None): ...

def set_allocator(allocator):
    """
    Set the memory allocator.

    Parameters:
        allocator: function or Allocator - Memory allocator to use
    """

def get_allocator():
    """
    Get the current memory allocator.
    """

class PythonFunctionAllocator:
    """
    Memory allocator using Python functions.

    Wraps a pair of Python functions to provide custom memory
    allocation and deallocation.
    """
    def __init__(self, malloc_func, free_func): ...

class CFunctionAllocator:
    """
    Memory allocator using a C function.

    Wraps a C function pointer for memory allocation.
    """
    def __init__(self, func, arg): ...

def using_allocator(allocator=None):
    """
    Context manager for temporarily using a different allocator.

    Parameters:
        allocator: Allocator, optional - Allocator to use temporarily
    """
```
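
As one example, the pattern from the CuPy documentation for backing the default allocator with a managed-memory pool (sketch; call `set_allocator()` again to restore the default afterwards):

```python
import cupy as cp

pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)  # pool of unified memory
cp.cuda.set_allocator(pool.malloc)                 # route allocations through it

x = cp.ones(10**6)  # now served from the managed pool
print(pool.used_bytes(), pool.total_bytes())
```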

### Pinned Memory

Host-side pinned memory management for efficient host-device transfers.

```python { .api }
def alloc_pinned_memory(size):
    """
    Allocate pinned host memory and return a PinnedMemoryPointer.

    Parameters:
        size: int - Size in bytes to allocate
    """

class PinnedMemory:
    """
    Pinned host memory object.

    Represents page-locked host memory that can be accessed
    by the GPU for faster transfers.
    """
    def __init__(self, size): ...

class PinnedMemoryPointer:
    """
    Pointer to a pinned memory region.

    Provides an interface for accessing pinned memory contents,
    including the Python buffer protocol.
    """
    def __init__(self, mem, offset, size, owner): ...

class PinnedMemoryPool:
    """
    Memory pool for pinned memory.

    Manages allocation and reuse of pinned host memory.
    """
    def __init__(self, allocator=None): ...
    def malloc(self, size): ...
    def free(self, mem): ...

def set_pinned_memory_allocator(allocator):
    """
    Set the pinned memory allocator.

    Parameters:
        allocator: function - Pinned memory allocator function
    """
```
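
A short sketch of the usual pinned-transfer pattern: `alloc_pinned_memory()` returns a pointer that supports the buffer protocol, so NumPy can view it directly:

```python
import numpy as np
import cupy as cp

n = 1024
pinned = cp.cuda.alloc_pinned_memory(n * 8)   # page-locked host bytes
host = np.frombuffer(pinned, np.float64, n)   # NumPy view over pinned memory
host[...] = np.random.rand(n)

stream = cp.cuda.Stream()
dev = cp.empty(n, np.float64)
dev.set(host, stream=stream)                  # async H2D copy from pinned memory
stream.synchronize()
```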

### Streams and Events

CUDA streams and events for managing asynchronous operations and synchronization.

```python { .api }
class Stream:
    """
    CUDA stream for asynchronous operations.

    Streams allow operations to be executed asynchronously and
    can be used to overlap computation and memory transfers.
    """
    def __init__(self, null=False, non_blocking=False, ptds=False):
        """
        Parameters:
            null: bool, optional - Use the default stream if True
            non_blocking: bool, optional - Create a non-blocking stream
            ptds: bool, optional - Use per-thread default stream
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def synchronize(self): ...
    def add_callback(self, callback, arg): ...
    def record(self, event=None): ...
    def wait_event(self, event): ...
    def begin_capture(self): ...
    def end_capture(self):
        """End stream capture and return the resulting Graph."""

    @property
    def ptr(self):
        """Get the stream pointer."""

class ExternalStream:
    """
    Wrapper for an external CUDA stream.

    Allows integration with CUDA streams created outside of CuPy.
    """
    def __init__(self, ptr): ...

def get_current_stream():
    """
    Get the current CUDA stream.
    """

class Event:
    """
    CUDA event for synchronization.

    Events provide a way to monitor the progress of operations
    and synchronize between different streams.
    """
    def __init__(self, block=False, disable_timing=False, interprocess=False):
        """
        Parameters:
            block: bool, optional - Use blocking synchronization
            disable_timing: bool, optional - Disable timing measurements
            interprocess: bool, optional - Enable interprocess usage
        """

    def record(self, stream=None): ...
    def synchronize(self): ...
    def query(self): ...
    def elapsed_time(self, end_event): ...

def get_elapsed_time(start_event, end_event):
    """
    Get the elapsed time in milliseconds between two recorded events.

    Parameters:
        start_event: Event - Start event
        end_event: Event - End event
    """
```
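
A brief sketch of `ExternalStream`: the raw handle would normally come from another library (the PyTorch call in the comment is illustrative); the stand-in below just borrows a CuPy stream's pointer:

```python
import cupy as cp

# In real code the handle comes from elsewhere, e.g.
#   raw_ptr = torch.cuda.current_stream().cuda_stream
owner = cp.cuda.Stream()   # keeps the stand-in stream alive
raw_ptr = owner.ptr

ext = cp.cuda.ExternalStream(raw_ptr)
with ext:                  # CuPy work is now issued on that stream
    y = cp.arange(10) * 2
ext.synchronize()
```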

### CUDA Graphs

CUDA graphs for optimizing sequences of operations.

```python { .api }
class Graph:
    """
    CUDA graph for capturing and replaying operation sequences.

    Graphs allow capturing a sequence of CUDA operations and
    replaying them efficiently with reduced launch overhead.

    Instances are normally obtained through stream capture
    (Stream.begin_capture() / Stream.end_capture()) rather than
    constructed directly.
    """
    def launch(self, stream=None): ...
    def upload(self, stream=None): ...
    def debug_dot_print(self, path): ...
```
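
In practice a `Graph` is obtained by capturing work on a non-default stream; a minimal sketch:

```python
import cupy as cp

x = cp.ones((1000, 1000))
s = cp.cuda.Stream(non_blocking=True)

with s:
    s.begin_capture()
    y = x @ x                 # recorded into the graph, not executed yet
    graph = s.end_capture()   # returns a cupy.cuda.Graph

graph.launch(stream=s)        # replay with minimal launch overhead
s.synchronize()
```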

### Kernels and Modules

CUDA kernel compilation and execution management.

```python { .api }
class Function:
    """
    CUDA function object.

    Represents a compiled CUDA kernel function that can be launched
    with specified grid and block dimensions.
    """
    def __init__(self, module, name):
        """
        Parameters:
            module: Module - CUDA module containing the function
            name: str - Function name
        """

    def __call__(self, grid, block, args, **kwargs): ...

    @property
    def attributes(self):
        """Get function attributes."""

class Module:
    """
    CUDA module containing compiled device code.

    Modules contain one or more CUDA kernels and can be loaded
    from PTX or CUBIN code.
    """
    def __init__(self): ...

    def get_function(self, name): ...
    def get_global(self, name): ...
    def get_texref(self, name): ...

    def load_file(self, filename): ...
    def load(self, cubin): ...
```
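
For most users, the `cupy.RawModule` wrapper is the convenient entry point to this machinery; a small sketch:

```python
import cupy as cp

code = r'''
extern "C" __global__ void scale(float* x, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''
mod = cp.RawModule(code=code)      # compiled with NVRTC on first use
scale = mod.get_function('scale')  # kernel handle

x = cp.ones(256, dtype=cp.float32)
scale((1,), (256,), (x, cp.float32(2.0), cp.int32(256)))  # x *= 2
```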

### Memory Hooks

Hooks for monitoring and controlling memory allocation behavior.

```python { .api }
class MemoryHook:
    """
    Base class for memory allocation hooks.

    Memory hooks allow monitoring and customization of memory
    allocation and deallocation operations. A hook instance is
    activated with a `with` statement, and each callback receives
    keyword arguments such as device_id, mem_size, and mem_ptr.
    """
    def alloc_preprocess(self, **kwargs): ...
    def alloc_postprocess(self, **kwargs): ...
    def malloc_preprocess(self, **kwargs): ...
    def malloc_postprocess(self, **kwargs): ...
    def free_preprocess(self, **kwargs): ...
    def free_postprocess(self, **kwargs): ...
```
427
428
### Profiling and Debugging
429
430
Tools for profiling and debugging CUDA applications.
431
432
```python { .api }
433
def profile():
434
"""
435
Context manager for CUDA profiling (deprecated).
436
437
Note: This is deprecated. Use cupyx.profiler.profile() instead.
438
"""
439
```
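
A sketch of the replacement API in `cupyx.profiler`, which times a function on both CPU and GPU:

```python
import cupy as cp
from cupyx.profiler import benchmark

def f(a):
    return cp.linalg.norm(a)

a = cp.random.rand(1000, 1000)
print(benchmark(f, (a,), n_repeat=20))  # per-run CPU and GPU timings
```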

### Environment Information

Functions for querying CUDA runtime and environment information.

```python { .api }
def get_local_runtime_version():
    """
    Get the locally installed CUDA runtime version.
    """

def get_cuda_path():
    """
    Get the CUDA installation path.
    """

def get_nvcc_path():
    """
    Get the path to the nvcc compiler.
    """

def get_rocm_path():
    """
    Get the ROCm installation path (for AMD GPUs).
    """

def get_hipcc_path():
    """
    Get the path to the hipcc compiler (for AMD GPUs).
    """
```
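
For example (output values depend on the local installation; these lookups can return None when a component is not found):

```python
import cupy as cp

print(cp.cuda.get_local_runtime_version())  # e.g. 12020 for CUDA 12.2
print(cp.cuda.get_cuda_path())              # CUDA toolkit root directory
print(cp.cuda.get_nvcc_path())              # path to the nvcc binary
```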

### Low-level API Access

Access to low-level CUDA APIs for advanced users.

```python { .api }
# CUDA Driver API
driver = cupy.cuda.driver

# CUDA Runtime API
runtime = cupy.cuda.runtime

# NVRTC Compiler API
nvrtc = cupy.cuda.nvrtc

# Backend library wrappers (lazy-loaded)
cublas = cupy.cuda.cublas      # cuBLAS operations
cusolver = cupy.cuda.cusolver  # cuSOLVER linear algebra
cusparse = cupy.cuda.cusparse  # cuSPARSE sparse operations
curand = cupy.cuda.curand      # cuRAND random numbers
nvtx = cupy.cuda.nvtx          # NVTX profiling markers
```
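
A short sketch combining runtime queries with NVTX range markers (the ranges show up in Nsight Systems timelines):

```python
import cupy as cp
from cupy.cuda import nvtx, runtime

print(runtime.getDeviceCount())     # visible CUDA devices
free, total = runtime.memGetInfo()  # bytes free/total on the current device

nvtx.RangePush('fft-block')         # open a named profiling range
y = cp.fft.fft(cp.random.rand(1 << 20))
nvtx.RangePop()                     # close it
```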

## Usage Examples

```python
import numpy as np
import cupy as cp
import cupy.cuda as cuda

# Device management
print(f"Current device: {cuda.get_device_id()}")
print(f"CUDA available: {cuda.is_available()}")

# Using specific devices
with cuda.Device(0):
    # Operations on device 0
    x = cp.array([1, 2, 3])

with cuda.Device(1):  # If multiple GPUs are available
    # Operations on device 1
    y = cp.array([4, 5, 6])

# Memory management
# Direct memory allocation; alloc() already returns a MemoryPointer
ptr = cuda.alloc(1024)  # Allocate 1 KB of device memory

# Using memory pools (recommended)
pool = cuda.MemoryPool()
with cuda.using_allocator(pool.malloc):
    # All allocations inside this block use the pool
    large_array = cp.zeros((10000, 10000))

# Memory pool statistics
print(f"Used memory: {pool.used_bytes()} bytes")
print(f"Total memory: {pool.total_bytes()} bytes")

# Stream management for asynchronous operations
stream1 = cuda.Stream()
stream2 = cuda.Stream()

with stream1:
    # Operations issued on stream1
    a = cp.random.rand(1000, 1000)
    b = cp.random.rand(1000, 1000)

with stream2:
    # Operations issued on stream2 (can overlap with stream1)
    c = cp.random.rand(1000, 1000)
    d = cp.random.rand(1000, 1000)

# Synchronization
stream1.synchronize()  # Wait for stream1 to complete
stream2.synchronize()  # Wait for stream2 to complete

# Event-based synchronization
event = cuda.Event()
with stream1:
    result1 = cp.dot(a, b)
    event.record()  # Record completion of stream1's work

with stream2:
    stream2.wait_event(event)  # Wait for stream1's operations
    result2 = cp.dot(c, d) + result1  # Safe to use result1 now

# Measuring execution time with events
start_event = cuda.Event()
end_event = cuda.Event()

start_event.record()
large_computation = cp.dot(cp.random.rand(5000, 5000),
                           cp.random.rand(5000, 5000))
end_event.record()
end_event.synchronize()

elapsed_ms = cuda.get_elapsed_time(start_event, end_event)
print(f"Computation took {elapsed_ms} ms")

# Pinned (page-locked) host memory for faster transfers;
# the pinned pointer supports the buffer protocol, so NumPy can view it
pinned_mem = cuda.alloc_pinned_memory(1000 * 8)  # room for 1000 float64s
pinned_array = np.frombuffer(pinned_mem, dtype=np.float64, count=1000)

# Custom kernel example using RawKernel
kernel_code = r'''
extern "C" __global__
void vector_add(float* x, float* y, float* z, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        z[tid] = x[tid] + y[tid];
    }
}
'''

kernel = cp.RawKernel(kernel_code, 'vector_add')

# Launch the custom kernel
n = 1000
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.zeros(n, dtype=cp.float32)

# Launch with an appropriate grid/block size; pass scalars with an
# explicit dtype so they match the kernel signature
threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
kernel((blocks_per_grid,), (threads_per_block,), (x, y, z, np.int32(n)))

# Memory hooks for monitoring: hook callbacks receive keyword arguments,
# and a hook is active inside a `with` block
class MemoryTracker(cuda.MemoryHook):
    name = 'MemoryTracker'

    def __init__(self):
        self.allocated_bytes = 0
        self.freed_bytes = 0

    def alloc_postprocess(self, **kwargs):
        self.allocated_bytes += kwargs['mem_size']
        print(f"Allocated {kwargs['mem_size']} bytes")

    def free_preprocess(self, **kwargs):
        self.freed_bytes += kwargs['mem_size']
        print(f"Freed {kwargs['mem_size']} bytes")

tracker = MemoryTracker()
with tracker:
    tmp = cp.arange(10**6)  # allocations inside the block are reported

# Working with CUDA graphs (obtained via stream capture)
graph_stream = cuda.Stream(non_blocking=True)
gx = cp.random.rand(1000, 1000)
gy = cp.random.rand(1000, 1000)

with graph_stream:
    graph_stream.begin_capture()
    gz = gx @ gy                         # captured, not executed yet
    graph = graph_stream.end_capture()   # returns a cuda.Graph

# Replay the captured graph multiple times with low launch overhead
for _ in range(10):
    graph.launch(stream=graph_stream)
graph_stream.synchronize()

# Multi-GPU computation example
def multi_gpu_computation(data_list):
    """Distribute computation across multiple GPUs."""
    n_gpus = cuda.runtime.getDeviceCount()
    streams = []
    results = []

    for i, data in enumerate(data_list[:n_gpus]):
        device_id = i % n_gpus
        with cuda.Device(device_id):
            stream = cuda.Stream()
            streams.append(stream)

            with stream:
                # Transfer data to this GPU
                gpu_data = cp.asarray(data)
                # Perform computation
                result = cp.sum(gpu_data ** 2)
                results.append(result)

    # Synchronize all streams
    for stream in streams:
        stream.synchronize()

    return results

# Memory bandwidth benchmark
def memory_bandwidth_test(size_mb=100):
    """Compare host-to-device transfer speed: pageable vs pinned memory."""
    import time

    size_bytes = size_mb * 1024 * 1024
    n = size_bytes // 8

    # Pageable (regular) host memory
    host_data = np.random.rand(n)
    start = time.time()
    gpu_data1 = cp.asarray(host_data)
    cuda.Device().synchronize()
    regular_time = time.time() - start

    # Pinned host memory: the source buffer is page-locked,
    # so the copy can use the faster DMA path
    pinned_mem = cuda.alloc_pinned_memory(size_bytes)
    pinned_data = np.frombuffer(pinned_mem, dtype=np.float64, count=n)
    pinned_data[...] = host_data

    start = time.time()
    gpu_data2 = cp.asarray(pinned_data)
    cuda.Device().synchronize()
    pinned_time = time.time() - start

    print(f"Pageable bandwidth: {size_mb / regular_time:.2f} MB/s")
    print(f"Pinned bandwidth:   {size_mb / pinned_time:.2f} MB/s")

# Advanced memory pool configuration
def configure_memory_pool():
    """Configure the memory pool for optimal performance."""
    # Get the default memory pool
    mempool = cp.get_default_memory_pool()

    # Optionally cap the pool size
    # mempool.set_limit(size=2**30)  # Limit to 1 GiB

    # Monitor memory usage
    print(f"Used bytes: {mempool.used_bytes()}")
    print(f"Total bytes: {mempool.total_bytes()}")

    # Release cached blocks that are not currently in use
    mempool.free_all_blocks()

    return mempool

# Context management for robust error handling
def safe_gpu_computation():
    """Example of robust GPU computation with proper cleanup."""
    stream = None
    temp_arrays = []

    try:
        stream = cuda.Stream()

        with stream:
            # Temporary arrays that need cleanup
            temp1 = cp.random.rand(10000, 10000)
            temp2 = cp.random.rand(10000, 10000)
            temp_arrays.extend([temp1, temp2])

            # Main computation
            result = temp1 @ temp2

        # Synchronize to ensure completion
        stream.synchronize()

        return result

    except Exception as e:
        print(f"GPU computation failed: {e}")
        return None

    finally:
        # Cleanup resources
        if stream is not None:
            stream.synchronize()

        # Drop references so the pool can reclaim the blocks
        temp_arrays.clear()
        cp.get_default_memory_pool().free_all_blocks()
```

## Performance Optimization Tips

### Memory Management

```python
# Use memory pools to reduce allocation overhead
with cuda.using_allocator(cp.get_default_memory_pool().malloc):
    # All allocations reuse memory from the pool
    data = cp.zeros((10000, 10000))

# Pre-allocate large arrays when possible
workspace = cp.zeros((10000, 10000))  # Reuse this array across steps

# Choose the memory type that matches the access pattern
regular_mem = cuda.alloc(1024)           # Device-only memory
managed_mem = cuda.malloc_managed(1024)  # Unified memory (CPU + GPU)
```

### Stream Optimization

```python
# Overlap computation and memory transfers
# (host_data, current_data, and process_current_data are placeholders)
compute_stream = cuda.Stream()
transfer_stream = cuda.Stream()

with transfer_stream:
    # Asynchronous memory transfer of the next batch
    next_data = cp.asarray(host_data)

with compute_stream:
    # Computation on the current batch proceeds in parallel
    result = process_current_data(current_data)
```

### Kernel Launch Optimization

```python
# Choose grid/block dimensions that cover all n elements
def optimal_launch_config(n, max_threads_per_block=1024):
    """Calculate a reasonable CUDA launch configuration."""
    if n <= max_threads_per_block:
        return (1, n)
    threads_per_block = max_threads_per_block
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    return (blocks_per_grid, threads_per_block)

grid, block = optimal_launch_config(1000000)
```

CuPy's CUDA integration covers the low-level essentials of GPU programming: explicit memory management, asynchronous execution, custom kernel development, and performance tooling, all while staying compatible with the broader CUDA ecosystem.