
# CUDA Integration

CuPy provides comprehensive CUDA integration for advanced GPU programming: direct device management, memory operations, kernel execution, stream processing, and low-level CUDA API access, all geared toward high-performance computing applications.

## Capabilities

### Device Management

Core CUDA device management for controlling GPU devices and execution contexts.

```python { .api }
class Device:
    """
    CUDA device context manager.

    This class provides a convenient interface for managing CUDA device
    contexts and switching between multiple GPUs.
    """
    def __init__(self, device=None):
        """
        Parameters:
            device: int or Device, optional - CUDA device ID or Device object
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def use(self):
        """Use this device for subsequent operations."""

    @property
    def id(self):
        """Get the device ID."""

def get_device_id():
    """
    Get the current CUDA device ID.
    """

def get_cublas_handle():
    """
    Get the cuBLAS handle for the current device.
    """

def synchronize():
    """
    Synchronize the current device.
    """

def is_available():
    """
    Check if CUDA is available.
    """
```

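A minimal usage sketch of the device API above, assuming at least one CUDA device is present:

```python
import cupy as cp
import cupy.cuda as cuda

if cuda.is_available():
    # Make device 0 current for the duration of the block
    with cuda.Device(0) as dev:
        print(f"Active device: {dev.id}")   # same value as cuda.get_device_id()
        handle = cuda.get_cublas_handle()   # cuBLAS handle bound to this device
        x = cp.arange(3)                    # allocated on device 0
    cuda.synchronize()                      # wait for pending work to finish
```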

### Memory Management

Comprehensive memory management for GPU device memory, including allocators and memory pools.

```python { .api }
def alloc(size):
    """
    Allocate device memory.

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_managed(size):
    """
    Allocate managed memory (Unified Memory).

    Parameters:
        size: int - Size in bytes to allocate
    """

def malloc_async(size):
    """
    Allocate memory asynchronously.

    Parameters:
        size: int - Size in bytes to allocate
    """

class BaseMemory:
    """
    Base class for memory objects.

    This is the base class for all memory types in CuPy,
    providing a common interface for memory management.
    """
    def __init__(self, size): ...

    @property
    def ptr(self):
        """Get memory pointer."""

    @property
    def size(self):
        """Get memory size in bytes."""

class Memory(BaseMemory):
    """
    Device memory object.

    Represents a chunk of device memory allocated on the GPU.
    """

class ManagedMemory(BaseMemory):
    """
    Managed memory object.

    Represents unified memory accessible from both CPU and GPU.
    """

class MemoryAsync(BaseMemory):
    """
    Asynchronous memory object.

    Represents memory allocated asynchronously using memory pools.
    """

class MemoryPointer:
    """
    Pointer to a device memory region.

    This class represents a pointer to device memory and provides
    methods for accessing and manipulating memory contents.
    """
    def __init__(self, mem, offset, size, owner=None): ...

    def copy_from_device(self, src, size): ...
    def copy_from_device_async(self, src, size, stream=None): ...
    def copy_from_host(self, mem, size): ...
    def copy_from_host_async(self, mem, size, stream=None): ...
    def copy_to_host(self, mem, size): ...
    def copy_to_host_async(self, mem, size, stream=None): ...
    def memset(self, value, size): ...
    def memset_async(self, value, size, stream=None): ...

class UnownedMemory:
    """
    Unowned memory reference.

    Represents a reference to memory that is not owned by this object,
    useful for wrapping external memory allocations.
    """
```

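Raw allocations are most useful when wrapped in an array. A short sketch, assuming (as in current CuPy) that `alloc` returns a `MemoryPointer` that `cupy.ndarray` accepts via its `memptr` argument:

```python
import cupy as cp
import cupy.cuda as cuda

n = 1024
memptr = cuda.alloc(n * 4)  # raw device allocation for n float32 values

# View the raw memory as an array; the pointer keeps the allocation alive
arr = cp.ndarray((n,), dtype=cp.float32, memptr=memptr)
arr.fill(1.0)

# Or operate on the buffer directly through the pointer
memptr.memset(0, n * 4)
```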

### Memory Pools

Memory pooling systems for efficient memory allocation and reuse.

```python { .api }
class MemoryPool:
    """
    Memory pool for device memory.

    Memory pools reduce allocation overhead by reusing previously
    allocated memory blocks.
    """
    def __init__(self, allocator=None):
        """
        Parameters:
            allocator: function, optional - Custom allocator function
        """

    def malloc(self, size): ...
    def free(self, mem): ...
    def free_all_blocks(self): ...
    def free_all_free(self): ...
    def n_free_blocks(self): ...
    def used_bytes(self): ...
    def total_bytes(self): ...

class MemoryAsyncPool:
    """
    Asynchronous memory pool.

    Provides asynchronous memory allocation with stream ordering.
    """
    def __init__(self, allocator=None): ...

def set_allocator(allocator):
    """
    Set the memory allocator.

    Parameters:
        allocator: function or Allocator - Memory allocator to use
    """

def get_allocator():
    """
    Get the current memory allocator.
    """

class PythonFunctionAllocator:
    """
    Memory allocator using a Python function.

    Wraps a Python function to provide custom memory allocation.
    """
    def __init__(self, func, arg): ...

class CFunctionAllocator:
    """
    Memory allocator using a C function.

    Wraps a C function pointer for memory allocation.
    """
    def __init__(self, func, arg): ...

def using_allocator(allocator=None):
    """
    Context manager for temporarily using a different allocator.

    Parameters:
        allocator: Allocator, optional - Allocator to use temporarily
    """
```

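A sketch of installing a pool as the process-wide allocator via `set_allocator` and inspecting its statistics:

```python
import cupy as cp
import cupy.cuda as cuda

pool = cuda.MemoryPool()
cuda.set_allocator(pool.malloc)  # route all device allocations through the pool

a = cp.ones((1000, 1000), dtype=cp.float32)
print(f"used: {pool.used_bytes()} B, total: {pool.total_bytes()} B, "
      f"free blocks: {pool.n_free_blocks()}")

del a
pool.free_all_blocks()  # return cached blocks to the driver
```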

### Pinned Memory

Host-side pinned memory management for efficient host-device transfers.

```python { .api }
def alloc_pinned_memory(size):
    """
    Allocate pinned host memory.

    Parameters:
        size: int - Size in bytes to allocate
    """

class PinnedMemory:
    """
    Pinned host memory object.

    Represents page-locked host memory that can be accessed
    by the GPU for faster transfers.
    """
    def __init__(self, size): ...

class PinnedMemoryPointer:
    """
    Pointer to a pinned memory region.

    Provides an interface for accessing pinned memory contents.
    """
    def __init__(self, mem, offset, size, owner): ...

class PinnedMemoryPool:
    """
    Memory pool for pinned memory.

    Manages allocation and reuse of pinned host memory.
    """
    def __init__(self, allocator=None): ...
    def malloc(self, size): ...
    def free(self, mem): ...

def set_pinned_memory_allocator(allocator):
    """
    Set the pinned memory allocator.

    Parameters:
        allocator: function - Pinned memory allocator function
    """
```

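Pinned memory is host memory, so the usual pattern is to expose it as a NumPy array through the buffer protocol and stage data there before an asynchronous transfer. A sketch:

```python
import numpy as np
import cupy as cp
import cupy.cuda as cuda

n = 1000
pinned = cuda.alloc_pinned_memory(n * 8)                 # page-locked host buffer
host = np.frombuffer(pinned, dtype=np.float64, count=n)  # zero-copy NumPy view
host[...] = np.random.rand(n)

stream = cuda.Stream()
dev = cp.empty(n, dtype=cp.float64)
dev.set(host, stream=stream)  # H2D copy can proceed asynchronously from pinned memory
stream.synchronize()
```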

### Streams and Events

CUDA streams and events for managing asynchronous operations and synchronization.

```python { .api }
class Stream:
    """
    CUDA stream for asynchronous operations.

    Streams allow operations to be executed asynchronously and
    can be used to overlap computation and memory transfers.
    """
    def __init__(self, null=False, non_blocking=False, ptds=False):
        """
        Parameters:
            null: bool, optional - Use the default stream if True
            non_blocking: bool, optional - Create a non-blocking stream
            ptds: bool, optional - Use per-thread default stream
        """

    def __enter__(self): ...
    def __exit__(self, *args): ...

    def synchronize(self): ...
    def add_callback(self, callback, arg): ...
    def record(self, event=None): ...
    def wait_event(self, event): ...

    @property
    def ptr(self):
        """Get the stream pointer."""

class ExternalStream:
    """
    Wrapper for an external CUDA stream.

    Allows integration with CUDA streams created outside of CuPy.
    """
    def __init__(self, ptr): ...

def get_current_stream():
    """
    Get the current CUDA stream.
    """

class Event:
    """
    CUDA event for synchronization.

    Events provide a way to monitor the progress of operations
    and synchronize between different streams.
    """
    def __init__(self, block=True, disable_timing=False, interprocess=False):
        """
        Parameters:
            block: bool, optional - Use blocking synchronization
            disable_timing: bool, optional - Disable timing measurements
            interprocess: bool, optional - Enable interprocess usage
        """

    def record(self, stream=None): ...
    def synchronize(self): ...
    def query(self): ...
    def elapsed_time(self, end_event): ...

def get_elapsed_time(start_event, end_event):
    """
    Get elapsed time between events.

    Parameters:
        start_event: Event - Start event
        end_event: Event - End event
    """
```

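`ExternalStream` lets CuPy enqueue work on a stream it did not create. A sketch of interop with PyTorch; the `torch` calls are an assumption here, and any raw `cudaStream_t` pointer would work the same way:

```python
import cupy as cp
import torch  # assumed available with CUDA support

# Borrow the stream PyTorch is currently using
ptr = torch.cuda.current_stream().cuda_stream
with cp.cuda.ExternalStream(ptr):
    x = cp.random.rand(1000)
    y = x * 2  # runs on the borrowed stream, ordered with PyTorch's work
```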

### CUDA Graphs

CUDA graphs for optimizing sequences of operations.

```python { .api }
class Graph:
    """
    CUDA graph for capturing and replaying operation sequences.

    Graphs allow capturing a sequence of CUDA operations and
    replaying them efficiently with reduced launch overhead.
    """
    def __init__(self): ...

    def begin_capture(self, stream=None): ...
    def end_capture(self, stream=None): ...
    def launch(self, stream=None): ...
    def debug_dot_print(self, path): ...
```

### Kernels and Modules

CUDA kernel compilation and execution management.

```python { .api }
class Function:
    """
    CUDA function object.

    Represents a compiled CUDA kernel function that can be launched
    with specified grid and block dimensions.
    """
    def __init__(self, module, name):
        """
        Parameters:
            module: Module - CUDA module containing the function
            name: str - Function name
        """

    def __call__(self, grid, block, args, **kwargs): ...

    @property
    def attributes(self):
        """Get function attributes."""

class Module:
    """
    CUDA module containing compiled device code.

    Modules contain one or more CUDA kernels and can be loaded
    from PTX or CUBIN code.
    """
    def __init__(self): ...

    def get_function(self, name): ...
    def get_global(self, name): ...
    def get_texref(self, name): ...

    @classmethod
    def load_file(cls, filename): ...

    @classmethod
    def load_from_string(cls, source): ...
```

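A sketch of loading a prebuilt module and launching one of its kernels through the API above. The file name `vector_ops.ptx`, the kernel name `scale`, and its signature are hypothetical:

```python
import cupy as cp
import cupy.cuda as cuda

# Load device code compiled ahead of time (hypothetical PTX file)
module = cuda.Module.load_file("vector_ops.ptx")
scale = module.get_function("scale")  # launchable Function object

n = 4096
x = cp.random.rand(n, dtype=cp.float32)

# Launch: grid dims, block dims, then the argument tuple;
# scalars are passed as sized types to match the kernel signature
threads = 256
blocks = (n + threads - 1) // threads
scale((blocks,), (threads,), (x, cp.float32(2.0), cp.int32(n)))
```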

### Memory Hooks

Hooks for monitoring and controlling memory allocation behavior.

```python { .api }
class MemoryHook:
    """
    Base class for memory allocation hooks.

    Memory hooks allow monitoring and customization of memory
    allocation and deallocation operations.
    """
    def alloc_preprocess(self, **kwargs): ...
    def alloc_postprocess(self, mem): ...
    def free_preprocess(self, mem): ...
    def free_postprocess(self, mem): ...
```

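In CuPy, memory hooks are activated as context managers; while one is entered, pool allocations invoke its callbacks. A minimal sketch against the base class above (exact callback signatures vary across CuPy versions):

```python
import cupy as cp
import cupy.cuda as cuda

class LoggingHook(cuda.MemoryHook):
    name = "logging-hook"

    def alloc_preprocess(self, **kwargs):
        # kwargs typically carry the device ID and the requested size
        print(f"about to allocate: {kwargs}")

# While the hook is active, pool allocations trigger its callbacks
with LoggingHook():
    a = cp.zeros(1024)
```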

### Profiling and Debugging

Tools for profiling and debugging CUDA applications.

```python { .api }
def profile():
    """
    Context manager for CUDA profiling (deprecated).

    Note: This is deprecated. Use cupyx.profiler.profile() instead.
    """
```

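The non-deprecated replacement lives in `cupyx.profiler`. A minimal sketch; run the script under Nsight Systems (or another CUDA profiler) so the marked region is captured:

```python
import cupy as cp
from cupyx.profiler import profile

# Only work inside the context manager is recorded by the profiler
with profile():
    x = cp.random.rand(1000, 1000)
    y = x @ x
    cp.cuda.synchronize()
```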

### Environment Information

Functions for querying CUDA runtime and environment information.

```python { .api }
def get_local_runtime_version():
    """
    Get the local CUDA runtime version.
    """

def get_cuda_path():
    """
    Get the CUDA installation path.
    """

def get_nvcc_path():
    """
    Get the path to the nvcc compiler.
    """

def get_rocm_path():
    """
    Get the ROCm installation path (for AMD GPUs).
    """

def get_hipcc_path():
    """
    Get the path to the hipcc compiler (for AMD GPUs).
    """
```

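A quick sketch that prints the environment details exposed by the API above; path lookups may return None when a component is not installed:

```python
import cupy.cuda as cuda

print("CUDA runtime version:", cuda.get_local_runtime_version())
print("CUDA path:", cuda.get_cuda_path())
print("nvcc path:", cuda.get_nvcc_path())
```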
471

472

### Low-level API Access

473

474

Access to low-level CUDA APIs for advanced users.

475

476

```python { .api }

477

# CUDA Driver API

478

driver = cupy.cuda.driver

479

480

# CUDA Runtime API

481

runtime = cupy.cuda.runtime

482

483

# NVRTC Compiler API

484

nvrtc = cupy.cuda.nvrtc

485

486

# Backend library wrappers (lazy-loaded)

487

cublas = cupy.cuda.cublas # cuBLAS operations

488

cusolver = cupy.cuda.cusolver # cuSOLVER linear algebra

489

cusparse = cupy.cuda.cusparse # cuSPARSE sparse operations

490

curand = cupy.cuda.curand # cuRAND random numbers

491

nvtx = cupy.cuda.nvtx # NVTX profiling markers

492

```

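A small sketch using the runtime wrapper to enumerate devices and query free memory; these wrappers mirror the corresponding CUDA runtime calls:

```python
import cupy as cp
from cupy.cuda import runtime

for dev_id in range(runtime.getDeviceCount()):
    with cp.cuda.Device(dev_id):
        free_b, total_b = runtime.memGetInfo()  # cudaMemGetInfo
        print(f"GPU {dev_id}: {free_b / 2**20:.0f} MiB free "
              f"of {total_b / 2**20:.0f} MiB")
```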

## Usage Examples

```python
import cupy as cp
import cupy.cuda as cuda
import numpy as np

# Device management
print(f"Current device: {cuda.get_device_id()}")
print(f"CUDA available: {cuda.is_available()}")

# Using specific devices
with cuda.Device(0):
    # Operations on device 0
    x = cp.array([1, 2, 3])

with cuda.Device(1):  # If multiple GPUs are available
    # Operations on device 1
    y = cp.array([4, 5, 6])

# Memory management
# Direct memory allocation
mem = cuda.alloc(1024)  # Allocate 1 KB
ptr = cuda.MemoryPointer(mem, 0, 1024)

# Using memory pools (recommended)
pool = cuda.MemoryPool()
with cuda.using_allocator(pool.malloc):
    # All allocations use the pool
    large_array = cp.zeros((10000, 10000))

# Memory pool statistics
print(f"Used memory: {pool.used_bytes()} bytes")
print(f"Total memory: {pool.total_bytes()} bytes")

# Stream management for asynchronous operations
stream1 = cuda.Stream()
stream2 = cuda.Stream()

with stream1:
    # Operations executed on stream1
    a = cp.random.rand(1000, 1000)
    b = cp.random.rand(1000, 1000)

with stream2:
    # Operations executed on stream2 (can overlap with stream1)
    c = cp.random.rand(1000, 1000)
    d = cp.random.rand(1000, 1000)

# Synchronization
stream1.synchronize()  # Wait for stream1 to complete
stream2.synchronize()  # Wait for stream2 to complete

# Event-based synchronization
event = cuda.Event()
with stream1:
    result1 = cp.dot(a, b)
    event.record()  # Record completion of operations

with stream2:
    stream2.wait_event(event)  # Wait for stream1 operations
    result2 = cp.dot(c, d) + result1  # Uses the result from stream1

# Measuring execution time with events
start_event = cuda.Event()
end_event = cuda.Event()

start_event.record()
# Some operations
large_computation = cp.dot(cp.random.rand(5000, 5000),
                           cp.random.rand(5000, 5000))
end_event.record()
end_event.synchronize()

elapsed_ms = cuda.get_elapsed_time(start_event, end_event)
print(f"Computation took {elapsed_ms} ms")

# Pinned memory for faster transfers
pinned_mem = cuda.alloc_pinned_memory(1000 * 8)  # 1000 float64s
# Pinned memory is host memory: expose it as a NumPy array through the
# buffer protocol rather than wrapping it in a device MemoryPointer
pinned_array = np.frombuffer(pinned_mem, dtype=np.float64, count=1000)

# Custom kernel example using RawKernel
kernel_code = r'''
extern "C" __global__
void vector_add(float* x, float* y, float* z, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        z[tid] = x[tid] + y[tid];
    }
}
'''

kernel = cp.RawKernel(kernel_code, 'vector_add')

# Launch custom kernel
n = 1000
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.zeros(n, dtype=cp.float32)

# Launch with an appropriate grid/block size; pass the scalar as a
# 32-bit integer to match the kernel's `int n` parameter
threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
kernel((blocks_per_grid,), (threads_per_block,), (x, y, z, cp.int32(n)))

# Memory hooks for monitoring
class MemoryTracker(cuda.MemoryHook):
    def __init__(self):
        self.allocated_bytes = 0
        self.freed_bytes = 0

    def alloc_postprocess(self, mem):
        self.allocated_bytes += mem.size
        print(f"Allocated {mem.size} bytes")

    def free_preprocess(self, mem):
        self.freed_bytes += mem.size
        print(f"Freed {mem.size} bytes")

tracker = MemoryTracker()
# Note: Memory hooks integration depends on CuPy version

# Working with CUDA graphs (CUDA 10.0+)
if hasattr(cuda, 'Graph'):
    graph = cuda.Graph()

    # Capture operations in a graph
    stream = cuda.Stream()
    with stream:
        graph.begin_capture(stream)

        # Operations to be captured
        x = cp.random.rand(1000, 1000)
        y = cp.random.rand(1000, 1000)
        z = x @ y

        graph.end_capture(stream)

    # Replay the graph multiple times efficiently
    for _ in range(10):
        graph.launch(stream)
    stream.synchronize()

# Multi-GPU computation example
def multi_gpu_computation(data_list):
    """Distribute computation across multiple GPUs."""
    n_gpus = cuda.runtime.getDeviceCount()
    streams = []
    results = []

    for i, data in enumerate(data_list[:n_gpus]):
        device_id = i % n_gpus
        with cuda.Device(device_id):
            stream = cuda.Stream()
            streams.append(stream)

            with stream:
                # Transfer data to this GPU
                gpu_data = cp.asarray(data)
                # Perform computation
                result = cp.sum(gpu_data ** 2)
                results.append(result)

    # Synchronize all streams
    for stream in streams:
        stream.synchronize()

    return results

# Memory bandwidth benchmark
def memory_bandwidth_test(size_mb=100):
    """Compare host-to-device bandwidth for pageable vs. pinned memory."""
    import time

    size_bytes = size_mb * 1024 * 1024
    n = size_bytes // 8

    # Regular (pageable) host memory
    host_data = np.random.rand(n)
    start = time.time()
    gpu_data1 = cp.asarray(host_data)
    cp.cuda.synchronize()
    regular_time = time.time() - start

    # Pinned host memory: stage the data in a page-locked buffer first
    pinned_mem = cuda.alloc_pinned_memory(size_bytes)
    pinned_data = np.frombuffer(pinned_mem, dtype=np.float64, count=n)
    pinned_data[...] = host_data
    start = time.time()
    gpu_data2 = cp.asarray(pinned_data)
    cp.cuda.synchronize()
    pinned_time = time.time() - start

    print(f"Pageable memory bandwidth: {size_mb / regular_time:.2f} MB/s")
    print(f"Pinned memory bandwidth:   {size_mb / pinned_time:.2f} MB/s")

# Advanced memory pool configuration
def configure_memory_pool():
    """Configure the memory pool for optimal performance."""
    # Get the default memory pool
    mempool = cp.get_default_memory_pool()

    # Optionally cap the pool size
    # mempool.set_limit(size=2**30)  # Limit to 1 GiB

    # Monitor memory usage
    print(f"Used bytes: {mempool.used_bytes()}")
    print(f"Total bytes: {mempool.total_bytes()}")

    # Release cached blocks that are no longer in use
    mempool.free_all_blocks()

    return mempool

# Context management for robust error handling
def safe_gpu_computation():
    """Example of robust GPU computation with proper cleanup."""
    stream = None
    temp_arrays = []

    try:
        stream = cuda.Stream()

        with stream:
            # Temporary arrays that need cleanup
            temp1 = cp.random.rand(10000, 10000)
            temp2 = cp.random.rand(10000, 10000)
            temp_arrays.extend([temp1, temp2])

            # Main computation
            result = temp1 @ temp2

            # Synchronize to ensure completion
            stream.synchronize()

            return result

    except Exception as e:
        print(f"GPU computation failed: {e}")
        return None

    finally:
        # Cleanup resources
        if stream:
            stream.synchronize()

        # Drop references and return cached blocks to the pool
        del temp_arrays
        cp.get_default_memory_pool().free_all_blocks()
```

## Performance Optimization Tips

### Memory Management

```python
# Use memory pools to reduce allocation overhead
with cuda.using_allocator(cp.get_default_memory_pool().malloc):
    # All allocations reuse memory from the pool
    data = cp.zeros((10000, 10000))

# Pre-allocate large arrays when possible
workspace = cp.zeros((10000, 10000))  # Reuse this array

# Use appropriate memory types
regular_mem = cuda.alloc(1024)           # Regular device memory
managed_mem = cuda.malloc_managed(1024)  # Unified memory
```

### Stream Optimization

```python
# Overlap computation and memory transfers
# (host_data, current_data, and process_current_data are placeholders)
compute_stream = cuda.Stream()
transfer_stream = cuda.Stream()

with transfer_stream:
    # Asynchronous memory transfer; true overlap requires pinned host memory
    next_data = cp.asarray(host_data)

with compute_stream:
    # Computation that can proceed in parallel with the transfer
    result = process_current_data(current_data)
```

### Kernel Launch Optimization

```python
# Choose optimal grid/block dimensions
def optimal_launch_config(n, max_threads_per_block=1024):
    """Calculate an optimal CUDA launch configuration."""
    if n <= max_threads_per_block:
        return (1, n)
    else:
        threads_per_block = max_threads_per_block
        blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
        return (blocks_per_grid, threads_per_block)

grid, block = optimal_launch_config(1000000)
```

CuPy's CUDA integration delivers comprehensive low-level GPU programming capabilities: advanced memory management, asynchronous execution, custom kernel development, and performance optimization for high-performance computing, all while remaining compatible with the broader CUDA ecosystem.