or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

array-creation.md · cuda-management.md · fft.md · index.md · kernels.md · linear-algebra.md · math-functions.md · random.md · scipy-extensions.md · sparse.md · statistics.md

docs/cuda-management.md

0

# CUDA Memory and Device Management

1

2

Low-level CUDA functionality for memory allocation, device management, and stream operations. These features enable fine-grained control over GPU resources and memory optimization for high-performance computing applications.

3

4

## Capabilities

5

6

### Device Management

7

8

Control and query CUDA devices and contexts.

9

10

```python { .api }

11

class Device:

12

"""

13

CUDA device context manager.

14

15

Parameters:

16

- device: int or None, device ID to use (None for current)

17

"""

18

def __init__(self, device=None): ...

19

def __enter__(self): ...

20

def __exit__(self, *args): ...

21

def use(self): ...

22

def synchronize(self): ...

23

24

@property

25

def id(self): ...

26

27

def get_device_id():

28

"""

29

Get current device ID.

30

31

Returns:

32

int: Current CUDA device ID

33

"""

34

35

def is_available():

36

"""

37

Check if CUDA is available.

38

39

Returns:

40

bool: True if CUDA devices are available

41

"""

42

43

def get_local_runtime_version():

44

"""

45

Get local CUDA runtime version.

46

47

Returns:

48

int: CUDA runtime version

49

"""

50

51

def get_cublas_handle():

52

"""

53

Get cuBLAS handle for current device.

54

55

Returns:

56

int: cuBLAS handle

57

"""

58

```

59

60

### Memory Management

61

62

GPU memory allocation and management with automatic pooling.

63

64

```python { .api }

65

def alloc(size):

66

"""

67

Allocate device memory.

68

69

Parameters:

70

- size: int, size in bytes to allocate

71

72

Returns:

73

MemoryPointer: Pointer to the allocated GPU memory

74

"""

75

76

def malloc_managed(size):

77

"""

78

Allocate managed (unified) memory.

79

80

Parameters:

81

- size: int, size in bytes to allocate

82

83

Returns:

84

ManagedMemory: Managed memory object accessible from CPU and GPU

85

"""

86

87

def malloc_async(size, stream=None):

88

"""

89

Allocate memory asynchronously.

90

91

Parameters:

92

- size: int, size in bytes to allocate

93

- stream: Stream or None, CUDA stream for allocation

94

95

Returns:

96

MemoryAsync: Asynchronous memory object

97

"""

98

99

class Memory:

100

"""Device memory object."""

101

@property

102

def ptr(self): ...

103

@property

104

def size(self): ...

105

def __int__(self): ...

106

107

class ManagedMemory:

108

"""Managed memory object accessible from CPU and GPU."""

109

@property

110

def ptr(self): ...

111

@property

112

def size(self): ...

113

114

class MemoryAsync:

115

"""Asynchronous memory object."""

116

@property

117

def ptr(self): ...

118

@property

119

def size(self): ...

120

121

class MemoryPointer:

122

"""

123

Pointer to device memory with automatic memory management.

124

125

Parameters:

126

- mem: Memory, underlying memory object

127

- offset: int, offset in bytes from memory start

128

"""

129

def __init__(self, mem, offset): ...

130

@property

131

def ptr(self): ...

132

@property

133

def size(self): ...

134

135

class UnownedMemory:

136

"""Wrapper for externally managed memory."""

137

def __init__(self, ptr, size, owner): ...

138

```

139

140

### Memory Pools

141

142

Efficient memory allocation through pooling to reduce allocation overhead.

143

144

```python { .api }

145

class MemoryPool:

146

"""

147

Memory pool for device memory allocation.

148

149

Parameters:

150

- allocator: function or None, custom allocator function

151

"""

152

def __init__(self, allocator=None): ...

153

154

def malloc(self, size):

155

"""

156

Allocate memory from pool.

157

158

Parameters:

159

- size: int, size in bytes

160

161

Returns:

162

MemoryPointer: Pointer to allocated memory

163

"""

164

165

def free_all_blocks(self):

166

"""Free all allocated blocks in pool."""

167

168

def free_all_free(self):

169

"""Free all currently unused blocks."""

170

171

def n_free_blocks(self):

172

"""

173

Number of free blocks.

174

175

Returns:

176

int: Number of free blocks

177

"""

178

179

def used_bytes(self):

180

"""

181

Total bytes in use.

182

183

Returns:

184

int: Bytes currently allocated

185

"""

186

187

def free_bytes(self):

188

"""

189

Total bytes in free blocks.

190

191

Returns:

192

int: Bytes in free blocks

193

"""

194

195

def total_bytes(self):

196

"""

197

Total bytes managed by pool.

198

199

Returns:

200

int: Total bytes (used + free)

201

"""

202

203

class MemoryAsyncPool:

204

"""Asynchronous memory pool."""

205

def __init__(self, allocator=None): ...

206

def malloc(self, size, stream=None): ...

207

def free_all_blocks(self): ...

208

209

def get_default_memory_pool():

210

"""

211

Get default GPU memory pool.

212

213

Returns:

214

MemoryPool: Default memory pool for current device

215

"""

216

217

def get_default_pinned_memory_pool():

218

"""

219

Get default pinned memory pool.

220

221

Returns:

222

PinnedMemoryPool: Default pinned memory pool

223

"""

224

```

225

226

### Memory Allocators

227

228

Custom memory allocation strategies.

229

230

```python { .api }

231

class PythonFunctionAllocator:

232

"""

233

Python function-based memory allocator.

234

235

Parameters:

236

- func: function, allocator function taking size and returning Memory

237

"""

238

def __init__(self, func): ...

239

240

class CFunctionAllocator:

241

"""

242

C function-based memory allocator.

243

244

Parameters:

245

- intptr: int, pointer to C allocator function

246

"""

247

def __init__(self, intptr): ...

248

249

def set_allocator(allocator):

250

"""

251

Set thread-local memory allocator.

252

253

Parameters:

254

- allocator: function or None, allocator function

255

"""

256

257

def get_allocator():

258

"""

259

Get current thread-local allocator.

260

261

Returns:

262

function: Current allocator function

263

"""

264

265

def using_allocator(allocator=None):

266

"""

267

Context manager for temporary allocator.

268

269

Parameters:

270

- allocator: function or None, temporary allocator

271

272

Returns:

273

context manager: Restores previous allocator on exit

274

"""

275

```

276

277

### Pinned Memory Management

278

279

Host memory that can be accessed efficiently by GPU.

280

281

```python { .api }

282

def alloc_pinned_memory(size):

283

"""

284

Allocate pinned host memory.

285

286

Parameters:

287

- size: int, size in bytes to allocate

288

289

Returns:

290

PinnedMemoryPointer: Pointer to the allocated pinned host memory

291

"""

292

293

class PinnedMemory:

294

"""Pinned host memory object."""

295

@property

296

def ptr(self): ...

297

@property

298

def size(self): ...

299

def __int__(self): ...

300

301

class PinnedMemoryPointer:

302

"""

303

Pointer to pinned memory.

304

305

Parameters:

306

- mem: PinnedMemory, underlying memory object

307

- offset: int, offset in bytes

308

"""

309

def __init__(self, mem, offset): ...

310

@property

311

def ptr(self): ...

312

@property

313

def size(self): ...

314

315

class PinnedMemoryPool:

316

"""

317

Memory pool for pinned memory allocation.

318

319

Parameters:

320

- allocator: function or None, custom allocator

321

"""

322

def __init__(self, allocator=None): ...

323

def malloc(self, size): ...

324

def free_all_blocks(self): ...

325

def used_bytes(self): ...

326

def free_bytes(self): ...

327

def total_bytes(self): ...

328

329

def set_pinned_memory_allocator(allocator):

330

"""

331

Set pinned memory allocator.

332

333

Parameters:

334

- allocator: function or None, allocator function

335

"""

336

```

337

338

### Stream Management

339

340

CUDA streams for asynchronous execution and memory operations.

341

342

```python { .api }

343

class Stream:

344

"""

345

CUDA stream for asynchronous operations.

346

347

Parameters:

348

- null: bool, whether to use null (default) stream

349

- non_blocking: bool, whether stream can run concurrently with null stream

350

- priority: int, stream priority (lower = higher priority)

351

"""

352

def __init__(self, null=False, non_blocking=False, priority=0): ...

353

354

def synchronize(self):

355

"""Wait for all operations in stream to complete."""

356

357

def add_callback(self, callback, arg):

358

"""

359

Add callback to be called when stream operations complete.

360

361

Parameters:

362

- callback: function, callback function

363

- arg: object, argument to pass to callback

364

"""

365

366

def record(self, event=None):

367

"""

368

Record event in stream.

369

370

Parameters:

371

- event: Event or None, event to record

372

373

Returns:

374

Event: Recorded event

375

"""

376

377

def wait_event(self, event):

378

"""

379

Make stream wait for event.

380

381

Parameters:

382

- event: Event, event to wait for

383

"""

384

385

@property

386

def ptr(self): ...

387

388

class ExternalStream:

389

"""

390

Wrapper for externally created CUDA stream.

391

392

Parameters:

393

- ptr: int, pointer to existing CUDA stream

394

"""

395

def __init__(self, ptr): ...

396

def synchronize(self): ...

397

@property

398

def ptr(self): ...

399

400

def get_current_stream():

401

"""

402

Get current CUDA stream.

403

404

Returns:

405

Stream: Current stream for active device

406

"""

407

```

408

409

### Event Management

410

411

CUDA events for synchronization and timing.

412

413

```python { .api }

414

class Event:

415

"""

416

CUDA event for synchronization and timing.

417

418

Parameters:

419

- block: bool, whether to block host thread

420

- disable_timing: bool, whether to disable timing capability

421

- interprocess: bool, whether event can be shared between processes

422

"""

423

def __init__(self, block=False, disable_timing=False, interprocess=False): ...

424

425

def record(self, stream=None):

426

"""

427

Record event in stream.

428

429

Parameters:

430

- stream: Stream or None, stream to record in

431

"""

432

433

def synchronize(self):

434

"""Wait for event to complete."""

435

436

def query(self):

437

"""

438

Query event completion status.

439

440

Returns:

441

bool: True if event has completed

442

"""

443

444

@property

445

def ptr(self): ...

446

447

def get_elapsed_time(start_event, end_event):

448

"""

449

Get elapsed time between events.

450

451

Parameters:

452

- start_event: Event, start event

453

- end_event: Event, end event

454

455

Returns:

456

float: Elapsed time in milliseconds

457

"""

458

```

459

460

### CUDA Graphs

461

462

Capture and replay sequences of CUDA operations.

463

464

```python { .api }

465

class Graph:

466

"""CUDA graph for capturing and replaying operation sequences."""

467

def __init__(self): ...

468

469

def capture_begin(self, stream=None):

470

"""

471

Begin capturing operations into graph.

472

473

Parameters:

474

- stream: Stream or None, stream to capture

475

"""

476

477

def capture_end(self, stream=None):

478

"""

479

End capturing operations.

480

481

Parameters:

482

- stream: Stream or None, stream that was captured

483

"""

484

485

def launch(self, stream=None):

486

"""

487

Launch (replay) captured graph.

488

489

Parameters:

490

- stream: Stream or None, stream to launch in

491

"""

492

```

493

494

### Data Transfer Utilities

495

496

High-level utilities for CPU-GPU data transfer.

497

498

```python { .api }

499

def asnumpy(a, stream=None, order='C', out=None, *, blocking=True):

500

"""

501

Transfer CuPy array to NumPy array on CPU.

502

503

Parameters:

504

- a: cupy.ndarray, GPU array to transfer

505

- stream: Stream or None, CUDA stream for async transfer

506

- order: {'C', 'F', 'A'}, memory layout of result

507

- out: numpy.ndarray or None, pre-allocated output array

508

- blocking: bool, whether to block until transfer complete

509

510

Returns:

511

numpy.ndarray: CPU array with copied data

512

"""

513

514

def get_array_module(*args):

515

"""

516

Get appropriate array module (cupy or numpy) based on input types.

517

518

Parameters:

519

- args: array-like objects to check

520

521

Returns:

522

module: cupy if any arg is CuPy array, otherwise numpy

523

"""

524

```

525

526

## Usage Examples

527

528

### Basic Device and Memory Management

529

530

```python

531

import cupy as cp

532

533

# Device management

534

print(f"CUDA available: {cp.cuda.is_available()}")

535

print(f"Current device: {cp.cuda.get_device_id()}")

536

537

# Switch devices

538

with cp.cuda.Device(1): # Use device 1 within context

539

arr = cp.zeros((1000, 1000))

540

print(f"Array on device: {arr.device.id}")

541

542

# Memory pool management

543

pool = cp.get_default_memory_pool()

544

print(f"Memory usage: {pool.used_bytes()} bytes")

545

print(f"Free blocks: {pool.n_free_blocks()}")

546

547

# Free unused memory

548

pool.free_all_blocks()

549

```

550

551

### Advanced Memory Management

552

553

```python

554

# Custom allocator for memory tracking

555

def tracking_allocator(size):

556

print(f"Allocating {size} bytes")

557

return cp.cuda.alloc(size)

558

559

# Use custom allocator temporarily

560

with cp.cuda.using_allocator(tracking_allocator):

561

arr = cp.ones((1000, 1000)) # Will print allocation size

562

563

# Pinned memory for faster transfers

564

pinned_mem = cp.cuda.alloc_pinned_memory(1000 * 8) # 1000 float64s

565

gpu_arr = cp.zeros(1000)

566

567

# Async memory allocation (when supported)

568

stream = cp.cuda.Stream()

569

async_mem = cp.cuda.malloc_async(1000 * 4, stream)

570

```

571

572

### Stream-based Asynchronous Operations

573

574

```python

575

# Create streams for concurrent execution

576

stream1 = cp.cuda.Stream()

577

stream2 = cp.cuda.Stream()

578

579

# Perform operations on different streams

580

with stream1:

581

arr1 = cp.random.random((1000, 1000))

582

result1 = cp.dot(arr1, arr1)

583

584

with stream2:

585

arr2 = cp.random.random((1000, 1000))

586

result2 = cp.dot(arr2, arr2)

587

588

# Synchronize streams

589

stream1.synchronize()

590

stream2.synchronize()

591

592

# Events for timing and synchronization

593

start_event = cp.cuda.Event()

594

end_event = cp.cuda.Event()

595

596

start_event.record()

597

# ... GPU operations ...

598

end_event.record()

599

end_event.synchronize()

600

601

elapsed_time = cp.cuda.get_elapsed_time(start_event, end_event)

602

print(f"Operation took {elapsed_time:.2f} ms")

603

```

604

605

### Efficient Data Transfer Patterns

606

607

```python

608

# Asynchronous transfers with streams

609

stream = cp.cuda.Stream()

610

611

# CPU array

612

cpu_data = np.random.random((10000, 1000))  # requires `import numpy as np`

613

614

# Transfer to GPU asynchronously

615

gpu_data = cp.asarray(cpu_data) # Synchronous by default

616

617

# For truly async transfer, use lower-level operations

618

gpu_buffer = cp.empty_like(cpu_data)

619

# ... use CUDA runtime API for async memcpy ...

620

621

# Transfer results back to CPU

622

with stream:

623

result_gpu = cp.dot(gpu_data, gpu_data.T)

624

625

# Async transfer back (non-blocking)

626

result_cpu = cp.asnumpy(result_gpu, stream=stream, blocking=False)

627

stream.synchronize() # Wait for completion

628

```

629

630

### Memory Optimization Strategies

631

632

```python

633

# Monitor memory usage

634

def print_memory_info():

635

pool = cp.get_default_memory_pool()

636

print(f"Used: {pool.used_bytes() / 1e9:.2f} GB")

637

print(f"Free: {pool.free_bytes() / 1e9:.2f} GB")

638

639

print_memory_info()

640

641

# Large computation with memory management

642

for i in range(100):

643

# Large temporary arrays

644

temp = cp.random.random((5000, 5000))

645

result = cp.dot(temp, temp)

646

647

# Explicit cleanup every 10 iterations

648

if i % 10 == 0:

649

del temp, result

650

cp.get_default_memory_pool().free_all_blocks()

651

print_memory_info()

652

653

# Use memory mapping for very large datasets

654

# (requires careful memory management)

655

```