tessl/pypi-pycuda

Python wrapper for Nvidia CUDA parallel computation API with object cleanup, automatic error checking, and convenient abstractions.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/pycuda@2025.1.x

To install, run:

npx @tessl/cli install tessl/pypi-pycuda@2025.1.0

# PyCUDA

A comprehensive Python wrapper for Nvidia's CUDA parallel computation API that provides Pythonic access to GPU computing capabilities. PyCUDA offers object cleanup tied to object lifetime (the RAII pattern), automatic error checking that translates all CUDA errors into Python exceptions, and convenient abstractions such as GPUArray for GPU memory management.

## Package Information

- **Package Name**: pycuda
- **Language**: Python with C++ extensions
- **Installation**: `pip install pycuda`
- **Documentation**: https://documen.tician.de/pycuda
- **License**: MIT

## Core Imports

```python
import pycuda.driver as cuda
```

GPU array operations:

```python
import pycuda.gpuarray as gpuarray
```

Auto-initialization (convenient but less control):

```python
import pycuda.autoinit  # Automatically initializes a CUDA context
```

Kernel compilation:

```python
from pycuda.compiler import SourceModule
```
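When `pycuda.autoinit` offers too little control, the context can be managed by hand with the driver API. A minimal sketch of that pattern, assuming a CUDA-capable GPU at call time (which is why the driver import sits inside the function; `run_with_manual_context` is an illustrative name, not a PyCUDA API):

```python
def run_with_manual_context(device_no=0):
    """Initialize the driver, create a context on one device, and
    guarantee the context is released afterwards (requires a CUDA GPU)."""
    import pycuda.driver as cuda

    cuda.init()                  # must precede any other driver call
    dev = cuda.Device(device_no)
    ctx = dev.make_context()     # becomes the current context
    try:
        free, total = cuda.mem_get_info()  # do GPU work here
        return free, total
    finally:
        ctx.pop()                # release the context when done
```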

## Basic Usage

```python
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.compiler import SourceModule

# Create GPU array from NumPy array
cpu_array = np.array([1, 2, 3, 4, 5], dtype=np.float32)
gpu_array = gpuarray.to_gpu(cpu_array)

# Perform operations on GPU
result = gpu_array * 2.0

# Copy result back to CPU
cpu_result = result.get()
print(cpu_result)  # [2. 4. 6. 8. 10.]

# Manual kernel example
kernel_code = """
__global__ void double_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * 2.0f;
    }
}
"""

# Compile the kernel and fetch the function
mod = SourceModule(kernel_code)
double_func = mod.get_function("double_array")

# Execute kernel: launch enough blocks to cover every element
block_size = 256
grid_size = (len(cpu_array) + block_size - 1) // block_size
double_func(gpu_array, np.int32(len(cpu_array)),
            block=(block_size, 1, 1), grid=(grid_size, 1))
```
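The ceil-division used for `grid_size` above is the standard way to cover `n` elements with fixed-size thread blocks. A small helper makes the arithmetic reusable (a sketch; `launch_config` is not part of PyCUDA):

```python
def launch_config(n, block_size=256):
    """Return (grid, block) tuples that cover n elements with 1-D blocks."""
    grid_size = (n + block_size - 1) // block_size  # ceil(n / block_size)
    return (grid_size, 1), (block_size, 1, 1)

grid, block = launch_config(5)   # 5 elements fit in a single 256-thread block
print(grid, block)               # (1, 1) (256, 1, 1)
```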

## Architecture

PyCUDA's layered architecture provides both low-level control and high-level convenience:

- **Driver Layer**: Direct access to the CUDA driver API with Pythonic error handling and memory management
- **Compiler Layer**: Dynamic CUDA kernel compilation and module management with caching
- **GPUArray Layer**: NumPy-like interface for GPU arrays with automatic memory management
- **Algorithm Layer**: Pre-built kernels for common operations (elementwise, reduction, scan)
- **Utility Layer**: Helper functions, memory pools, and device characterization tools

This design enables everything from simple array operations to complex custom kernel development, with automatic resource cleanup and comprehensive error checking throughout.

## Capabilities

### Driver API

Low-level CUDA driver API access providing direct control over contexts, devices, memory, streams, and events. This forms the foundation for all GPU operations.

```python { .api }
def init(flags: int = 0) -> None: ...
def mem_alloc(size: int) -> DeviceAllocation: ...
def mem_get_info() -> tuple[int, int]: ...
def memcpy_htod(dest: DeviceAllocation, src) -> None: ...
def memcpy_dtoh(dest, src: DeviceAllocation) -> None: ...
```

[Driver API](./driver-api.md)
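The typical driver-level workflow is allocate, copy host-to-device, operate, copy device-to-host. A minimal round-trip sketch of that pattern (a CUDA GPU is required at call time, so the driver imports live inside the function; `roundtrip` is an illustrative name, not a PyCUDA API):

```python
import numpy as np

def roundtrip(host_array):
    """Copy a NumPy array to the GPU and back using raw driver calls."""
    import pycuda.autoinit           # context setup (requires a CUDA GPU)
    import pycuda.driver as cuda

    gpu_buf = cuda.mem_alloc(host_array.nbytes)  # raw device allocation
    cuda.memcpy_htod(gpu_buf, host_array)        # host -> device
    out = np.empty_like(host_array)
    cuda.memcpy_dtoh(out, gpu_buf)               # device -> host
    return out
```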

### GPU Arrays

High-level NumPy-like interface for GPU arrays supporting arithmetic operations, slicing, broadcasting, and seamless interoperability with NumPy arrays.

```python { .api }
class GPUArray:
    def __init__(self, shape, dtype, allocator=None): ...
    def get(self) -> np.ndarray: ...
    def set(self, ary: np.ndarray) -> None: ...
    def __add__(self, other): ...
    def __mul__(self, other): ...
```

[GPU Arrays](./gpu-arrays.md)
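Round-tripping through `to_gpu` and `get` is the core interoperability pattern. A minimal sketch (a CUDA GPU is needed at call time, so the imports sit inside the function; `add_on_gpu` is an illustrative name, not a PyCUDA API):

```python
import numpy as np

def add_on_gpu(a, b):
    """Add two NumPy arrays on the GPU and return the result as NumPy."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.gpuarray as gpuarray

    return (gpuarray.to_gpu(a) + gpuarray.to_gpu(b)).get()

# Example (on a GPU machine):
# add_on_gpu(np.ones(4, np.float32), np.arange(4, dtype=np.float32))
```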

### Kernel Compilation

Dynamic CUDA kernel compilation with source code generation, caching, and module management for both inline and file-based CUDA source code.

```python { .api }
class SourceModule:
    def __init__(self, source: str, **kwargs): ...
    def get_function(self, name: str) -> Function: ...
    def get_global(self, name: str) -> tuple[DeviceAllocation, int]: ...
```

[Kernel Compilation](./kernel-compilation.md)
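Because kernels are compiled from plain strings at runtime, source can be generated in Python before being handed to `SourceModule`. A sketch of templated generation (`make_saxpy_source` and its parameters are illustrative, not part of PyCUDA):

```python
def make_saxpy_source(dtype_c="float", name="saxpy"):
    """Generate CUDA C source for y[i] += a * x[i], parameterized by type."""
    return f"""
__global__ void {name}({dtype_c} a, {dtype_c} *x, {dtype_c} *y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        y[i] += a * x[i];
    }}
}}
"""

source = make_saxpy_source("double", name="daxpy")
# On a machine with CUDA:
# mod = SourceModule(source)
# fn = mod.get_function("daxpy")
```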

131

132

### Algorithm Kernels

133

134

Pre-built, optimized kernels for common parallel operations including element-wise operations, reductions, and prefix scans with automatic type handling.

135

136

```python { .api }

137

class ElementwiseKernel:

138

def __init__(self, arguments: str, operation: str, **kwargs): ...

139

def __call__(self, *args, **kwargs): ...

140

141

class ReductionKernel:

142

def __init__(self, dtype, neutral: str, reduce_expr: str, **kwargs): ...

143

def __call__(self, gpu_array): ...

144

```

145

146

[Algorithm Kernels](./algorithm-kernels.md)
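A `ReductionKernel` combines elements pairwise with `reduce_expr`, using the `neutral` element wherever a slot has no partner. A pure-Python reference of that tree-shaped contract, for intuition only (the real kernel runs in parallel on the GPU):

```python
def tree_reduce(values, neutral, op):
    """CPU reference for a parallel tree reduction with a neutral element."""
    level = list(values)
    if not level:
        return neutral
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with the neutral element
            level.append(neutral)
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

print(tree_reduce([1, 2, 3, 4, 5], 0, lambda a, b: a + b))  # 15
```

Because `op` is applied in a tree rather than left to right, the result matches a sequential reduction only when `op` is associative with `neutral` as identity, which is exactly what `ReductionKernel` requires of `reduce_expr`.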

147

148

### Math Functions

149

150

CUDA math function wrappers providing GPU-accelerated mathematical operations for arrays including trigonometric, exponential, and logarithmic functions.

151

152

```python { .api }

153

def sin(array, **kwargs): ...

154

def cos(array, **kwargs): ...

155

def exp(array, **kwargs): ...

156

def log(array, **kwargs): ...

157

def sqrt(array, **kwargs): ...

158

```

159

160

[Math Functions](./math-functions.md)
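These wrappers operate on whole GPU arrays at once, much like NumPy ufuncs. A minimal sketch, assuming the wrappers live in `pycuda.cumath` (the module name used by PyCUDA's docs) and a CUDA GPU at call time:

```python
import numpy as np

def gpu_sin(host_array):
    """Apply sin elementwise on the GPU and return a NumPy array."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.cumath as cumath      # assumed home of the math wrappers
    import pycuda.gpuarray as gpuarray

    return cumath.sin(gpuarray.to_gpu(host_array)).get()

# On a GPU machine this should agree with np.sin(host_array)
# to float32 precision.
```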

161

162

### Random Number Generation

163

164

GPU-accelerated random number generation with support for various distributions and reproducible seeding for scientific computing applications.

165

166

```python { .api }

167

def rand(shape, dtype=np.float32, stream=None): ...

168

def seed_getter_uniform(n: int): ...

169

def seed_getter_unique(n: int): ...

170

```

171

172

[Random Numbers](./random-numbers.md)
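`rand` fills a freshly allocated GPU array of the requested shape; samples never touch the host until you ask for them. A sketch, assuming these helpers live in `pycuda.curandom` (the module name used by PyCUDA's docs) and a CUDA GPU at call time:

```python
def gpu_uniform(shape):
    """Draw uniform [0, 1) float32 samples directly on the GPU."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.curandom as curandom  # assumed home of rand/seed helpers

    return curandom.rand(shape).get()   # copy the samples back as NumPy
```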

173

174

### OpenGL Interoperability

175

176

Integration with OpenGL for graphics programming, allowing sharing of buffer objects and textures between CUDA and OpenGL contexts.

177

178

```python { .api }

179

def init() -> None: ...

180

def make_context(device: Device) -> Context: ...

181

class BufferObject: ...

182

class RegisteredBuffer: ...

183

```

184

185

[OpenGL Integration](./opengl-integration.md)

## Common Types

```python { .api }
class Device:
    def count() -> int: ...
    def get_device(device_no: int) -> Device: ...
    def compute_capability(self) -> tuple[int, int]: ...
    def name(self) -> str: ...

class Context:
    def __init__(self, device: Device, flags: int = 0): ...
    def push(self) -> None: ...
    def pop(self) -> Context: ...
    def get_device() -> Device: ...

class DeviceAllocation:
    def __int__(self) -> int: ...
    def __len__(self) -> int: ...

class Function:
    def __call__(self, *args, **kwargs) -> None: ...
    def prepare(self, arg_types) -> PreparedFunction: ...

class Stream:
    def __init__(self, flags: int = 0): ...
    def synchronize(self) -> None: ...
    def is_done(self) -> bool: ...

class Event:
    def __init__(self, flags: int = 0): ...
    def record(self, stream: Stream = None) -> None: ...
    def synchronize(self) -> None: ...
    def query(self) -> bool: ...
    def time_since(self, start_event: Event) -> float: ...
```
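Events are the usual way to time GPU work: record one event before and one after, synchronize on the second, then take the difference. A sketch of that pattern (a CUDA GPU is required at call time, so the driver import sits inside the function; `time_gpu` is an illustrative name, not a PyCUDA API):

```python
def time_gpu(fn, *args, **kwargs):
    """Time a GPU operation in milliseconds using CUDA events."""
    import pycuda.driver as cuda

    start, end = cuda.Event(), cuda.Event()
    start.record()
    fn(*args, **kwargs)           # e.g. a compiled kernel launch
    end.record()
    end.synchronize()             # block until the GPU reaches `end`
    return end.time_since(start)  # elapsed milliseconds
```

Kernel launches are asynchronous, so the `synchronize()` on the end event is what makes the measurement meaningful.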