# PyCUDA

A comprehensive Python wrapper for Nvidia's CUDA parallel computation API that provides Pythonic access to GPU computing capabilities. PyCUDA offers object cleanup tied to object lifetime (RAII pattern), automatic error checking that translates all CUDA errors into Python exceptions, and convenient abstractions like GPUArray for GPU memory management.

## Package Information

- **Package Name**: pycuda
- **Language**: Python with C++ extensions
- **Installation**: `pip install pycuda`
- **Documentation**: https://documen.tician.de/pycuda
- **License**: MIT

## Core Imports

```python
import pycuda.driver as cuda
```

GPU array operations:

```python
import pycuda.gpuarray as gpuarray
```

Auto-initialization (convenient, but offers less control over context creation):

```python
import pycuda.autoinit  # Automatically creates and activates a CUDA context
```

Kernel compilation:

```python
from pycuda.compiler import SourceModule
```

## Basic Usage

```python
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import numpy as np

# Create GPU array from NumPy array
cpu_array = np.array([1, 2, 3, 4, 5], dtype=np.float32)
gpu_array = gpuarray.to_gpu(cpu_array)

# Perform operations on GPU
result = gpu_array * 2.0

# Copy result back to CPU
cpu_result = result.get()
print(cpu_result)  # [ 2.  4.  6.  8. 10.]

# Manual kernel example
kernel_code = """
__global__ void double_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * 2.0f;
    }
}
"""

# Compile and run kernel
mod = SourceModule(kernel_code)
double_func = mod.get_function("double_array")

# Execute kernel: one thread per element, rounded up to whole blocks
block_size = 256
grid_size = (len(cpu_array) + block_size - 1) // block_size
double_func(gpu_array, np.int32(len(cpu_array)),
            block=(block_size, 1, 1), grid=(grid_size, 1))
```

## Architecture

PyCUDA's layered architecture provides both low-level control and high-level convenience:

- **Driver Layer**: Direct access to CUDA driver API with Pythonic error handling and memory management
- **Compiler Layer**: Dynamic CUDA kernel compilation and module management with caching
- **GPUArray Layer**: NumPy-like interface for GPU arrays with automatic memory management
- **Algorithm Layer**: Pre-built kernels for common operations (elementwise, reduction, scan)
- **Utility Layer**: Helper functions, memory pools, and device characterization tools

This design enables everything from simple array operations to complex custom kernel development, with automatic resource cleanup and comprehensive error checking throughout.

## Capabilities

### Driver API

Low-level CUDA driver API access providing direct control over contexts, devices, memory, streams, and events. This forms the foundation for all GPU operations.

```python { .api }
def init(flags: int = 0) -> None: ...
def mem_alloc(size: int) -> DeviceAllocation: ...
def mem_get_info() -> tuple[int, int]: ...
def memcpy_htod(dest: DeviceAllocation, src) -> None: ...
def memcpy_dtoh(dest, src: DeviceAllocation) -> None: ...
```

[Driver API](./driver-api.md)

### GPU Arrays

High-level NumPy-like interface for GPU arrays supporting arithmetic operations, slicing, broadcasting, and seamless interoperability with NumPy arrays.

```python { .api }
class GPUArray:
    def __init__(self, shape, dtype, allocator=None): ...
    def get(self) -> np.ndarray: ...
    def set(self, ary: np.ndarray) -> None: ...
    def __add__(self, other): ...
    def __mul__(self, other): ...
```

[GPU Arrays](./gpu-arrays.md)

### Kernel Compilation

Dynamic CUDA kernel compilation with source code generation, caching, and module management for both inline and file-based CUDA source code.

```python { .api }
class SourceModule:
    def __init__(self, source: str, **kwargs): ...
    def get_function(self, name: str) -> Function: ...
    def get_global(self, name: str) -> tuple[DeviceAllocation, int]: ...
```

[Kernel Compilation](./kernel-compilation.md)

### Algorithm Kernels

Pre-built, optimized kernels for common parallel operations including element-wise operations, reductions, and prefix scans with automatic type handling.

```python { .api }
class ElementwiseKernel:
    def __init__(self, arguments: str, operation: str, **kwargs): ...
    def __call__(self, *args, **kwargs): ...

class ReductionKernel:
    def __init__(self, dtype_out, neutral: str, reduce_expr: str, **kwargs): ...
    def __call__(self, *args, **kwargs): ...
```

[Algorithm Kernels](./algorithm-kernels.md)

### Math Functions

CUDA math function wrappers providing GPU-accelerated mathematical operations for arrays including trigonometric, exponential, and logarithmic functions.

```python { .api }
def sin(array, **kwargs): ...
def cos(array, **kwargs): ...
def exp(array, **kwargs): ...
def log(array, **kwargs): ...
def sqrt(array, **kwargs): ...
```

[Math Functions](./math-functions.md)

### Random Number Generation

GPU-accelerated random number generation (in `pycuda.curandom`) with support for various distributions and reproducible seeding for scientific computing applications.

```python { .api }
def rand(shape, dtype=np.float32, stream=None): ...
def seed_getter_uniform(n: int): ...
def seed_getter_unique(n: int): ...
```

[Random Numbers](./random-numbers.md)

### OpenGL Interoperability

Integration with OpenGL for graphics programming, allowing sharing of buffer objects and textures between CUDA and OpenGL contexts.

```python { .api }
def init() -> None: ...
def make_context(device: Device) -> Context: ...
class BufferObject: ...
class RegisteredBuffer: ...
```

[OpenGL Integration](./opengl-integration.md)

## Common Types

```python { .api }
class Device:
    def __init__(self, dev_id: int): ...
    @staticmethod
    def count() -> int: ...
    def compute_capability(self) -> tuple[int, int]: ...
    def name(self) -> str: ...

class Context:
    def __init__(self, device: Device, flags: int = 0): ...
    def push(self) -> None: ...
    @staticmethod
    def pop() -> None: ...
    @staticmethod
    def get_device() -> Device: ...

class DeviceAllocation:
    def __int__(self) -> int: ...
    def __len__(self) -> int: ...

class Function:
    def __call__(self, *args, **kwargs) -> None: ...
    def prepare(self, arg_types) -> Function: ...

class Stream:
    def __init__(self, flags: int = 0): ...
    def synchronize(self) -> None: ...
    def is_done(self) -> bool: ...

class Event:
    def __init__(self, flags: int = 0): ...
    def record(self, stream: Stream = None) -> None: ...
    def synchronize(self) -> None: ...
    def query(self) -> bool: ...
    def time_since(self, start_event: Event) -> float: ...
```