Python wrapper for Nvidia CUDA parallel computation API with object cleanup, automatic error checking, and convenient abstractions.
```
npx @tessl/cli install tessl/pypi-pycuda@2025.1.0
```

A comprehensive Python wrapper for Nvidia's CUDA parallel computation API that provides Pythonic access to GPU computing capabilities. PyCUDA offers object cleanup tied to object lifetime (RAII pattern), automatic error checking that translates all CUDA errors into Python exceptions, and convenient abstractions like GPUArray for GPU memory management.
```
pip install pycuda
```

```python
import pycuda.driver as cuda
```

GPU array operations:

```python
import pycuda.gpuarray as gpuarray
```

Auto-initialization (convenient but less control):

```python
import pycuda.autoinit  # Automatically initializes CUDA context
```

Kernel compilation:

```python
from pycuda.compiler import SourceModule
```

```python
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.compiler import SourceModule
# Create GPU array from NumPy array
cpu_array = np.array([1, 2, 3, 4, 5], dtype=np.float32)
gpu_array = gpuarray.to_gpu(cpu_array)
# Perform operations on GPU
result = gpu_array * 2.0
# Copy result back to CPU
cpu_result = result.get()
print(cpu_result) # [2. 4. 6. 8. 10.]
# Manual kernel example
kernel_code = """
__global__ void double_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * 2.0f;
    }
}
"""
# Compile and run kernel
mod = SourceModule(kernel_code)
double_func = mod.get_function("double_array")
# Execute kernel
block_size = 256
grid_size = (len(cpu_array) + block_size - 1) // block_size
double_func(gpu_array, np.int32(len(cpu_array)),
            block=(block_size, 1, 1), grid=(grid_size, 1))
```

PyCUDA's layered architecture provides both low-level control and high-level convenience. This design enables everything from simple array operations to complex custom kernel development, with automatic resource cleanup and comprehensive error checking throughout.
Low-level CUDA driver API access providing direct control over contexts, devices, memory, streams, and events. This forms the foundation for all GPU operations.

```python
def init(flags: int = 0) -> None: ...
def mem_alloc(size: int) -> DeviceAllocation: ...
def mem_get_info() -> tuple[int, int]: ...
def memcpy_htod(dest: DeviceAllocation, src) -> None: ...
def memcpy_dtoh(dest, src: DeviceAllocation) -> None: ...
```
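As a brief sketch of how these calls fit together (assuming `pycuda.autoinit` has created a context; the array contents are illustrative):

```python
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context so the driver calls below have something to run against

# Explicit allocate / copy-in / copy-out round trip
host = np.arange(16, dtype=np.float32)
dev = cuda.mem_alloc(host.nbytes)   # raw DeviceAllocation
cuda.memcpy_htod(dev, host)         # host -> device

out = np.empty_like(host)
cuda.memcpy_dtoh(out, dev)          # device -> host
assert (out == host).all()

free_bytes, total_bytes = cuda.mem_get_info()  # current free/total GPU memory
```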
High-level NumPy-like interface for GPU arrays supporting arithmetic operations, slicing, broadcasting, and seamless interoperability with NumPy arrays.

```python
class GPUArray:
    def __init__(self, shape, dtype, allocator=None): ...
    def get(self) -> np.ndarray: ...
    def set(self, ary: np.ndarray) -> None: ...
    def __add__(self, other): ...
    def __mul__(self, other): ...
```
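A short sketch of typical GPUArray usage beyond the quick-start example above (context from `pycuda.autoinit`; shapes and values are illustrative):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.arange(10, dtype=np.float32))  # host -> GPU
b = gpuarray.zeros(10, dtype=np.float32)              # allocated directly on the GPU
b.set(np.ones(10, dtype=np.float32))                  # overwrite with host data

c = (a + b) * 2.0   # arithmetic and scalar broadcasting run on the GPU
print(c.get())      # copy the result back as a NumPy array
```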
Dynamic CUDA kernel compilation with source code generation, caching, and module management for both inline and file-based CUDA source code.

```python
class SourceModule:
    def __init__(self, source: str, **kwargs): ...
    def get_function(self, name: str) -> Function: ...
    def get_global(self, name: str) -> tuple[DeviceAllocation, int]: ...
```
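As a sketch of module-level symbol access, `get_global` returns a (pointer, size) pair for a `__constant__` or `__device__` symbol so it can be filled from the host before launching a kernel (context from `pycuda.autoinit` assumed):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float scale;                  // module-level constant
__global__ void apply_scale(float *a) {
    a[threadIdx.x] *= scale;
}
""")

# Fill the __constant__ symbol through its (pointer, size) pair
ptr, size = mod.get_global("scale")
cuda.memcpy_htod(ptr, np.float32(3.0))

a = gpuarray.to_gpu(np.ones(32, dtype=np.float32))
mod.get_function("apply_scale")(a, block=(32, 1, 1), grid=(1, 1))
print(a.get())  # all elements become 3.0
```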
Pre-built, optimized kernels for common parallel operations including element-wise operations, reductions, and prefix scans with automatic type handling.

```python
class ElementwiseKernel:
    def __init__(self, arguments: str, operation: str, **kwargs): ...
    def __call__(self, *args, **kwargs): ...

class ReductionKernel:
    def __init__(self, dtype, neutral: str, reduce_expr: str, **kwargs): ...
    def __call__(self, gpu_array): ...
```
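A sketch combining both kernel types (context from `pycuda.autoinit`; sizes are illustrative):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.reduction import ReductionKernel

# z[i] = a*x[i] + b*y[i], compiled once and reusable for any conforming arrays
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a * x[i] + b * y[i]",
    "lin_comb")

# Dot product expressed as a map (x[i]*y[i]) plus a reduction (a+b)
dot = ReductionKernel(np.float32, neutral="0",
                      reduce_expr="a+b", map_expr="x[i]*y[i]",
                      arguments="float *x, float *y")

x = gpuarray.to_gpu(np.random.rand(1024).astype(np.float32))
y = gpuarray.to_gpu(np.random.rand(1024).astype(np.float32))
z = gpuarray.empty_like(x)

lin_comb(2.0, x, 3.0, y, z)
print(dot(x, y).get())  # scalar result returned as a 0-d GPUArray
```

Both kernel objects are compiled at construction time and can be reused across calls, which is why they are preferred over rebuilding a SourceModule for each operation.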
CUDA math function wrappers providing GPU-accelerated mathematical operations for arrays including trigonometric, exponential, and logarithmic functions.

```python
def sin(array, **kwargs): ...
def cos(array, **kwargs): ...
def exp(array, **kwargs): ...
def log(array, **kwargs): ...
def sqrt(array, **kwargs): ...
```
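A brief sketch of applying these wrappers to a GPUArray (context from `pycuda.autoinit`):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

x = gpuarray.to_gpu(np.linspace(0.1, 1.0, 8).astype(np.float32))

# Each call launches an element-wise kernel and returns a new GPUArray
print(cumath.sin(x).get())
print(cumath.exp(x).get())
print(cumath.log(x).get())   # matches np.log(x.get()) to float32 precision
print(cumath.sqrt(x).get())
```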
GPU-accelerated random number generation with support for various distributions and reproducible seeding for scientific computing applications.

```python
def rand(shape, dtype=np.float32, stream=None): ...
def seed_getter_uniform(n: int): ...
def seed_getter_unique(n: int): ...
```
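A minimal sketch of GPU-side random number generation; the generator class used in the second half is a common pairing with the seed getters above, but its exact signature should be checked against the installed PyCUDA version (context from `pycuda.autoinit`):

```python
import numpy as np
import pycuda.autoinit
import pycuda.curandom as curandom

# Quick one-shot generation: uniform floats in [0, 1) directly on the GPU
samples = curandom.rand((4, 4), dtype=np.float32)
print(samples.get())

# For reproducible streams, a generator class takes a seed getter
gen = curandom.XORWOWRandomNumberGenerator(seed_getter=curandom.seed_getter_uniform)
uniform = gen.gen_uniform((1000,), dtype=np.float32)
normal = gen.gen_normal((1000,), dtype=np.float32)
```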
Integration with OpenGL for graphics programming, allowing sharing of buffer objects and textures between CUDA and OpenGL contexts.

```python
def init() -> None: ...
def make_context(device: Device) -> Context: ...
class BufferObject: ...
class RegisteredBuffer: ...
```
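OpenGL interop requires a current OpenGL context and an existing GL buffer; the sketch below assumes both (the buffer id `vbo` is hypothetical, e.g. created with PyOpenGL) and shows the register/map pattern:

```python
import pycuda.driver as cuda
import pycuda.gl as cudagl

cuda.init()
ctx = cudagl.make_context(cuda.Device(0))  # CUDA context that shares with the current GL context

# vbo: id of an already-created OpenGL buffer object (hypothetical, e.g. from glGenBuffers)
reg_buf = cudagl.RegisteredBuffer(int(vbo))

mapping = reg_buf.map()                        # make the GL buffer visible to CUDA
dev_ptr, size = mapping.device_ptr_and_size()  # raw device pointer usable by kernels
# ... launch kernels that read/write dev_ptr ...
mapping.unmap()                                # hand the buffer back to OpenGL

ctx.pop()
```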
```python
class Device:
    def count() -> int: ...
    def get_device(device_no: int) -> Device: ...
    def compute_capability() -> tuple[int, int]: ...
    def name() -> str: ...

class Context:
    def __init__(self, device: Device, flags: int = 0): ...
    def push(self) -> None: ...
    def pop(self) -> Context: ...
    def get_device() -> Device: ...

class DeviceAllocation:
    def __int__(self) -> int: ...
    def __len__(self) -> int: ...

class Function:
    def __call__(self, *args, **kwargs) -> None: ...
    def prepare(self, arg_types) -> PreparedFunction: ...

class Stream:
    def __init__(self, flags: int = 0): ...
    def synchronize(self) -> None: ...
    def is_done() -> bool: ...

class Event:
    def __init__(self, flags: int = 0): ...
    def record(self, stream: Stream = None) -> None: ...
    def synchronize(self) -> None: ...
    def query() -> bool: ...
    def time_since(self, start_event: Event) -> float: ...
```
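As a sketch of how Event and Stream are typically used for timing and non-blocking status checks (context from `pycuda.autoinit`; the workload is illustrative):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.random.rand(1 << 20).astype(np.float32))

start, stop = cuda.Event(), cuda.Event()
start.record()                 # timestamp before the GPU work
b = a * 2.0                    # work to be timed
stop.record()
stop.synchronize()             # wait until the stop event has actually happened
print("elapsed: %.3f ms" % stop.time_since(start))

stream = cuda.Stream()
# Streams can be polled without blocking the host
print(stream.is_done())        # True once everything queued on the stream has finished
```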