tessl/pypi-pycuda

Python wrapper for NVIDIA's CUDA parallel computation API, with object cleanup tied to object lifetime, automatic error checking, and convenient abstractions.

Describes: pkg:pypi/pycuda@2025.1.x

To install, run:

npx @tessl/cli install tessl/pypi-pycuda@2025.1.0


PyCUDA

A comprehensive Python wrapper for NVIDIA's CUDA parallel computation API that provides Pythonic access to GPU computing. PyCUDA ties object cleanup to object lifetime (the RAII pattern), automatically translates CUDA errors into Python exceptions, and supplies convenient abstractions such as GPUArray for GPU memory management.

Package Information

  • Package Name: pycuda
  • Language: Python with C++ extensions
  • Installation: pip install pycuda
  • Documentation: https://documen.tician.de/pycuda
  • License: MIT

Core Imports

Low-level driver API access:

import pycuda.driver as cuda

GPU array operations:

import pycuda.gpuarray as gpuarray

Auto-initialization (convenient, but gives up explicit control over context creation):

import pycuda.autoinit  # Automatically initializes CUDA context

Kernel compilation:

from pycuda.compiler import SourceModule

Basic Usage

import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import numpy as np

# Create GPU array from NumPy array
cpu_array = np.array([1, 2, 3, 4, 5], dtype=np.float32)
gpu_array = gpuarray.to_gpu(cpu_array)

# Perform operations on GPU
result = gpu_array * 2.0

# Copy result back to CPU
cpu_result = result.get()
print(cpu_result)  # [ 2.  4.  6.  8. 10.]

# Manual kernel example
kernel_code = """
__global__ void double_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * 2.0f;
    }
}
"""

# Compile and run kernel
mod = SourceModule(kernel_code)
double_func = mod.get_function("double_array")

# Execute kernel
block_size = 256
grid_size = (len(cpu_array) + block_size - 1) // block_size
double_func(gpu_array, np.int32(len(cpu_array)), 
           block=(block_size, 1, 1), grid=(grid_size, 1))

Architecture

PyCUDA's layered architecture provides both low-level control and high-level convenience:

  • Driver Layer: Direct access to CUDA driver API with Pythonic error handling and memory management
  • Compiler Layer: Dynamic CUDA kernel compilation and module management with caching
  • GPUArray Layer: NumPy-like interface for GPU arrays with automatic memory management
  • Algorithm Layer: Pre-built kernels for common operations (elementwise, reduction, scan)
  • Utility Layer: Helper functions, memory pools, and device characterization tools

This design enables everything from simple array operations to complex custom kernel development, with automatic resource cleanup and comprehensive error checking throughout.
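
To make the layering concrete, here is one import per layer; the grouping is indicative, and the real module split is finer-grained than shown:

import pycuda.driver as cuda                      # Driver layer
from pycuda.compiler import SourceModule          # Compiler layer
import pycuda.gpuarray as gpuarray                # GPUArray layer
from pycuda.elementwise import ElementwiseKernel  # Algorithm layer
from pycuda.reduction import ReductionKernel      # Algorithm layer
from pycuda.scan import InclusiveScanKernel       # Algorithm layer
import pycuda.tools                               # Utility layer (memory pools, device data)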

Capabilities

Driver API

Low-level CUDA driver API access providing direct control over contexts, devices, memory, streams, and events. This forms the foundation for all GPU operations.

def init(flags: int = 0) -> None: ...
def mem_alloc(size: int) -> DeviceAllocation: ...
def mem_get_info() -> tuple[int, int]: ...
def memcpy_htod(dest: DeviceAllocation, src) -> None: ...
def memcpy_dtoh(dest, src: DeviceAllocation) -> None: ...
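
A minimal round trip through the driver layer, relying on pycuda.autoinit for context creation (the array shape and dtype here are illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

a = np.random.randn(4, 4).astype(np.float32)

# Allocate device memory and copy the host array into it
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

# Copy the data back into a fresh host buffer and verify
a_back = np.empty_like(a)
cuda.memcpy_dtoh(a_back, a_gpu)
assert np.allclose(a, a_back)

# Query free and total device memory, both in bytes
free_bytes, total_bytes = cuda.mem_get_info()
print(free_bytes, total_bytes)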


GPU Arrays

High-level NumPy-like interface for GPU arrays supporting arithmetic operations, slicing, broadcasting, and seamless interoperability with NumPy arrays.

class GPUArray:
    def __init__(self, shape, dtype, allocator=None): ...
    def get(self) -> np.ndarray: ...
    def set(self, ary: np.ndarray) -> None: ...
    def __add__(self, other): ...
    def __mul__(self, other): ...
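
A short sketch of the NumPy-like workflow, again assuming a context from pycuda.autoinit:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# Move data to the GPU, compute elementwise, and read the result back
x = gpuarray.to_gpu(np.arange(8, dtype=np.float32))
y = gpuarray.zeros(8, dtype=np.float32)
y.set(np.full(8, 10.0, dtype=np.float32))  # overwrite device contents from the host

z = x * x + y  # arithmetic executes on the GPU
print(z.get())  # [10. 11. 14. 19. 26. 35. 46. 59.]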


Kernel Compilation

Dynamic CUDA kernel compilation with source code generation, caching, and module management for both inline and file-based CUDA source code.

class SourceModule:
    def __init__(self, source: str, **kwargs): ...
    def get_function(self, name: str) -> Function: ...
    def get_global(self, name: str) -> tuple[DeviceAllocation, int]: ...
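
A sketch exercising both get_function and get_global; the kernel and the __device__ variable are invented for illustration:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__device__ float scale;              // module-level global variable
__global__ void apply_scale(float *a) {
    a[threadIdx.x] *= scale;
}
""")

# Write the module-level global, then launch the kernel
scale_ptr, scale_size = mod.get_global("scale")
cuda.memcpy_htod(scale_ptr, np.float32(3.0))

data = gpuarray.to_gpu(np.ones(16, dtype=np.float32))
apply_scale = mod.get_function("apply_scale")
apply_scale(data, block=(16, 1, 1), grid=(1, 1))
print(data.get())  # all elements are now 3.0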


Algorithm Kernels

Pre-built, optimized kernels for common parallel operations including element-wise operations, reductions, and prefix scans with automatic type handling.

class ElementwiseKernel:
    def __init__(self, arguments: str, operation: str, **kwargs): ...
    def __call__(self, *args, **kwargs): ...

class ReductionKernel:
    def __init__(self, dtype, neutral: str, reduce_expr: str, **kwargs): ...
    def __call__(self, gpu_array): ...
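
Two sketches following the patterns in PyCUDA's own documentation; the kernel names are illustrative:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.reduction import ReductionKernel

# Elementwise: z[i] = a*x[i] + b*y[i], generated and compiled once, reused freely
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a * x[i] + b * y[i]",
    "lin_comb")

x = gpuarray.to_gpu(np.random.randn(1000).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(1000).astype(np.float32))
z = gpuarray.empty_like(x)
lin_comb(np.float32(2.0), x, np.float32(3.0), y, z)

# Reduction: a dot product, i.e. map x[i]*y[i] then sum
dot = ReductionKernel(np.float32, neutral="0",
                      reduce_expr="a+b", map_expr="x[i]*y[i]",
                      arguments="float *x, float *y")
print(dot(x, y).get())  # 0-d array holding the scalar result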


Math Functions

CUDA math function wrappers providing GPU-accelerated mathematical operations for arrays including trigonometric, exponential, and logarithmic functions.

def sin(array, **kwargs): ...
def cos(array, **kwargs): ...
def exp(array, **kwargs): ...
def log(array, **kwargs): ...
def sqrt(array, **kwargs): ...
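
These wrappers live in the pycuda.cumath module and operate elementwise on GPUArrays:

import numpy as np
import pycuda.autoinit
import pycuda.cumath as cumath
import pycuda.gpuarray as gpuarray

# Verify sin(x)^2 + cos(x)^2 == 1 entirely on the GPU
x = gpuarray.to_gpu(np.linspace(0, np.pi, 5).astype(np.float32))
s = cumath.sin(x)
c = cumath.cos(x)
print((s * s + c * c).get())  # approximately all ones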


Random Number Generation

GPU-accelerated random number generation with support for various distributions and reproducible seeding for scientific computing applications.

def rand(shape, dtype=np.float32, stream=None): ...
def seed_getter_uniform(n: int): ...
def seed_getter_unique(n: int): ...
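
A sketch using pycuda.curandom: rand is the quick helper, while the seed getters are typically passed to a generator class such as XORWOWRandomNumberGenerator for reproducible streams:

import numpy as np
import pycuda.autoinit
import pycuda.curandom as curandom

# Quick helper: uniform samples in [0, 1)
u = curandom.rand((1000,), dtype=np.float32)
print(u.get().mean())  # roughly 0.5

# Generator object with explicit seeding control
gen = curandom.XORWOWRandomNumberGenerator(
    seed_getter=curandom.seed_getter_uniform)
n = gen.gen_normal((1000,), dtype=np.float32)
print(n.get().std())  # roughly 1.0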


OpenGL Interoperability

Integration with OpenGL for graphics programming, allowing sharing of buffer objects and textures between CUDA and OpenGL contexts.

def init() -> None: ...
def make_context(device: Device) -> Context: ...
class BufferObject: ...
class RegisteredBuffer: ...
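
A compressed sketch of the register/map/unmap cycle. It assumes an OpenGL context is already current (e.g. created via GLUT) and that buf is an existing GL buffer id, so it is not runnable standalone:

import pycuda.driver as cuda
import pycuda.gl as cudagl

cuda.init()
# Create a CUDA context that can share objects with the current GL context
ctx = cudagl.make_context(cuda.Device(0))

# buf: a GLuint previously created with glGenBuffers (assumed to exist)
reg_buf = cudagl.RegisteredBuffer(buf)

# Map the buffer to obtain a device pointer that CUDA kernels can use
mapping = reg_buf.map()
dev_ptr, size = mapping.device_ptr_and_size()
# ... launch kernels writing to dev_ptr ...
mapping.unmap()

ctx.pop()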


Common Types

class Device:
    @staticmethod
    def count() -> int: ...
    @staticmethod
    def get_device(device_no: int) -> "Device": ...
    def compute_capability(self) -> tuple[int, int]: ...
    def name(self) -> str: ...

class Context:
    def __init__(self, device: Device, flags: int = 0): ...
    def push(self) -> None: ...
    def pop(self) -> "Context": ...
    @staticmethod
    def get_device() -> Device: ...

class DeviceAllocation:
    def __int__(self) -> int: ...
    def __len__(self) -> int: ...

class Function:
    def __call__(self, *args, **kwargs) -> None: ...
    def prepare(self, arg_types) -> "PreparedFunction": ...

class Stream:
    def __init__(self, flags: int = 0): ...
    def synchronize(self) -> None: ...
    def is_done(self) -> bool: ...

class Event:
    def __init__(self, flags: int = 0): ...
    def record(self, stream: Stream = None) -> None: ...
    def synchronize(self) -> None: ...
    def query(self) -> bool: ...
    def time_since(self, start_event: "Event") -> float: ...
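
A sketch that uses two Event objects to time GPU work; Event.time_since reports milliseconds:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.random.randn(1 << 20).astype(np.float32))

start = cuda.Event()
end = cuda.Event()

start.record()     # enqueue start marker on the default stream
b = a * 2.0        # the GPU work being timed
end.record()       # enqueue end marker
end.synchronize()  # block until the GPU has passed the end marker

print("elapsed ms:", end.time_since(start))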