tessl/pypi-pycuda

Python wrapper for Nvidia CUDA parallel computation API with object cleanup, automatic error checking, and convenient abstractions.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/pycuda@2025.1.x

To install, run:

npx @tessl/cli install tessl/pypi-pycuda@2025.1.0

# PyCUDA

A comprehensive Python wrapper for Nvidia's CUDA parallel computation API that provides Pythonic access to GPU computing capabilities. PyCUDA offers object cleanup tied to object lifetime (the RAII pattern), automatic error checking that translates all CUDA errors into Python exceptions, and convenient abstractions such as GPUArray for GPU memory management.

## Package Information

- **Package Name**: pycuda
- **Language**: Python with C++ extensions
- **Installation**: `pip install pycuda`
- **Documentation**: https://documen.tician.de/pycuda
- **License**: MIT

## Core Imports

```python
import pycuda.driver as cuda
```

GPU array operations:

```python
import pycuda.gpuarray as gpuarray
```

Auto-initialization (convenient but less control):

```python
import pycuda.autoinit  # Automatically initializes a CUDA context
```

Kernel compilation:

```python
from pycuda.compiler import SourceModule
```
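When `pycuda.autoinit` offers too little control, the context can be managed by hand with the driver API. A minimal sketch of that pattern, assuming a CUDA-capable GPU at call time (which is why the driver import sits inside the function; `run_with_manual_context` is an illustrative name, not a PyCUDA API):

```python
def run_with_manual_context(device_no=0):
    """Initialize the driver, create a context on one device, and
    guarantee the context is released afterwards (requires a CUDA GPU)."""
    import pycuda.driver as cuda

    cuda.init()                  # must precede any other driver call
    dev = cuda.Device(device_no)
    ctx = dev.make_context()     # becomes the current context
    try:
        free, total = cuda.mem_get_info()  # do GPU work here
        return free, total
    finally:
        ctx.pop()                # release the context when done
```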

## Basic Usage

```python
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.compiler import SourceModule

# Create GPU array from NumPy array
cpu_array = np.array([1, 2, 3, 4, 5], dtype=np.float32)
gpu_array = gpuarray.to_gpu(cpu_array)

# Perform operations on GPU
result = gpu_array * 2.0

# Copy result back to CPU
cpu_result = result.get()
print(cpu_result)  # [2. 4. 6. 8. 10.]

# Manual kernel example
kernel_code = """
__global__ void double_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * 2.0f;
    }
}
"""

# Compile the kernel and fetch the function
mod = SourceModule(kernel_code)
double_func = mod.get_function("double_array")

# Execute kernel: launch enough blocks to cover every element
block_size = 256
grid_size = (len(cpu_array) + block_size - 1) // block_size
double_func(gpu_array, np.int32(len(cpu_array)),
            block=(block_size, 1, 1), grid=(grid_size, 1))
```
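The ceil-division used for `grid_size` above is the standard way to cover `n` elements with fixed-size thread blocks. A small helper makes the arithmetic reusable (a sketch; `launch_config` is not part of PyCUDA):

```python
def launch_config(n, block_size=256):
    """Return (grid, block) tuples that cover n elements with 1-D blocks."""
    grid_size = (n + block_size - 1) // block_size  # ceil(n / block_size)
    return (grid_size, 1), (block_size, 1, 1)

grid, block = launch_config(5)   # 5 elements fit in a single 256-thread block
print(grid, block)               # (1, 1) (256, 1, 1)
```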

## Architecture

PyCUDA's layered architecture provides both low-level control and high-level convenience:

- **Driver Layer**: Direct access to the CUDA driver API with Pythonic error handling and memory management
- **Compiler Layer**: Dynamic CUDA kernel compilation and module management with caching
- **GPUArray Layer**: NumPy-like interface for GPU arrays with automatic memory management
- **Algorithm Layer**: Pre-built kernels for common operations (elementwise, reduction, scan)
- **Utility Layer**: Helper functions, memory pools, and device characterization tools

This design enables everything from simple array operations to complex custom kernel development, with automatic resource cleanup and comprehensive error checking throughout.

## Capabilities

### Driver API

Low-level CUDA driver API access providing direct control over contexts, devices, memory, streams, and events. This forms the foundation for all GPU operations.

```python { .api }
def init(flags: int = 0) -> None: ...
def mem_alloc(size: int) -> DeviceAllocation: ...
def mem_get_info() -> tuple[int, int]: ...
def memcpy_htod(dest: DeviceAllocation, src) -> None: ...
def memcpy_dtoh(dest, src: DeviceAllocation) -> None: ...
```

[Driver API](./driver-api.md)
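The typical driver-level workflow is allocate, copy host-to-device, operate, copy device-to-host. A minimal round-trip sketch of that pattern (a CUDA GPU is required at call time, so the driver imports live inside the function; `roundtrip` is an illustrative name, not a PyCUDA API):

```python
import numpy as np

def roundtrip(host_array):
    """Copy a NumPy array to the GPU and back using raw driver calls."""
    import pycuda.autoinit           # context setup (requires a CUDA GPU)
    import pycuda.driver as cuda

    gpu_buf = cuda.mem_alloc(host_array.nbytes)  # raw device allocation
    cuda.memcpy_htod(gpu_buf, host_array)        # host -> device
    out = np.empty_like(host_array)
    cuda.memcpy_dtoh(out, gpu_buf)               # device -> host
    return out
```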

### GPU Arrays

High-level NumPy-like interface for GPU arrays supporting arithmetic operations, slicing, broadcasting, and seamless interoperability with NumPy arrays.

```python { .api }
class GPUArray:
    def __init__(self, shape, dtype, allocator=None): ...
    def get(self) -> np.ndarray: ...
    def set(self, ary: np.ndarray) -> None: ...
    def __add__(self, other): ...
    def __mul__(self, other): ...
```

[GPU Arrays](./gpu-arrays.md)
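Round-tripping through `to_gpu` and `get` is the core interoperability pattern. A minimal sketch (a CUDA GPU is needed at call time, so the imports sit inside the function; `add_on_gpu` is an illustrative name, not a PyCUDA API):

```python
import numpy as np

def add_on_gpu(a, b):
    """Add two NumPy arrays on the GPU and return the result as NumPy."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.gpuarray as gpuarray

    return (gpuarray.to_gpu(a) + gpuarray.to_gpu(b)).get()

# Example (on a GPU machine):
# add_on_gpu(np.ones(4, np.float32), np.arange(4, dtype=np.float32))
```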

### Kernel Compilation

Dynamic CUDA kernel compilation with source code generation, caching, and module management for both inline and file-based CUDA source code.

```python { .api }
class SourceModule:
    def __init__(self, source: str, **kwargs): ...
    def get_function(self, name: str) -> Function: ...
    def get_global(self, name: str) -> tuple[DeviceAllocation, int]: ...
```

[Kernel Compilation](./kernel-compilation.md)
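Because kernels are compiled from plain strings at runtime, source can be generated in Python before being handed to `SourceModule`. A sketch of templated generation (`make_saxpy_source` and its parameters are illustrative, not part of PyCUDA):

```python
def make_saxpy_source(dtype_c="float", name="saxpy"):
    """Generate CUDA C source for y[i] += a * x[i], parameterized by type."""
    return f"""
__global__ void {name}({dtype_c} a, {dtype_c} *x, {dtype_c} *y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        y[i] += a * x[i];
    }}
}}
"""

source = make_saxpy_source("double", name="daxpy")
# On a machine with CUDA:
# mod = SourceModule(source)
# fn = mod.get_function("daxpy")
```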

131

132

### Algorithm Kernels

133

134

Pre-built, optimized kernels for common parallel operations including element-wise operations, reductions, and prefix scans with automatic type handling.

135

136

```python { .api }

137

class ElementwiseKernel:

138

def __init__(self, arguments: str, operation: str, **kwargs): ...

139

def __call__(self, *args, **kwargs): ...

140

141

class ReductionKernel:

142

def __init__(self, dtype, neutral: str, reduce_expr: str, **kwargs): ...

143

def __call__(self, gpu_array): ...

144

```

145

146

[Algorithm Kernels](./algorithm-kernels.md)
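A `ReductionKernel` combines elements pairwise with `reduce_expr`, using the `neutral` element wherever a slot has no partner. A pure-Python reference of that tree-shaped contract, for intuition only (the real kernel runs in parallel on the GPU):

```python
def tree_reduce(values, neutral, op):
    """CPU reference for a parallel tree reduction with a neutral element."""
    level = list(values)
    if not level:
        return neutral
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with the neutral element
            level.append(neutral)
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

print(tree_reduce([1, 2, 3, 4, 5], 0, lambda a, b: a + b))  # 15
```

Because `op` is applied in a tree rather than left to right, the result matches a sequential reduction only when `op` is associative with `neutral` as identity, which is exactly what `ReductionKernel` requires of `reduce_expr`.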

147

148

### Math Functions

149

150

CUDA math function wrappers providing GPU-accelerated mathematical operations for arrays including trigonometric, exponential, and logarithmic functions.

151

152

```python { .api }

153

def sin(array, **kwargs): ...

154

def cos(array, **kwargs): ...

155

def exp(array, **kwargs): ...

156

def log(array, **kwargs): ...

157

def sqrt(array, **kwargs): ...

158

```

159

160

[Math Functions](./math-functions.md)
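These wrappers operate on whole GPU arrays at once, much like NumPy ufuncs. A minimal sketch, assuming the wrappers live in `pycuda.cumath` (the module name used by PyCUDA's docs) and a CUDA GPU at call time:

```python
import numpy as np

def gpu_sin(host_array):
    """Apply sin elementwise on the GPU and return a NumPy array."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.cumath as cumath      # assumed home of the math wrappers
    import pycuda.gpuarray as gpuarray

    return cumath.sin(gpuarray.to_gpu(host_array)).get()

# On a GPU machine this should agree with np.sin(host_array)
# to float32 precision.
```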

161

162

### Random Number Generation

163

164

GPU-accelerated random number generation with support for various distributions and reproducible seeding for scientific computing applications.

165

166

```python { .api }

167

def rand(shape, dtype=np.float32, stream=None): ...

168

def seed_getter_uniform(n: int): ...

169

def seed_getter_unique(n: int): ...

170

```

171

172

[Random Numbers](./random-numbers.md)
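`rand` fills a freshly allocated GPU array of the requested shape; samples never touch the host until you ask for them. A sketch, assuming these helpers live in `pycuda.curandom` (the module name used by PyCUDA's docs) and a CUDA GPU at call time:

```python
def gpu_uniform(shape):
    """Draw uniform [0, 1) float32 samples directly on the GPU."""
    import pycuda.autoinit              # context setup (requires a CUDA GPU)
    import pycuda.curandom as curandom  # assumed home of rand/seed helpers

    return curandom.rand(shape).get()   # copy the samples back as NumPy
```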

173

174

### OpenGL Interoperability

175

176

Integration with OpenGL for graphics programming, allowing sharing of buffer objects and textures between CUDA and OpenGL contexts.

177

178

```python { .api }

179

def init() -> None: ...

180

def make_context(device: Device) -> Context: ...

181

class BufferObject: ...

182

class RegisteredBuffer: ...

183

```

184

185

[OpenGL Integration](./opengl-integration.md)

## Common Types

```python { .api }
class Device:
    def count() -> int: ...
    def get_device(device_no: int) -> Device: ...
    def compute_capability(self) -> tuple[int, int]: ...
    def name(self) -> str: ...

class Context:
    def __init__(self, device: Device, flags: int = 0): ...
    def push(self) -> None: ...
    def pop(self) -> Context: ...
    def get_device() -> Device: ...

class DeviceAllocation:
    def __int__(self) -> int: ...
    def __len__(self) -> int: ...

class Function:
    def __call__(self, *args, **kwargs) -> None: ...
    def prepare(self, arg_types) -> PreparedFunction: ...

class Stream:
    def __init__(self, flags: int = 0): ...
    def synchronize(self) -> None: ...
    def is_done(self) -> bool: ...

class Event:
    def __init__(self, flags: int = 0): ...
    def record(self, stream: Stream = None) -> None: ...
    def synchronize(self) -> None: ...
    def query(self) -> bool: ...
    def time_since(self, start_event: Event) -> float: ...
```
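Events are the usual way to time GPU work: record one event before and one after, synchronize on the second, then take the difference. A sketch of that pattern (a CUDA GPU is required at call time, so the driver import sits inside the function; `time_gpu` is an illustrative name, not a PyCUDA API):

```python
def time_gpu(fn, *args, **kwargs):
    """Time a GPU operation in milliseconds using CUDA events."""
    import pycuda.driver as cuda

    start, end = cuda.Event(), cuda.Event()
    start.record()
    fn(*args, **kwargs)           # e.g. a compiled kernel launch
    end.record()
    end.synchronize()             # block until the GPU reaches `end`
    return end.time_since(start)  # elapsed milliseconds
```

Kernel launches are asynchronous, so the `synchronize()` on the end event is what makes the measurement meaningful.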