Python wrapper for Nvidia CUDA parallel computation API with object cleanup, automatic error checking, and convenient abstractions.
Build a Python module that performs element-wise mathematical operations on vectors using GPU acceleration with automatic memory management between host and device.
The module should provide a function that takes host (CPU) arrays and performs GPU-accelerated operations with automatic data transfer handling. The implementation should:
result[i] = a[i] * b[i] + c[i]The function should have the following signature:
[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], and [7.0, 8.0, 9.0], the result is [11.0, 18.0, 27.0] @test[2.5], [4.0], and [1.0], the result is [11.0] @test1.0, 2.0, and 3.0 respectively, all result elements are 5.0 @test@generates
import numpy as np
def fused_multiply_add(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
"""
Performs element-wise fused multiply-add operation: result = a * b + c
Args:
a: First input array (float32)
b: Second input array (float32)
c: Third input array (float32)
Returns:
NumPy array containing the result of a * b + c
Note:
All input arrays must have the same length.
The computation is performed on the GPU with automatic memory transfers.
"""
passProvides GPU computing capabilities with automatic memory management features.
@satisfied-by
tessl i tessl/pypi-pycuda@2025.1.0docs
evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10