
Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/cupy-cuda101@9.6.x (tile.json)

tessl/pypi-cupy-cuda101

tessl install tessl/pypi-cupy-cuda101@9.6.0

CuPy: NumPy & SciPy for GPU (CUDA 10.1 version)

Agent Success: 87% (agent success rate when using this tile)
Improvement: 1.19x (vs. the baseline agent success rate)
Baseline: 73% (agent success rate without this tile)

evals/scenario-4/task.md

GPU Matrix Computation Pipeline Optimization

Overview

Build a matrix computation pipeline that performs a sequence of operations repeatedly on GPU arrays. The pipeline should optimize performance by reducing kernel launch overhead for the repeated execution pattern.

Background

When the same sequence of GPU operations is executed repeatedly with minimal variations, there are opportunities to reduce overhead. Your task is to implement a computation pipeline that can efficiently handle repeated execution of a fixed sequence of matrix operations.

Requirements

Input Processing

Implement a function process_matrices(a, b, c, num_iterations) that:

  • Takes three GPU arrays a, b, and c (all 1000x1000 float32 matrices)
  • Performs the following sequence of operations for num_iterations times:
    1. Multiply matrix a by matrix b to get intermediate result temp1
    2. Add c to temp1 to get temp2
    3. Compute element-wise square of temp2 to get temp3
    4. Update a with the result: a = temp3
  • Returns the final value of a after all iterations
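
The required semantics can be sketched as a plain CPU reference with NumPy (small 4x4 matrices for illustration; the function name `process_matrices_reference` is a placeholder, not part of the deliverable):

```python
import numpy as np

def process_matrices_reference(a, b, c, num_iterations):
    """CPU reference for the pipeline: a = ((a @ b) + c) ** 2, repeated."""
    for _ in range(num_iterations):
        temp1 = a @ b          # step 1: matrix multiply
        temp2 = temp1 + c      # step 2: add c
        a = temp2 ** 2         # steps 3-4: elementwise square, update a
    return a

# Small example (4x4 instead of 1000x1000) to show the shape contract
rng = np.random.default_rng(0)
a = rng.random((4, 4), dtype=np.float32)
b = rng.random((4, 4), dtype=np.float32)
c = rng.random((4, 4), dtype=np.float32)
out = process_matrices_reference(a, b, c, num_iterations=3)
print(out.shape)  # (4, 4)
```

Note that `a` is rebound, not mutated in place, so the caller's original array is untouched.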

Performance Optimization

The implementation should minimize kernel launch overhead when executing the same operation sequence repeatedly. The operations form a fixed computation pattern that does not change between iterations.

Correctness

The function must produce mathematically correct results: the final matrix must match what would be obtained by executing the operations in sequence.

Dependencies { .dependencies }

cupy-cuda101 { .dependency }

Provides GPU-accelerated array operations and CUDA programming capabilities.

Test Cases

Test 1: Basic Functionality @test

File: test_matrix_pipeline.py { .test }

import cupy as cp
import numpy as np

# process_matrices is the function under test; the module name below is a
# placeholder -- adjust it to wherever your implementation lives.
from matrix_pipeline import process_matrices

def test_basic_pipeline():
    """Test that the pipeline produces correct results"""
    cp.random.seed(42)
    a = cp.random.rand(1000, 1000).astype(cp.float32)
    b = cp.random.rand(1000, 1000).astype(cp.float32)
    c = cp.random.rand(1000, 1000).astype(cp.float32)

    # Save initial values
    a_init = a.copy()
    b_init = b.copy()
    c_init = c.copy()

    # With float32 and U(0,1) inputs, the squared values overflow to inf
    # after roughly 4 iterations, so use 3 to keep the checks meaningful.
    result = process_matrices(a, b, c, num_iterations=3)

    # Verify result is not None
    assert result is not None

    # Verify result has correct shape
    assert result.shape == (1000, 1000)

    # Verify the computation against a straightforward reference loop
    expected = a_init
    for _ in range(3):
        expected = (cp.matmul(expected, b_init) + c_init) ** 2

    assert cp.all(cp.isfinite(result))
    assert cp.allclose(result, expected, rtol=1e-2)

Test 2: Multiple Iterations @test

File: test_matrix_pipeline.py { .test }

def test_multiple_iterations():
    """Test that multiple iterations work correctly"""
    cp.random.seed(123)
    a = cp.random.rand(1000, 1000).astype(cp.float32)
    b = cp.random.rand(1000, 1000).astype(cp.float32)
    c = cp.random.rand(1000, 1000).astype(cp.float32)

    # float32 overflows to inf after roughly 4 iterations with these
    # inputs, so 3 iterations is the most that stays finite.
    result = process_matrices(a, b, c, num_iterations=3)

    # Verify result shape and finiteness
    assert result.shape == (1000, 1000)
    assert cp.all(cp.isfinite(result))

    # Values should have grown significantly across iterations
    assert cp.max(result) > 1.0

Deliverables

  1. Implementation of process_matrices() function
  2. Test file test_matrix_pipeline.py with the provided test cases
  3. The implementation should handle the repeated execution pattern efficiently

Notes

  • Focus on correctness first, then optimize for the repeated execution pattern
  • All operations should execute on the GPU
  • The computation pattern is fixed and does not change between iterations