
tessl/pypi-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Overall score: 69% (Evaluation: 69%)

1.33x agent success when using this tile


Attention Backend Benchmark Tool

A Python utility that benchmarks LLM inference performance across different attention implementations to help users select the optimal backend for their hardware and workload.

Capabilities

Backend Initialization

  • It initializes an inference engine with a specified attention backend @test
  • It handles initialization with default backend when none is specified @test

Performance Comparison

  • It runs the same prompt through different attention backends and compares outputs @test
  • It measures generation time for a given backend configuration @test

Implementation

@generates

The implementation should:

  1. Initialize inference engines with different attention backend configurations
  2. Run inference workloads and measure performance metrics
  3. Compare results across backends
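vLLM selects its attention implementation from the `VLLM_ATTENTION_BACKEND` environment variable, which is read when the engine is constructed. A minimal sketch of step 1, assuming the backend names listed in the API below (`select_backend` is a hypothetical helper, not part of vLLM):

```python
import os
from typing import Optional

def select_backend(backend: Optional[str]) -> None:
    """Set or clear the attention backend for the next engine start.

    vLLM reads VLLM_ATTENTION_BACKEND at engine construction time,
    so this must run before the LLM object is created.
    """
    if backend is None:
        # Fall back to vLLM's automatic backend selection.
        os.environ.pop("VLLM_ATTENTION_BACKEND", None)
    else:
        os.environ["VLLM_ATTENTION_BACKEND"] = backend

select_backend("FLASH_ATTN")
print(os.environ["VLLM_ATTENTION_BACKEND"])  # FLASH_ATTN
```

Because the variable is only consulted at startup, comparing backends means constructing a fresh engine per backend rather than switching one live engine.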

API

from typing import Any, Dict, List, Optional
import time

class AttentionBackendBenchmark:
    """
    Benchmarks LLM inference performance across different attention backends.
    """

    def __init__(self, model: str):
        """
        Initialize the benchmark tool with a model.

        Args:
            model: Model name or path to use for benchmarking
        """
        pass

    def run_with_backend(
        self,
        prompt: str,
        attention_backend: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.0
    ) -> Dict[str, Any]:
        """
        Run inference with a specific attention backend and return results with timing.

        Args:
            prompt: Input prompt text
            attention_backend: Attention backend name (e.g., "FLASH_ATTN", "FLASHINFER", "XFORMERS")
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature

        Returns:
            Dictionary with keys: 'output' (generated text), 'time' (generation time in seconds),
            'backend' (backend used)
        """
        pass

    def compare_backends(
        self,
        prompt: str,
        backends: List[str],
        max_tokens: int = 100
    ) -> List[Dict[str, Any]]:
        """
        Compare multiple backends on the same prompt.

        Args:
            prompt: Input prompt text
            backends: List of backend names to compare
            max_tokens: Maximum tokens to generate

        Returns:
            List of result dictionaries, one per backend
        """
        pass
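The timing and comparison logic behind `run_with_backend` and `compare_backends` can be sketched independently of vLLM. The version below uses a stand-in generate function so it runs without an engine; a real implementation would build a vLLM engine per backend and call its generate method instead. `timed_run`, `compare`, and `fake_engine` are illustrative names, not part of the spec:

```python
import time
from typing import Callable, Dict, List, Optional

def timed_run(
    generate: Callable[[str], str],
    prompt: str,
    backend: Optional[str] = None,
) -> Dict[str, object]:
    """Time one generation call and package the result dict from the spec."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return {"output": output, "time": elapsed, "backend": backend or "default"}

def compare(
    make_generate: Callable[[Optional[str]], Callable[[str], str]],
    prompt: str,
    backends: List[str],
) -> List[Dict[str, object]]:
    """Run the same prompt once per backend and collect result dicts."""
    return [timed_run(make_generate(b), prompt, b) for b in backends]

# Stand-in generator so the sketch runs without vLLM installed;
# the real tool would construct an engine per backend here.
def fake_engine(backend):
    return lambda prompt: f"[{backend}] {prompt.upper()}"

results = compare(fake_engine, "hello", ["FLASH_ATTN", "XFORMERS"])
print([r["backend"] for r in results])  # ['FLASH_ATTN', 'XFORMERS']
```

With `temperature=0.0` (the spec's default), generation is greedy, so output text can be compared across backends for equivalence as well as speed.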

Dependencies { .dependencies }

vllm { .dependency }

Provides high-throughput LLM inference with custom attention mechanisms.

Install with Tessl CLI

npx tessl i tessl/pypi-vllm
