
tessl/pypi-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Overall score: 69% (Evaluation: 69%)

1.33x agent success when using this tile


Attention Backend Benchmark Tool

A Python utility that benchmarks LLM inference performance across different attention implementations to help users select the optimal backend for their hardware and workload.

Capabilities

Backend Initialization

  • It initializes an inference engine with a specified attention backend @test
  • It handles initialization with default backend when none is specified @test

Performance Comparison

  • It runs the same prompt through different attention backends and compares outputs @test
  • It measures generation time for a given backend configuration @test

Implementation

@generates

The implementation should:

  1. Initialize inference engines with different attention backend configurations
  2. Run inference workloads and measure performance metrics
  3. Compare results across backends
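vLLM selects its attention implementation from the `VLLM_ATTENTION_BACKEND` environment variable, which is read when the engine is constructed. A minimal sketch of step 1, assuming the backend names listed in the API below (`select_backend` is a hypothetical helper, not part of vLLM):

```python
import os
from typing import Optional

def select_backend(backend: Optional[str]) -> None:
    """Set or clear the attention backend for the next engine start.

    vLLM reads VLLM_ATTENTION_BACKEND at engine construction time,
    so this must run before the LLM object is created.
    """
    if backend is None:
        # Fall back to vLLM's automatic backend selection.
        os.environ.pop("VLLM_ATTENTION_BACKEND", None)
    else:
        os.environ["VLLM_ATTENTION_BACKEND"] = backend

select_backend("FLASH_ATTN")
print(os.environ["VLLM_ATTENTION_BACKEND"])  # FLASH_ATTN
```

Because the variable is only consulted at startup, comparing backends means constructing a fresh engine per backend rather than switching one live engine.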

API

from typing import Any, Dict, List, Optional
import time

class AttentionBackendBenchmark:
    """
    Benchmarks LLM inference performance across different attention backends.
    """

    def __init__(self, model: str):
        """
        Initialize the benchmark tool with a model.

        Args:
            model: Model name or path to use for benchmarking
        """
        pass

    def run_with_backend(
        self,
        prompt: str,
        attention_backend: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.0
    ) -> Dict[str, Any]:
        """
        Run inference with a specific attention backend and return results with timing.

        Args:
            prompt: Input prompt text
            attention_backend: Attention backend name (e.g., "FLASH_ATTN", "FLASHINFER", "XFORMERS")
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature

        Returns:
            Dictionary with keys: 'output' (generated text), 'time' (generation time in seconds),
            'backend' (backend used)
        """
        pass

    def compare_backends(
        self,
        prompt: str,
        backends: List[str],
        max_tokens: int = 100
    ) -> List[Dict[str, Any]]:
        """
        Compare multiple backends on the same prompt.

        Args:
            prompt: Input prompt text
            backends: List of backend names to compare
            max_tokens: Maximum tokens to generate

        Returns:
            List of result dictionaries, one per backend
        """
        pass
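The timing and comparison logic behind `run_with_backend` and `compare_backends` can be sketched independently of vLLM. The version below uses a stand-in generate function so it runs without an engine; a real implementation would build a vLLM engine per backend and call its generate method instead. `timed_run`, `compare`, and `fake_engine` are illustrative names, not part of the spec:

```python
import time
from typing import Callable, Dict, List, Optional

def timed_run(
    generate: Callable[[str], str],
    prompt: str,
    backend: Optional[str] = None,
) -> Dict[str, object]:
    """Time one generation call and package the result dict from the spec."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return {"output": output, "time": elapsed, "backend": backend or "default"}

def compare(
    make_generate: Callable[[Optional[str]], Callable[[str], str]],
    prompt: str,
    backends: List[str],
) -> List[Dict[str, object]]:
    """Run the same prompt once per backend and collect result dicts."""
    return [timed_run(make_generate(b), prompt, b) for b in backends]

# Stand-in generator so the sketch runs without vLLM installed;
# the real tool would construct an engine per backend here.
def fake_engine(backend):
    return lambda prompt: f"[{backend}] {prompt.upper()}"

results = compare(fake_engine, "hello", ["FLASH_ATTN", "XFORMERS"])
print([r["backend"] for r in results])  # ['FLASH_ATTN', 'XFORMERS']
```

With `temperature=0.0` (the spec's default), generation is greedy, so output text can be compared across backends for equivalence as well as speed.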

Dependencies { .dependencies }

vllm { .dependency }

Provides high-throughput LLM inference with custom attention mechanisms.

Install with Tessl CLI

npx tessl i tessl/pypi-vllm
