A high-throughput and memory-efficient inference and serving engine for LLMs
Overall score: 69%
Evaluation — 69%
↑ 1.33× agent success when using this tile
Build a text generation service that can dynamically serve multiple fine-tuned model variants using adapter modules. The service should accept generation requests that can optionally specify which adapter to use, allowing different users or applications to access specialized model behaviors without loading separate models.
The service must support:
Create test file test_lora_service.py with the following test cases:
@generates
class MultiAdapterService:
    """
    A text generation service supporting multiple adapter modules.
    """

    def __init__(
        self,
        model_name: str,
        enable_adapters: bool = True,
        max_adapters: int = 4,
        max_adapter_rank: int = 64
    ):
        """
        Initialize the multi-adapter service.

        Args:
            model_name: Name or path of the base model
            enable_adapters: Whether to enable adapter support
            max_adapters: Maximum number of concurrent adapters
            max_adapter_rank: Maximum rank for adapters
        """
        pass

    def generate(
        self,
        prompt: str,
        adapter_name: str | None = None,
        adapter_path: str | None = None,
        max_tokens: int = 100
    ) -> str:
        """
        Generate text completion for the given prompt.

        Args:
            prompt: Input text prompt
            adapter_name: Optional unique identifier for the adapter
            adapter_path: Optional filesystem path to the adapter module
            max_tokens: Maximum number of tokens to generate

        Returns:
            Generated text completion
        """
        pass

Provides high-performance LLM inference with adapter support.
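As a rough shape for an implementation, the skeleton above can be filled in with an in-memory adapter registry that enforces `max_adapters`. The generation backend below is a hypothetical stub for illustration, not vLLM itself, and the eviction policy (drop the oldest registered adapter) is an assumption:

```python
class AdapterError(ValueError):
    """Raised when an adapter request violates service limits."""


class MultiAdapterService:
    """Sketch: routes generation requests to named adapter modules."""

    def __init__(self, model_name, enable_adapters=True,
                 max_adapters=4, max_adapter_rank=64):
        self.model_name = model_name
        self.enable_adapters = enable_adapters
        self.max_adapters = max_adapters
        self.max_adapter_rank = max_adapter_rank
        self._adapters = {}  # adapter_name -> adapter_path

    def _register(self, name, path):
        # Register the adapter on first use, evicting the oldest
        # registered adapter once the concurrency limit is reached.
        if name not in self._adapters:
            if len(self._adapters) >= self.max_adapters:
                oldest = next(iter(self._adapters))
                del self._adapters[oldest]
            self._adapters[name] = path

    def generate(self, prompt, adapter_name=None, adapter_path=None,
                 max_tokens=100):
        if adapter_name is not None:
            if not self.enable_adapters:
                raise AdapterError("adapter support is disabled")
            self._register(adapter_name, adapter_path)
        # Stub backend: a real service would invoke the model here,
        # passing the resolved adapter to the inference engine.
        tag = adapter_name or "base"
        return f"[{tag}] completion for: {prompt}"
```

In a real deployment these parameters map naturally onto vLLM's LoRA support (`LLM(..., enable_lora=True, max_loras=..., max_lora_rank=...)` with a per-request `LoRARequest`); the stub above only demonstrates the routing and limit-enforcement logic.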
@satisfied-by
Install with Tessl CLI
npx tessl i tessl/pypi-vllm