A high-throughput and memory-efficient inference and serving engine for LLMs
npx @tessl/cli install tessl/pypi-vllm@0.10.00
# vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It provides advanced techniques like continuous batching, paged attention, and GPU memory optimization to maximize inference speed while minimizing memory footprint, making it ideal for production deployments.

## Package Information

- **Package Name**: vllm
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install vllm`
- **Repository**: https://github.com/vllm-project/vllm

## Core Imports

```python
import vllm
```

For the main LLM interface:

```python
from vllm import LLM, SamplingParams
```

For output types:

```python
from vllm import (
    RequestOutput, CompletionOutput,
    EmbeddingRequestOutput, ClassificationRequestOutput,
    ScoringRequestOutput, PoolingRequestOutput
)
```

For async usage:

```python
from vllm import AsyncLLMEngine, AsyncEngineArgs
```

For additional parameters:

```python
from vllm import PoolingParams
from vllm.lora.request import LoRARequest
from vllm.sampling_params import GuidedDecodingParams, BeamSearchParams
```

## Basic Usage

```python
from vllm import LLM, SamplingParams

# Create an LLM instance
llm = LLM(model="microsoft/DialoGPT-medium")

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate text
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Architecture

vLLM's architecture is built around several key components:

- **LLM Class**: High-level interface providing simple text generation, chat completions, embeddings, and classification
- **Engine Layer**: Core inference engines (LLMEngine for sync, AsyncLLMEngine for async) managing request scheduling and batching
- **Model Executor**: Distributed model execution with support for tensor parallelism and pipeline parallelism
- **Memory Management**: Advanced memory optimization with paged attention and efficient KV cache management
- **Sampling System**: Sophisticated sampling strategies including beam search, guided decoding, and structured output generation

This design enables high-throughput serving with intelligent request batching, efficient resource utilization across multiple GPUs, and support for advanced inference techniques that are critical for production LLM deployments.

## Capabilities

### Text Generation

Primary text generation functionality with support for various prompt formats, sampling parameters, and generation strategies including beam search and guided decoding.

```python { .api }
class LLM:
    def generate(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        sampling_params: Optional[Union[SamplingParams, Sequence[SamplingParams]]] = None,
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,
        priority: Optional[List[int]] = None
    ) -> List[RequestOutput]: ...
```

[Text Generation](./text-generation.md)

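Beyond the defaults shown in Basic Usage, `SamplingParams` can also constrain outputs via guided decoding. A minimal sketch, assuming the `GuidedDecodingParams` import shown above is passed through the `guided_decoding` field of `SamplingParams`; the model name simply reuses the example model from this document:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="microsoft/DialoGPT-medium")

# Constrain the completion to one of a fixed set of choices (guided decoding).
guided = GuidedDecodingParams(choice=["positive", "negative", "neutral"])
params = SamplingParams(temperature=0.0, max_tokens=5, guided_decoding=guided)

outputs = llm.generate(["The movie was great. Sentiment:"], params)
print(outputs[0].outputs[0].text)
```
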
### Chat Completions

Conversational AI interface supporting chat templates, tool calling, and multi-turn conversations with proper message formatting and context management.

```python { .api }
class LLM:
    def chat(
        self,
        messages: Union[list[ChatCompletionMessageParam], list[list[ChatCompletionMessageParam]]],
        sampling_params: Optional[Union[SamplingParams, list[SamplingParams]]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[LoRARequest] = None,
        chat_template: Optional[str] = None,
        chat_template_content_format: ChatTemplateContentFormatOption = "auto",
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tools: Optional[list[dict[str, Any]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None
    ) -> list[RequestOutput]: ...
```

[Chat Completions](./chat-completions.md)

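A short usage sketch: `messages` follow the OpenAI-style role/content format and the model's own chat template is applied automatically. The model name below is a placeholder for whichever instruction-tuned model you serve:

```python
from vllm import LLM, SamplingParams

# Placeholder model name: substitute any chat/instruct model supported by vLLM.
llm = LLM(model="your-org/your-chat-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain paged attention in one sentence."},
]

outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```
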
### Text Embeddings

Text encoding and embedding generation for semantic similarity, retrieval applications, and downstream NLP tasks with support for various pooling strategies.

```python { .api }
class LLM:
    def encode(
        self,
        prompts: Union[PromptType, Sequence[PromptType], DataPrompt],
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        pooling_task: PoolingTask = "encode",
        tokenization_kwargs: Optional[dict[str, Any]] = None
    ) -> list[PoolingRequestOutput]: ...
```

[Text Embeddings](./text-embeddings.md)

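A minimal sketch of `encode`, assuming an embedding-capable model (the name below is a placeholder); each result is a `PoolingRequestOutput` whose `outputs` field carries the pooled representation:

```python
from vllm import LLM

# Placeholder: any pooling/embedding model supported by vLLM.
llm = LLM(model="your-org/your-embedding-model")

outputs = llm.encode(["vLLM is fast.", "Paged attention saves memory."])
for out in outputs:
    print(out.outputs)  # PoolingOutput holding the pooled representation
```
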
### Text Classification

Text classification functionality with predefined class labels, supporting various pooling strategies and confidence scoring for categorization tasks.

```python { .api }
class LLM:
    def classify(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[ClassificationRequestOutput]: ...
```

[Text Classification](./text-classification.md)

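A sketch of `classify`, assuming a sequence-classification model (placeholder name); per the `ClassificationOutput` type documented in the Types section, each result carries per-class probabilities in `outputs.probs`:

```python
from vllm import LLM

# Placeholder: any sequence-classification model supported by vLLM.
llm = LLM(model="your-org/your-classifier-model")

results = llm.classify(["I loved this film.", "Terrible service."])
for res in results:
    print(res.outputs.probs)  # one probability per class label
```
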
### Text Scoring

Text similarity and likelihood scoring for comparing text pairs, ranking, and evaluation tasks with support for various scoring methods.

```python { .api }
class LLM:
    def score(
        self,
        data_1: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],
        data_2: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],
        /,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[PoolingParams] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[ScoringRequestOutput]: ...
```

[Text Scoring](./text-scoring.md)

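A sketch of `score`, comparing `data_1` against `data_2` (for example a query against candidate passages) with a cross-encoder-style model (placeholder name). Each `ScoringRequestOutput` exposes the score via `outputs.score`, as defined in the Types section:

```python
from vllm import LLM

# Placeholder: any cross-encoder / scoring model supported by vLLM.
llm = LLM(model="your-org/your-scoring-model")

query = "What is paged attention?"
passages = [
    "Paged attention manages the KV cache in fixed-size blocks.",
    "The Eiffel Tower is in Paris.",
]

results = llm.score(query, passages)
for passage, res in zip(passages, results):
    print(f"{res.outputs.score:.3f}  {passage}")
```
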
### Embedding Generation

Specialized embedding generation method for encoding text into vector representations with automatic normalization and optimal pooling strategies.

```python { .api }
class LLM:
    def embed(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[EmbeddingRequestOutput]: ...
```

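A sketch of `embed`, assuming an embedding model (placeholder name); each result is an `EmbeddingRequestOutput` whose `outputs.embedding` is a plain list of floats, as documented in the Types section:

```python
from vllm import LLM

# Placeholder: any embedding model supported by vLLM.
llm = LLM(model="your-org/your-embedding-model")

outputs = llm.embed(["vLLM is fast.", "Paged attention saves memory."])
vectors = [out.outputs.embedding for out in outputs]
print(len(vectors), len(vectors[0]))  # number of prompts, embedding dimension
```
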
### Beam Search Generation

Advanced beam search generation for exploring multiple generation paths and finding high-quality outputs through systematic search.

```python { .api }
class LLM:
    def beam_search(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        params: BeamSearchParams
    ) -> list[BeamSearchOutput]: ...
```

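A sketch of `beam_search`. It assumes `BeamSearchParams` (imported above) carries at least `beam_width` and `max_tokens`, and passes the prompt as a `TextPrompt`-style dict to match the signature shown:

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="microsoft/DialoGPT-medium")

params = BeamSearchParams(beam_width=4, max_tokens=32)
outputs = llm.beam_search([{"prompt": "The capital of France is"}], params)

# Each output holds the explored beam sequences with their cumulative log-probabilities.
for seq in outputs[0].sequences:
    print(seq.cumulative_logprob, seq.text)
```
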
### Reward Modeling

Generate reward scores for text evaluation, preference learning, and RLHF applications.

```python { .api }
class LLM:
    def reward(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[PoolingRequestOutput]: ...
```

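A minimal sketch, assuming a reward model (placeholder name); `reward` returns `PoolingRequestOutput` objects as documented above:

```python
from vllm import LLM

# Placeholder: any reward model supported by vLLM.
llm = LLM(model="your-org/your-reward-model")

outputs = llm.reward(["Question: What is 2 + 2? Answer: 4"])
print(outputs[0].outputs)  # pooled reward output for the prompt
```
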
### Asynchronous Inference

High-performance asynchronous inference engine for concurrent request handling, streaming responses, and integration with async frameworks.

```python { .api }
class AsyncLLMEngine:
    async def generate(
        self,
        prompt: Optional[str],
        sampling_params: SamplingParams,
        request_id: str,
        prompt_token_ids: Optional[List[int]] = None,
        lora_request: Optional[LoRARequest] = None
    ) -> AsyncIterator[RequestOutput]: ...
```

[Asynchronous Inference](./async-inference.md)

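A minimal streaming sketch: build the engine from `AsyncEngineArgs`, submit a request with a unique `request_id`, and iterate over the incremental `RequestOutput`s. It assumes the engine is created inside a running event loop:

```python
import asyncio
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="microsoft/DialoGPT-medium")
    )
    params = SamplingParams(temperature=0.8, max_tokens=64)

    # Each request needs a unique id; outputs stream in as tokens are generated.
    async for output in engine.generate("The future of AI is", params, request_id="req-0"):
        print(output.outputs[0].text)

asyncio.run(main())
```
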
### Configuration and Engine Management

Comprehensive configuration system for model loading, distributed execution, memory management, and performance tuning across various deployment scenarios.

```python { .api }
class EngineArgs:
    model: str
    tokenizer: Optional[str] = None
    tokenizer_mode: str = "auto"
    trust_remote_code: bool = False
    tensor_parallel_size: int = 1
    dtype: str = "auto"
    quantization: Optional[str] = None
    max_model_len: Optional[int] = None
    gpu_memory_utilization: float = 0.9
    # ... many more configuration options
```

[Configuration](./configuration.md)

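The same fields are accepted as keyword arguments by the `LLM` constructor (and by `AsyncEngineArgs` for the async engine), so a tuned offline instance looks like the sketch below; the specific values are illustrative only:

```python
from vllm import LLM

llm = LLM(
    model="microsoft/DialoGPT-medium",
    dtype="float16",               # numeric precision for weights and activations
    max_model_len=1024,            # cap the context length
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may claim
    tensor_parallel_size=1,        # shard the model across this many GPUs
)
```
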
### Parameters and Data Types

Essential parameter classes and data types for controlling generation behavior, defining inputs and outputs, and managing model configurations.

```python { .api }
class SamplingParams:
    n: int = 1
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    stop: Optional[Union[str, List[str]]] = None
    max_tokens: Optional[int] = None
    # ... many more sampling parameters
```

[Parameters and Types](./parameters-types.md)

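A small sketch combining the fields above: request several candidate completions per prompt and stop generation at a delimiter:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/DialoGPT-medium")

params = SamplingParams(
    n=3,               # return three candidate completions per prompt
    temperature=0.9,
    top_p=0.95,
    max_tokens=48,
    stop=["\n\n"],     # stop at the first blank line
)

for completion in llm.generate(["Write a tagline for a coffee shop:"], params)[0].outputs:
    print(completion.index, completion.text)
```
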
### Tokenizer Management

Access and manage tokenizers with support for LoRA adapters and custom tokenization.

```python { .api }
class LLM:
    def get_tokenizer(self, lora_request: Optional[LoRARequest] = None) -> "PreTrainedTokenizerBase":
        """
        Get tokenizer with optional LoRA adapter support.

        Parameters:
        - lora_request: Optional LoRA adapter configuration

        Returns:
        Tokenizer instance configured for this model
        """

    def set_tokenizer(self, tokenizer: "PreTrainedTokenizerBase") -> None:
        """
        Set a custom tokenizer for this LLM instance.

        Parameters:
        - tokenizer: Custom tokenizer to use for this model
        """

    def get_default_sampling_params(self) -> SamplingParams:
        """
        Get default sampling parameters from model configuration.

        Returns:
        Default SamplingParams instance based on model config
        """
```

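A short sketch: the returned tokenizer is the underlying Hugging Face tokenizer, so it can be used to inspect token counts before submitting prompts:

```python
from vllm import LLM

llm = LLM(model="microsoft/DialoGPT-medium")

tokenizer = llm.get_tokenizer()
token_ids = tokenizer.encode("Hello, my name is")
print(len(token_ids), token_ids)

# Defaults derived from the model's generation config, if present.
print(llm.get_default_sampling_params())
```
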
### Advanced Model Management

Direct model access, profiling, and distributed computing capabilities.

```python { .api }
class LLM:
    def collective_rpc(
        self,
        method: str,
        timeout: Optional[float] = None,
        args: tuple[Any, ...] = (),
        kwargs: Optional[dict[str, Any]] = None
    ) -> list[Any]:
        """
        Execute RPC calls on all model workers.

        Parameters:
        - method: Method name to call on workers
        - timeout: Optional timeout for RPC calls
        - args: Positional arguments for method
        - kwargs: Keyword arguments for method

        Returns:
        List of results from all workers
        """

    def apply_model(self, func: Callable) -> list[Any]:
        """
        Apply function directly to model in each worker.

        Parameters:
        - func: Function to apply to model instances

        Returns:
        Results from applying function to all model instances
        """

    def start_profile(self) -> None:
        """Start performance profiling for this LLM instance."""

    def stop_profile(self) -> None:
        """Stop performance profiling and save results."""

    def reset_prefix_cache(self, device: Optional[Union[str, int]] = None) -> None:
        """
        Reset prefix cache for memory optimization.

        Parameters:
        - device: Optional device specification for cache reset
        """
```

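A sketch using `apply_model` and the profiling hooks documented above. The parameter-count lambda is only an illustration of running arbitrary code against the worker-resident model, and the profiling calls assume a profiler output directory has been configured (for example via the `VLLM_TORCH_PROFILER_DIR` environment variable):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/DialoGPT-medium")

# Count parameters directly on each worker's model replica.
num_params = llm.apply_model(lambda model: sum(p.numel() for p in model.parameters()))
print(num_params)

# Profile a generation run (results are written by the profiler backend).
llm.start_profile()
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()
```
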
### Resource Management

Control engine sleep/wake states and retrieve metrics for monitoring.

```python { .api }
class LLM:
    def sleep(self, level: int = 1) -> None:
        """
        Put engine to sleep to free resources.

        Parameters:
        - level: Sleep level (1=light, 2=deep)
        """

    def wake_up(self, tags: Optional[list[str]] = None) -> None:
        """
        Wake up sleeping engine.

        Parameters:
        - tags: Optional tags for selective wake-up
        """

    def get_metrics(self) -> dict[str, Any]:
        """
        Get Prometheus metrics for monitoring (V1 engine only).

        Returns:
        Dictionary of metrics and values
        """
```

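A sketch of the sleep/wake cycle; freeing and restoring resources this way is assumed to require constructing the engine with `enable_sleep_mode=True`:

```python
from vllm import LLM, SamplingParams

# Sleep mode must be enabled at construction time for sleep()/wake_up() to work.
llm = LLM(model="microsoft/DialoGPT-medium", enable_sleep_mode=True)

llm.generate(["Hello"], SamplingParams(max_tokens=8))

llm.sleep(level=1)   # free resources between bursts of traffic
llm.wake_up()        # restore the engine before serving again

print(llm.get_metrics())  # Prometheus-style metrics (V1 engine only)
```
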
### Chat Message Processing

Preprocess chat messages into standardized prompt format.

```python { .api }
class LLM:
    def preprocess_chat(
        self,
        messages: List[ChatCompletionMessageParam],
        lora_request: Optional[LoRARequest] = None,
        chat_template: Optional[str] = None,
        chat_template_content_format: ChatTemplateContentFormatOption = "auto",
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tools: Optional[list[dict[str, Any]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None
    ) -> "TokensPrompt":
        """
        Preprocess chat messages into TokensPrompt format.

        Parameters:
        - messages: List of chat completion messages
        - lora_request: Optional LoRA adapter configuration
        - chat_template: Optional custom chat template
        - chat_template_content_format: Content format option
        - add_generation_prompt: Whether to add generation prompt
        - continue_final_message: Whether to continue final message
        - tools: Optional list of available tools
        - chat_template_kwargs: Additional template arguments
        - mm_processor_kwargs: Multimodal processor arguments

        Returns:
        Preprocessed TokensPrompt ready for generation
        """
```

### Model Registry and Discovery

Model registry system for discovering supported model architectures, checking model capabilities, and managing model metadata.

```python { .api }
class ModelRegistry:
    @staticmethod
    def get_supported_archs() -> list[str]:
        """Get list of supported model architectures."""

    @staticmethod
    def get_supported_models() -> list[str]:
        """Get list of all supported model names."""

    @staticmethod
    def get_model_info(model_arch: str) -> dict[str, Any]:
        """
        Get detailed information about a model architecture.

        Parameters:
        - model_arch: Model architecture name

        Returns:
        Dictionary with model architecture details
        """

    @staticmethod
    def is_text_generation_model(model_arch: str) -> bool:
        """Check if model supports text generation."""

    @staticmethod
    def is_embedding_model(model_arch: str) -> bool:
        """Check if model supports embeddings."""

    @staticmethod
    def is_multimodal_model(model_arch: str) -> bool:
        """Check if model supports multimodal inputs."""
```

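A sketch: the registry can be queried before loading anything, for example to check whether an architecture name taken from a model's `config.json` is supported and what it can do:

```python
from vllm import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print(len(archs), archs[:5])

arch = "LlamaForCausalLM"
print(ModelRegistry.is_text_generation_model(arch))
print(ModelRegistry.is_multimodal_model(arch))
```
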
### Distributed Computing Utilities

Ray-based distributed computing initialization and management for multi-node inference deployments.

```python { .api }
def initialize_ray_cluster(
    parallel_config: ParallelConfig,
    engine_use_ray: bool = False,
    ray_address: Optional[str] = None
) -> None:
    """
    Initialize Ray cluster for distributed inference.

    Parameters:
    - parallel_config: Parallelism configuration
    - engine_use_ray: Whether engine uses Ray
    - ray_address: Ray cluster address
    """
```

### Command Line Interface

CLI entry point for vLLM server and utilities, providing OpenAI-compatible API server and benchmarking tools.

```bash
# Start OpenAI-compatible API server
vllm serve microsoft/DialoGPT-medium --host 0.0.0.0 --port 8000

# Run performance benchmarks
vllm benchmark --model microsoft/DialoGPT-medium --input-len 512 --output-len 128
```

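Once `vllm serve` is running, the server speaks the OpenAI API, so any OpenAI-compatible client can talk to it. A sketch using the `openai` Python package (a separate dependency, not part of vLLM) pointed at the local endpoint:

```python
from openai import OpenAI  # third-party client, installed separately

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="microsoft/DialoGPT-medium",   # the model name passed to `vllm serve`
    prompt="Hello, my name is",
    max_tokens=32,
)
print(response.choices[0].text)
```
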
### Version and Metadata

Package version information and backward compatibility utilities.

```python { .api }
__version__: str  # Package version string
__version_tuple__: Tuple[int, int, int]  # Version as tuple

def bc_linter_skip(func):
    """Skip backward compatibility linting for function."""

def bc_linter_include(func):
    """Include function in backward compatibility linting."""
```

## Types

```python { .api }
class RequestOutput:
    request_id: str
    prompt: Optional[str]
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput]
    finished: bool
    metrics: Optional[RequestMetrics] = None
    lora_request: Optional[LoRARequest] = None

class CompletionOutput:
    index: int
    text: str
    token_ids: list[int]
    cumulative_logprob: Optional[float]
    logprobs: Optional[SampleLogprobs]
    finish_reason: Optional[str] = None
    stop_reason: Union[int, str, None] = None
    lora_request: Optional[LoRARequest] = None

class PoolingRequestOutput:
    id: str
    outputs: PoolingOutput
    prompt_token_ids: list[int]
    finished: bool

class EmbeddingRequestOutput:
    id: str
    outputs: EmbeddingOutput
    prompt_token_ids: list[int]
    finished: bool

class ClassificationRequestOutput:
    id: str
    outputs: ClassificationOutput
    prompt_token_ids: list[int]
    finished: bool

class ScoringRequestOutput:
    id: str
    outputs: ScoringOutput
    prompt_token_ids: list[int]
    finished: bool

class EmbeddingOutput:
    embedding: list[float]

class ClassificationOutput:
    probs: list[float]

class ScoringOutput:
    score: float

class BeamSearchOutput:
    sequences: list[BeamSearchSequence]
    finished: bool

class BeamSearchSequence:
    text: str
    token_ids: list[int]
    cumulative_logprob: float

# Prompt input types
class DataPrompt(TypedDict):
    data: Any
    data_format: str

class EmbedsPrompt(TypedDict):
    prompt_embeds: "torch.Tensor"
    cache_salt: NotRequired[str]

class ExplicitEncoderDecoderPrompt(TypedDict):
    encoder_prompt: Any
    decoder_prompt: Optional[Any]
    mm_processor_kwargs: NotRequired[dict[str, Any]]

class TextPrompt(TypedDict):
    prompt: str
    multi_modal_data: Optional[MultiModalDataDict]
    multi_modal_uuids: NotRequired["MultiModalUUIDDict"]
    cache_salt: NotRequired[str]

class TokensPrompt(TypedDict):
    prompt_token_ids: list[int]
    prompt: NotRequired[str]
    token_type_ids: NotRequired[list[int]]
    multi_modal_data: Optional[MultiModalDataDict]
    multi_modal_uuids: NotRequired["MultiModalUUIDDict"]
    cache_salt: NotRequired[str]

# Core enums
class SamplingType(IntEnum):
    GREEDY = 0
    RANDOM = 1
    RANDOM_SEED = 2

class RequestOutputKind(Enum):
    CUMULATIVE = 0   # Return entire output so far
    DELTA = 1        # Return only deltas
    FINAL_ONLY = 2   # Do not return intermediate output

# Type aliases
PromptType = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt, ExplicitEncoderDecoderPrompt]
SingletonPrompt = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt]
Sequence = Union[list, tuple]
PoolingTask = Literal["encode", "embed", "classify", "reward", "score"]
ChatTemplateContentFormatOption = Literal["auto", "string", "openai"]

# Utility functions
def is_tokens_prompt(prompt: SingletonPrompt) -> "TypeIs[TokensPrompt]": ...
def is_embeds_prompt(prompt: SingletonPrompt) -> "TypeIs[EmbedsPrompt]": ...
```