A high-throughput and memory-efficient inference and serving engine for LLMs
npx @tessl/cli install tessl/pypi-vllm@0.10.00
# vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It provides advanced techniques like continuous batching, paged attention, and GPU memory optimization to maximize inference speed while minimizing memory footprint, making it ideal for production deployments.

## Package Information

- **Package Name**: vllm
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install vllm`
- **Repository**: https://github.com/vllm-project/vllm

## Core Imports

```python
import vllm
```

For the main LLM interface:

```python
from vllm import LLM, SamplingParams
```

For output types:

```python
from vllm import (
    RequestOutput, CompletionOutput,
    EmbeddingRequestOutput, ClassificationRequestOutput,
    ScoringRequestOutput, PoolingRequestOutput
)
```

For async usage:

```python
from vllm import AsyncLLMEngine, AsyncEngineArgs
```

For additional parameters:

```python
from vllm import PoolingParams
from vllm.lora.request import LoRARequest
from vllm.sampling_params import GuidedDecodingParams, BeamSearchParams
```

## Basic Usage

```python
from vllm import LLM, SamplingParams

# Create an LLM instance
llm = LLM(model="microsoft/DialoGPT-medium")

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate text
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Architecture

vLLM's architecture is built around several key components:

- **LLM Class**: High-level interface providing simple text generation, chat completions, embeddings, and classification
- **Engine Layer**: Core inference engines (LLMEngine for sync, AsyncLLMEngine for async) managing request scheduling and batching
- **Model Executor**: Distributed model execution with support for tensor parallelism and pipeline parallelism
- **Memory Management**: Advanced memory optimization with paged attention and efficient KV cache management
- **Sampling System**: Sophisticated sampling strategies including beam search, guided decoding, and structured output generation

This design enables high-throughput serving with intelligent request batching, efficient resource utilization across multiple GPUs, and support for advanced inference techniques that are critical for production LLM deployments.

## Capabilities

### Text Generation

Primary text generation functionality with support for various prompt formats, sampling parameters, and generation strategies including beam search and guided decoding.

```python { .api }
class LLM:
    def generate(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        sampling_params: Optional[Union[SamplingParams, Sequence[SamplingParams]]] = None,
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,
        priority: Optional[List[int]] = None
    ) -> List[RequestOutput]: ...
```

[Text Generation](./text-generation.md)

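Beyond the defaults shown in Basic Usage, `SamplingParams` can also constrain outputs via guided decoding. A minimal sketch, assuming the `GuidedDecodingParams` import shown above is passed through the `guided_decoding` field of `SamplingParams`; the model name simply reuses the example model from this document:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="microsoft/DialoGPT-medium")

# Constrain the completion to one of a fixed set of choices (guided decoding).
guided = GuidedDecodingParams(choice=["positive", "negative", "neutral"])
params = SamplingParams(temperature=0.0, max_tokens=5, guided_decoding=guided)

outputs = llm.generate(["The movie was great. Sentiment:"], params)
print(outputs[0].outputs[0].text)
```
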
### Chat Completions

Conversational AI interface supporting chat templates, tool calling, and multi-turn conversations with proper message formatting and context management.

```python { .api }
class LLM:
    def chat(
        self,
        messages: Union[list[ChatCompletionMessageParam], list[list[ChatCompletionMessageParam]]],
        sampling_params: Optional[Union[SamplingParams, list[SamplingParams]]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[LoRARequest] = None,
        chat_template: Optional[str] = None,
        chat_template_content_format: ChatTemplateContentFormatOption = "auto",
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tools: Optional[list[dict[str, Any]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None
    ) -> list[RequestOutput]: ...
```

[Chat Completions](./chat-completions.md)

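A short usage sketch: `messages` follow the OpenAI-style role/content format and the model's own chat template is applied automatically. The model name below is a placeholder for whichever instruction-tuned model you serve:

```python
from vllm import LLM, SamplingParams

# Placeholder model name: substitute any chat/instruct model supported by vLLM.
llm = LLM(model="your-org/your-chat-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain paged attention in one sentence."},
]

outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```
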
### Text Embeddings

Text encoding and embedding generation for semantic similarity, retrieval applications, and downstream NLP tasks with support for various pooling strategies.

```python { .api }
class LLM:
    def encode(
        self,
        prompts: Union[PromptType, Sequence[PromptType], DataPrompt],
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        pooling_task: PoolingTask = "encode",
        tokenization_kwargs: Optional[dict[str, Any]] = None
    ) -> list[PoolingRequestOutput]: ...
```

[Text Embeddings](./text-embeddings.md)

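A minimal sketch of `encode`, assuming an embedding-capable model (the name below is a placeholder); each result is a `PoolingRequestOutput` whose `outputs` field carries the pooled representation:

```python
from vllm import LLM

# Placeholder: any pooling/embedding model supported by vLLM.
llm = LLM(model="your-org/your-embedding-model")

outputs = llm.encode(["vLLM is fast.", "Paged attention saves memory."])
for out in outputs:
    print(out.outputs)  # PoolingOutput holding the pooled representation
```
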
### Text Classification

Text classification functionality with predefined class labels, supporting various pooling strategies and confidence scoring for categorization tasks.

```python { .api }
class LLM:
    def classify(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[ClassificationRequestOutput]: ...
```

[Text Classification](./text-classification.md)

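A sketch of `classify`, assuming a sequence-classification model (placeholder name); per the `ClassificationOutput` type documented in the Types section, each result carries per-class probabilities in `outputs.probs`:

```python
from vllm import LLM

# Placeholder: any sequence-classification model supported by vLLM.
llm = LLM(model="your-org/your-classifier-model")

results = llm.classify(["I loved this film.", "Terrible service."])
for res in results:
    print(res.outputs.probs)  # one probability per class label
```
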
### Text Scoring

Text similarity and likelihood scoring for comparing text pairs, ranking, and evaluation tasks with support for various scoring methods.

```python { .api }
class LLM:
    def score(
        self,
        data_1: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],
        data_2: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],
        /,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[PoolingParams] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[ScoringRequestOutput]: ...
```

[Text Scoring](./text-scoring.md)

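A sketch of `score`, comparing `data_1` against `data_2` (for example a query against candidate passages) with a cross-encoder-style model (placeholder name). Each `ScoringRequestOutput` exposes the score via `outputs.score`, as defined in the Types section:

```python
from vllm import LLM

# Placeholder: any cross-encoder / scoring model supported by vLLM.
llm = LLM(model="your-org/your-scoring-model")

query = "What is paged attention?"
passages = [
    "Paged attention manages the KV cache in fixed-size blocks.",
    "The Eiffel Tower is in Paris.",
]

results = llm.score(query, passages)
for passage, res in zip(passages, results):
    print(f"{res.outputs.score:.3f}  {passage}")
```
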
### Embedding Generation

Specialized embedding generation method for encoding text into vector representations with automatic normalization and optimal pooling strategies.

```python { .api }
class LLM:
    def embed(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[EmbeddingRequestOutput]: ...
```

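A sketch of `embed`, assuming an embedding model (placeholder name); each result is an `EmbeddingRequestOutput` whose `outputs.embedding` is a plain list of floats, as documented in the Types section:

```python
from vllm import LLM

# Placeholder: any embedding model supported by vLLM.
llm = LLM(model="your-org/your-embedding-model")

outputs = llm.embed(["vLLM is fast.", "Paged attention saves memory."])
vectors = [out.outputs.embedding for out in outputs]
print(len(vectors), len(vectors[0]))  # number of prompts, embedding dimension
```
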
### Beam Search Generation

Advanced beam search generation for exploring multiple generation paths and finding high-quality outputs through systematic search.

```python { .api }
class LLM:
    def beam_search(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        params: BeamSearchParams
    ) -> list[BeamSearchOutput]: ...
```

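A sketch of `beam_search`. It assumes `BeamSearchParams` (imported above) carries at least `beam_width` and `max_tokens`, and passes the prompt as a `TextPrompt`-style dict to match the signature shown:

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="microsoft/DialoGPT-medium")

params = BeamSearchParams(beam_width=4, max_tokens=32)
outputs = llm.beam_search([{"prompt": "The capital of France is"}], params)

# Each output holds the explored beam sequences with their cumulative log-probabilities.
for seq in outputs[0].sequences:
    print(seq.cumulative_logprob, seq.text)
```
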
### Reward Modeling

Generate reward scores for text evaluation, preference learning, and RLHF applications.

```python { .api }
class LLM:
    def reward(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None
    ) -> list[PoolingRequestOutput]: ...
```

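A minimal sketch, assuming a reward model (placeholder name); `reward` returns `PoolingRequestOutput` objects as documented above:

```python
from vllm import LLM

# Placeholder: any reward model supported by vLLM.
llm = LLM(model="your-org/your-reward-model")

outputs = llm.reward(["Question: What is 2 + 2? Answer: 4"])
print(outputs[0].outputs)  # pooled reward output for the prompt
```
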
### Asynchronous Inference

High-performance asynchronous inference engine for concurrent request handling, streaming responses, and integration with async frameworks.

```python { .api }
class AsyncLLMEngine:
    async def generate(
        self,
        prompt: Optional[str],
        sampling_params: SamplingParams,
        request_id: str,
        prompt_token_ids: Optional[List[int]] = None,
        lora_request: Optional[LoRARequest] = None
    ) -> AsyncIterator[RequestOutput]: ...
```

[Asynchronous Inference](./async-inference.md)

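A minimal streaming sketch: build the engine from `AsyncEngineArgs`, submit a request with a unique `request_id`, and iterate over the incremental `RequestOutput`s. It assumes the engine is created inside a running event loop:

```python
import asyncio
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="microsoft/DialoGPT-medium")
    )
    params = SamplingParams(temperature=0.8, max_tokens=64)

    # Each request needs a unique id; outputs stream in as tokens are generated.
    async for output in engine.generate("The future of AI is", params, request_id="req-0"):
        print(output.outputs[0].text)

asyncio.run(main())
```
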
### Configuration and Engine Management

Comprehensive configuration system for model loading, distributed execution, memory management, and performance tuning across various deployment scenarios.

```python { .api }
class EngineArgs:
    model: str
    tokenizer: Optional[str] = None
    tokenizer_mode: str = "auto"
    trust_remote_code: bool = False
    tensor_parallel_size: int = 1
    dtype: str = "auto"
    quantization: Optional[str] = None
    max_model_len: Optional[int] = None
    gpu_memory_utilization: float = 0.9
    # ... many more configuration options
```

[Configuration](./configuration.md)

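The same fields are accepted as keyword arguments by the `LLM` constructor (and by `AsyncEngineArgs` for the async engine), so a tuned offline instance looks like the sketch below; the specific values are illustrative only:

```python
from vllm import LLM

llm = LLM(
    model="microsoft/DialoGPT-medium",
    dtype="float16",               # numeric precision for weights and activations
    max_model_len=1024,            # cap the context length
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may claim
    tensor_parallel_size=1,        # shard the model across this many GPUs
)
```
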
### Parameters and Data Types

Essential parameter classes and data types for controlling generation behavior, defining inputs and outputs, and managing model configurations.

```python { .api }
class SamplingParams:
    n: int = 1
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    stop: Optional[Union[str, List[str]]] = None
    max_tokens: Optional[int] = None
    # ... many more sampling parameters
```

[Parameters and Types](./parameters-types.md)

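A small sketch combining the fields above: request several candidate completions per prompt and stop generation at a delimiter:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/DialoGPT-medium")

params = SamplingParams(
    n=3,               # return three candidate completions per prompt
    temperature=0.9,
    top_p=0.95,
    max_tokens=48,
    stop=["\n\n"],     # stop at the first blank line
)

for completion in llm.generate(["Write a tagline for a coffee shop:"], params)[0].outputs:
    print(completion.index, completion.text)
```
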
### Tokenizer Management

Access and manage tokenizers with support for LoRA adapters and custom tokenization.

```python { .api }
class LLM:
    def get_tokenizer(self, lora_request: Optional[LoRARequest] = None) -> "PreTrainedTokenizerBase":
        """
        Get tokenizer with optional LoRA adapter support.

        Parameters:
        - lora_request: Optional LoRA adapter configuration

        Returns:
        Tokenizer instance configured for this model
        """

    def set_tokenizer(self, tokenizer: "PreTrainedTokenizerBase") -> None:
        """
        Set a custom tokenizer for this LLM instance.

        Parameters:
        - tokenizer: Custom tokenizer to use for this model
        """

    def get_default_sampling_params(self) -> SamplingParams:
        """
        Get default sampling parameters from model configuration.

        Returns:
        Default SamplingParams instance based on model config
        """
```

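A short sketch: the returned tokenizer is the underlying Hugging Face tokenizer, so it can be used to inspect token counts before submitting prompts:

```python
from vllm import LLM

llm = LLM(model="microsoft/DialoGPT-medium")

tokenizer = llm.get_tokenizer()
token_ids = tokenizer.encode("Hello, my name is")
print(len(token_ids), token_ids)

# Defaults derived from the model's generation config, if present.
print(llm.get_default_sampling_params())
```
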
### Advanced Model Management

Direct model access, profiling, and distributed computing capabilities.

```python { .api }
class LLM:
    def collective_rpc(
        self,
        method: str,
        timeout: Optional[float] = None,
        args: tuple[Any, ...] = (),
        kwargs: Optional[dict[str, Any]] = None
    ) -> list[Any]:
        """
        Execute RPC calls on all model workers.

        Parameters:
        - method: Method name to call on workers
        - timeout: Optional timeout for RPC calls
        - args: Positional arguments for method
        - kwargs: Keyword arguments for method

        Returns:
        List of results from all workers
        """

    def apply_model(self, func: Callable) -> list[Any]:
        """
        Apply function directly to model in each worker.

        Parameters:
        - func: Function to apply to model instances

        Returns:
        Results from applying function to all model instances
        """

    def start_profile(self) -> None:
        """Start performance profiling for this LLM instance."""

    def stop_profile(self) -> None:
        """Stop performance profiling and save results."""

    def reset_prefix_cache(self, device: Optional[Union[str, int]] = None) -> None:
        """
        Reset prefix cache for memory optimization.

        Parameters:
        - device: Optional device specification for cache reset
        """
```

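A sketch using `apply_model` and the profiling hooks documented above. The parameter-count lambda is only an illustration of running arbitrary code against the worker-resident model, and the profiling calls assume a profiler output directory has been configured (for example via the `VLLM_TORCH_PROFILER_DIR` environment variable):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/DialoGPT-medium")

# Count parameters directly on each worker's model replica.
num_params = llm.apply_model(lambda model: sum(p.numel() for p in model.parameters()))
print(num_params)

# Profile a generation run (results are written by the profiler backend).
llm.start_profile()
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()
```
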
### Resource Management

Control engine sleep/wake states and retrieve metrics for monitoring.

```python { .api }
class LLM:
    def sleep(self, level: int = 1) -> None:
        """
        Put engine to sleep to free resources.

        Parameters:
        - level: Sleep level (1=light, 2=deep)
        """

    def wake_up(self, tags: Optional[list[str]] = None) -> None:
        """
        Wake up sleeping engine.

        Parameters:
        - tags: Optional tags for selective wake-up
        """

    def get_metrics(self) -> dict[str, Any]:
        """
        Get Prometheus metrics for monitoring (V1 engine only).

        Returns:
        Dictionary of metrics and values
        """
```

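A sketch of the sleep/wake cycle; freeing and restoring resources this way is assumed to require constructing the engine with `enable_sleep_mode=True`:

```python
from vllm import LLM, SamplingParams

# Sleep mode must be enabled at construction time for sleep()/wake_up() to work.
llm = LLM(model="microsoft/DialoGPT-medium", enable_sleep_mode=True)

llm.generate(["Hello"], SamplingParams(max_tokens=8))

llm.sleep(level=1)   # free resources between bursts of traffic
llm.wake_up()        # restore the engine before serving again

print(llm.get_metrics())  # Prometheus-style metrics (V1 engine only)
```
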
### Chat Message Processing

Preprocess chat messages into standardized prompt format.

```python { .api }
class LLM:
    def preprocess_chat(
        self,
        messages: List[ChatCompletionMessageParam],
        lora_request: Optional[LoRARequest] = None,
        chat_template: Optional[str] = None,
        chat_template_content_format: ChatTemplateContentFormatOption = "auto",
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tools: Optional[list[dict[str, Any]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None
    ) -> "TokensPrompt":
        """
        Preprocess chat messages into TokensPrompt format.

        Parameters:
        - messages: List of chat completion messages
        - lora_request: Optional LoRA adapter configuration
        - chat_template: Optional custom chat template
        - chat_template_content_format: Content format option
        - add_generation_prompt: Whether to add generation prompt
        - continue_final_message: Whether to continue final message
        - tools: Optional list of available tools
        - chat_template_kwargs: Additional template arguments
        - mm_processor_kwargs: Multimodal processor arguments

        Returns:
        Preprocessed TokensPrompt ready for generation
        """
```

### Model Registry and Discovery

Model registry system for discovering supported model architectures, checking model capabilities, and managing model metadata.

```python { .api }
class ModelRegistry:
    @staticmethod
    def get_supported_archs() -> list[str]:
        """Get list of supported model architectures."""

    @staticmethod
    def get_supported_models() -> list[str]:
        """Get list of all supported model names."""

    @staticmethod
    def get_model_info(model_arch: str) -> dict[str, Any]:
        """
        Get detailed information about a model architecture.

        Parameters:
        - model_arch: Model architecture name

        Returns:
        Dictionary with model architecture details
        """

    @staticmethod
    def is_text_generation_model(model_arch: str) -> bool:
        """Check if model supports text generation."""

    @staticmethod
    def is_embedding_model(model_arch: str) -> bool:
        """Check if model supports embeddings."""

    @staticmethod
    def is_multimodal_model(model_arch: str) -> bool:
        """Check if model supports multimodal inputs."""
```

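A sketch: the registry can be queried before loading anything, for example to check whether an architecture name taken from a model's `config.json` is supported and what it can do:

```python
from vllm import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print(len(archs), archs[:5])

arch = "LlamaForCausalLM"
print(ModelRegistry.is_text_generation_model(arch))
print(ModelRegistry.is_multimodal_model(arch))
```
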
### Distributed Computing Utilities

Ray-based distributed computing initialization and management for multi-node inference deployments.

```python { .api }
def initialize_ray_cluster(
    parallel_config: ParallelConfig,
    engine_use_ray: bool = False,
    ray_address: Optional[str] = None
) -> None:
    """
    Initialize Ray cluster for distributed inference.

    Parameters:
    - parallel_config: Parallelism configuration
    - engine_use_ray: Whether engine uses Ray
    - ray_address: Ray cluster address
    """
```

### Command Line Interface

CLI entry point for vLLM server and utilities, providing OpenAI-compatible API server and benchmarking tools.

```bash
# Start OpenAI-compatible API server
vllm serve microsoft/DialoGPT-medium --host 0.0.0.0 --port 8000

# Run performance benchmarks
vllm benchmark --model microsoft/DialoGPT-medium --input-len 512 --output-len 128
```

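Once `vllm serve` is running, the server speaks the OpenAI API, so any OpenAI-compatible client can talk to it. A sketch using the `openai` Python package (a separate dependency, not part of vLLM) pointed at the local endpoint:

```python
from openai import OpenAI  # third-party client, installed separately

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="microsoft/DialoGPT-medium",   # the model name passed to `vllm serve`
    prompt="Hello, my name is",
    max_tokens=32,
)
print(response.choices[0].text)
```
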
### Version and Metadata

Package version information and backward compatibility utilities.

```python { .api }
__version__: str  # Package version string
__version_tuple__: Tuple[int, int, int]  # Version as tuple

def bc_linter_skip(func):
    """Skip backward compatibility linting for function."""

def bc_linter_include(func):
    """Include function in backward compatibility linting."""
```

## Types

```python { .api }
class RequestOutput:
    request_id: str
    prompt: Optional[str]
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput]
    finished: bool
    metrics: Optional[RequestMetrics] = None
    lora_request: Optional[LoRARequest] = None

class CompletionOutput:
    index: int
    text: str
    token_ids: list[int]
    cumulative_logprob: Optional[float]
    logprobs: Optional[SampleLogprobs]
    finish_reason: Optional[str] = None
    stop_reason: Union[int, str, None] = None
    lora_request: Optional[LoRARequest] = None

class PoolingRequestOutput:
    id: str
    outputs: PoolingOutput
    prompt_token_ids: list[int]
    finished: bool

class EmbeddingRequestOutput:
    id: str
    outputs: EmbeddingOutput
    prompt_token_ids: list[int]
    finished: bool

class ClassificationRequestOutput:
    id: str
    outputs: ClassificationOutput
    prompt_token_ids: list[int]
    finished: bool

class ScoringRequestOutput:
    id: str
    outputs: ScoringOutput
    prompt_token_ids: list[int]
    finished: bool

class EmbeddingOutput:
    embedding: list[float]

class ClassificationOutput:
    probs: list[float]

class ScoringOutput:
    score: float

class BeamSearchOutput:
    sequences: list[BeamSearchSequence]
    finished: bool

class BeamSearchSequence:
    text: str
    token_ids: list[int]
    cumulative_logprob: float

# Prompt input types
class DataPrompt(TypedDict):
    data: Any
    data_format: str

class EmbedsPrompt(TypedDict):
    prompt_embeds: "torch.Tensor"
    cache_salt: NotRequired[str]

class ExplicitEncoderDecoderPrompt(TypedDict):
    encoder_prompt: Any
    decoder_prompt: Optional[Any]
    mm_processor_kwargs: NotRequired[dict[str, Any]]

class TextPrompt(TypedDict):
    prompt: str
    multi_modal_data: Optional[MultiModalDataDict]
    multi_modal_uuids: NotRequired["MultiModalUUIDDict"]
    cache_salt: NotRequired[str]

class TokensPrompt(TypedDict):
    prompt_token_ids: list[int]
    prompt: NotRequired[str]
    token_type_ids: NotRequired[list[int]]
    multi_modal_data: Optional[MultiModalDataDict]
    multi_modal_uuids: NotRequired["MultiModalUUIDDict"]
    cache_salt: NotRequired[str]

# Core enums
class SamplingType(IntEnum):
    GREEDY = 0
    RANDOM = 1
    RANDOM_SEED = 2

class RequestOutputKind(Enum):
    CUMULATIVE = 0   # Return entire output so far
    DELTA = 1        # Return only deltas
    FINAL_ONLY = 2   # Do not return intermediate output

# Type aliases
PromptType = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt, ExplicitEncoderDecoderPrompt]
SingletonPrompt = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt]
Sequence = Union[list, tuple]
PoolingTask = Literal["encode", "embed", "classify", "reward", "score"]
ChatTemplateContentFormatOption = Literal["auto", "string", "openai"]

# Utility functions
def is_tokens_prompt(prompt: SingletonPrompt) -> "TypeIs[TokensPrompt]": ...
def is_embeds_prompt(prompt: SingletonPrompt) -> "TypeIs[EmbedsPrompt]": ...
```