or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/pypi-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/vllm@0.10.x

To install, run

npx @tessl/cli install tessl/pypi-vllm@0.10.0

0

# vLLM

1

2

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It provides advanced techniques like continuous batching, paged attention, and GPU memory optimization to maximize inference speed while minimizing memory footprint, making it ideal for production deployments.

3

4

## Package Information

5

6

- **Package Name**: vllm

7

- **Package Type**: pypi

8

- **Language**: Python

9

- **Installation**: `pip install vllm`

10

- **Repository**: https://github.com/vllm-project/vllm

11

12

## Core Imports

13

14

```python

15

import vllm

16

```

17

18

For the main LLM interface:

19

20

```python

21

from vllm import LLM, SamplingParams

22

```

23

24

For output types:

25

26

```python

27

from vllm import (

28

RequestOutput, CompletionOutput,

29

EmbeddingRequestOutput, ClassificationRequestOutput,

30

ScoringRequestOutput, PoolingRequestOutput

31

)

32

```

33

34

For async usage:

35

36

```python

37

from vllm import AsyncLLMEngine, AsyncEngineArgs

38

```

39

40

For additional parameters:

41

42

```python

43

from vllm import PoolingParams

44

from vllm.lora.request import LoRARequest

45

from vllm.sampling_params import GuidedDecodingParams, BeamSearchParams

46

```

47

48

## Basic Usage

49

50

```python

51

from vllm import LLM, SamplingParams

52

53

# Create an LLM instance

54

llm = LLM(model="microsoft/DialoGPT-medium")

55

56

# Define sampling parameters

57

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

58

59

# Generate text

60

prompts = [

61

"Hello, my name is",

62

"The capital of France is",

63

"The future of AI is",

64

]

65

66

outputs = llm.generate(prompts, sampling_params)

67

68

# Print results

69

for output in outputs:

70

prompt = output.prompt

71

generated_text = output.outputs[0].text

72

print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

73

```

74

75

## Architecture

76

77

vLLM's architecture is built around several key components:

78

79

- **LLM Class**: High-level interface providing simple text generation, chat completions, embeddings, and classification

80

- **Engine Layer**: Core inference engines (LLMEngine for sync, AsyncLLMEngine for async) managing request scheduling and batching

81

- **Model Executor**: Distributed model execution with support for tensor parallelism and pipeline parallelism

82

- **Memory Management**: Advanced memory optimization with paged attention and efficient KV cache management

83

- **Sampling System**: Sophisticated sampling strategies including beam search, guided decoding, and structured output generation

84

85

This design enables high-throughput serving with intelligent request batching, efficient resource utilization across multiple GPUs, and support for advanced inference techniques that are critical for production LLM deployments.

86

87

## Capabilities

88

89

### Text Generation

90

91

Primary text generation functionality with support for various prompt formats, sampling parameters, and generation strategies including beam search and guided decoding.

92

93

```python { .api }

94

class LLM:

95

def generate(

96

self,

97

prompts: Union[PromptType, Sequence[PromptType]],

98

sampling_params: Optional[Union[SamplingParams, Sequence[SamplingParams]]] = None,

99

*,

100

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

101

lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,

102

priority: Optional[List[int]] = None

103

) -> List[RequestOutput]: ...

104

```

105

106

[Text Generation](./text-generation.md)

107

108

### Chat Completions

109

110

Conversational AI interface supporting chat templates, tool calling, and multi-turn conversations with proper message formatting and context management.

111

112

```python { .api }

113

class LLM:

114

def chat(

115

self,

116

messages: Union[list[ChatCompletionMessageParam], list[list[ChatCompletionMessageParam]]],

117

sampling_params: Optional[Union[SamplingParams, list[SamplingParams]]] = None,

118

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

119

lora_request: Optional[LoRARequest] = None,

120

chat_template: Optional[str] = None,

121

chat_template_content_format: ChatTemplateContentFormatOption = "auto",

122

add_generation_prompt: bool = True,

123

continue_final_message: bool = False,

124

tools: Optional[list[dict[str, Any]]] = None,

125

chat_template_kwargs: Optional[dict[str, Any]] = None,

126

mm_processor_kwargs: Optional[dict[str, Any]] = None

127

) -> list[RequestOutput]: ...

128

```

129

130

[Chat Completions](./chat-completions.md)

131

132

### Text Embeddings

133

134

Text encoding and embedding generation for semantic similarity, retrieval applications, and downstream NLP tasks with support for various pooling strategies.

135

136

```python { .api }

137

class LLM:

138

def encode(

139

self,

140

prompts: Union[PromptType, Sequence[PromptType], DataPrompt],

141

pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,

142

*,

143

truncate_prompt_tokens: Optional[int] = None,

144

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

145

lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,

146

pooling_task: PoolingTask = "encode",

147

tokenization_kwargs: Optional[dict[str, Any]] = None

148

) -> list[PoolingRequestOutput]: ...

149

```

150

151

[Text Embeddings](./text-embeddings.md)

152

153

### Text Classification

154

155

Text classification functionality with predefined class labels, supporting various pooling strategies and confidence scoring for categorization tasks.

156

157

```python { .api }

158

class LLM:

159

def classify(

160

self,

161

prompts: Union[PromptType, Sequence[PromptType]],

162

*,

163

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

164

pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,

165

lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None

166

) -> list[ClassificationRequestOutput]: ...

167

```

168

169

[Text Classification](./text-classification.md)

170

171

### Text Scoring

172

173

Text similarity and likelihood scoring for comparing text pairs, ranking, and evaluation tasks with support for various scoring methods.

174

175

```python { .api }

176

class LLM:

177

def score(

178

self,

179

data_1: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],

180

data_2: Union[SingletonPrompt, Sequence[SingletonPrompt], ScoreMultiModalParam],

181

/,

182

*,

183

truncate_prompt_tokens: Optional[int] = None,

184

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

185

pooling_params: Optional[PoolingParams] = None,

186

lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None

187

) -> list[ScoringRequestOutput]: ...

188

```

189

190

[Text Scoring](./text-scoring.md)

191

192

### Embedding Generation

193

194

Specialized embedding generation method for encoding text into vector representations with automatic normalization and optimal pooling strategies.

195

196

```python { .api }

197

class LLM:

198

def embed(

199

self,

200

prompts: Union[PromptType, Sequence[PromptType]],

201

*,

202

truncate_prompt_tokens: Optional[int] = None,

203

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

204

pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,

205

lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None

206

) -> list[EmbeddingRequestOutput]: ...

207

```

208

209

### Beam Search Generation

210

211

Advanced beam search generation for exploring multiple generation paths and finding high-quality outputs through systematic search.

212

213

```python { .api }

214

class LLM:

215

def beam_search(

216

self,

217

prompts: Union[PromptType, Sequence[PromptType]],

218

params: BeamSearchParams

219

) -> list[BeamSearchOutput]: ...

220

```

221

222

### Reward Modeling

223

224

Generate reward scores for text evaluation, preference learning, and RLHF applications.

225

226

```python { .api }

227

class LLM:

228

def reward(

229

self,

230

prompts: Union[PromptType, Sequence[PromptType]],

231

*,

232

use_tqdm: Union[bool, Callable[..., tqdm]] = True,

233

pooling_params: Optional[Union[PoolingParams, Sequence[PoolingParams]]] = None,

234

lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None

235

) -> list[PoolingRequestOutput]: ...

236

```

237

238

### Asynchronous Inference

239

240

High-performance asynchronous inference engine for concurrent request handling, streaming responses, and integration with async frameworks.

241

242

```python { .api }

243

class AsyncLLMEngine:

244

async def generate(

245

self,

246

prompt: Optional[str],

247

sampling_params: SamplingParams,

248

request_id: str,

249

prompt_token_ids: Optional[List[int]] = None,

250

lora_request: Optional[LoRARequest] = None

251

) -> AsyncIterator[RequestOutput]: ...

252

```

253

254

[Asynchronous Inference](./async-inference.md)

255

256

### Configuration and Engine Management

257

258

Comprehensive configuration system for model loading, distributed execution, memory management, and performance tuning across various deployment scenarios.

259

260

```python { .api }

261

class EngineArgs:

262

model: str

263

tokenizer: Optional[str] = None

264

tokenizer_mode: str = "auto"

265

trust_remote_code: bool = False

266

tensor_parallel_size: int = 1

267

dtype: str = "auto"

268

quantization: Optional[str] = None

269

max_model_len: Optional[int] = None

270

gpu_memory_utilization: float = 0.9

271

# ... many more configuration options

272

```

273

274

[Configuration](./configuration.md)

275

276

### Parameters and Data Types

277

278

Essential parameter classes and data types for controlling generation behavior, defining inputs and outputs, and managing model configurations.

279

280

```python { .api }

281

class SamplingParams:

282

n: int = 1

283

temperature: float = 1.0

284

top_p: float = 1.0

285

top_k: int = -1

286

stop: Optional[Union[str, List[str]]] = None

287

max_tokens: Optional[int] = None

288

# ... many more sampling parameters

289

```

290

291

[Parameters and Types](./parameters-types.md)

292

293

### Tokenizer Management

294

295

Access and manage tokenizers with support for LoRA adapters and custom tokenization.

296

297

```python { .api }

298

class LLM:

299

def get_tokenizer(self, lora_request: Optional[LoRARequest] = None) -> "PreTrainedTokenizerBase":

300

"""

301

Get tokenizer with optional LoRA adapter support.

302

303

Parameters:

304

- lora_request: Optional LoRA adapter configuration

305

306

Returns:

307

Tokenizer instance configured for this model

308

"""

309

310

def set_tokenizer(self, tokenizer: "PreTrainedTokenizerBase") -> None:

311

"""

312

Set a custom tokenizer for this LLM instance.

313

314

Parameters:

315

- tokenizer: Custom tokenizer to use for this model

316

"""

317

318

def get_default_sampling_params(self) -> SamplingParams:

319

"""

320

Get default sampling parameters from model configuration.

321

322

Returns:

323

Default SamplingParams instance based on model config

324

"""

325

```

326

327

### Advanced Model Management

328

329

Direct model access, profiling, and distributed computing capabilities.

330

331

```python { .api }

332

class LLM:

333

def collective_rpc(

334

self,

335

method: str,

336

timeout: Optional[float] = None,

337

args: tuple[Any, ...] = (),

338

kwargs: Optional[dict[str, Any]] = None

339

) -> list[Any]:

340

"""

341

Execute RPC calls on all model workers.

342

343

Parameters:

344

- method: Method name to call on workers

345

- timeout: Optional timeout for RPC calls

346

- args: Positional arguments for method

347

- kwargs: Keyword arguments for method

348

349

Returns:

350

List of results from all workers

351

"""

352

353

def apply_model(self, func: Callable) -> list[Any]:

354

"""

355

Apply function directly to model in each worker.

356

357

Parameters:

358

- func: Function to apply to model instances

359

360

Returns:

361

Results from applying function to all model instances

362

"""

363

364

def start_profile(self) -> None:

365

"""Start performance profiling for this LLM instance."""

366

367

def stop_profile(self) -> None:

368

"""Stop performance profiling and save results."""

369

370

def reset_prefix_cache(self, device: Optional[Union[str, int]] = None) -> None:

371

"""

372

Reset prefix cache for memory optimization.

373

374

Parameters:

375

- device: Optional device specification for cache reset

376

"""

377

```

378

379

### Resource Management

380

381

Control engine sleep/wake states and retrieve metrics for monitoring.

382

383

```python { .api }

384

class LLM:

385

def sleep(self, level: int = 1) -> None:

386

"""

387

Put engine to sleep to free resources.

388

389

Parameters:

390

- level: Sleep level (1=light, 2=deep)

391

"""

392

393

def wake_up(self, tags: Optional[list[str]] = None) -> None:

394

"""

395

Wake up sleeping engine.

396

397

Parameters:

398

- tags: Optional tags for selective wake-up

399

"""

400

401

def get_metrics(self) -> dict[str, Any]:

402

"""

403

Get Prometheus metrics for monitoring (V1 engine only).

404

405

Returns:

406

Dictionary of metrics and values

407

"""

408

```

409

410

### Chat Message Processing

411

412

Preprocess chat messages into standardized prompt format.

413

414

```python { .api }

415

class LLM:

416

def preprocess_chat(

417

self,

418

messages: List[ChatCompletionMessageParam],

419

lora_request: Optional[LoRARequest] = None,

420

chat_template: Optional[str] = None,

421

chat_template_content_format: ChatTemplateContentFormatOption = "auto",

422

add_generation_prompt: bool = True,

423

continue_final_message: bool = False,

424

tools: Optional[list[dict[str, Any]]] = None,

425

chat_template_kwargs: Optional[dict[str, Any]] = None,

426

mm_processor_kwargs: Optional[dict[str, Any]] = None

427

) -> "TokensPrompt":

428

"""

429

Preprocess chat messages into TokensPrompt format.

430

431

Parameters:

432

- messages: List of chat completion messages

433

- lora_request: Optional LoRA adapter configuration

434

- chat_template: Optional custom chat template

435

- chat_template_content_format: Content format option

436

- add_generation_prompt: Whether to add generation prompt

437

- continue_final_message: Whether to continue final message

438

- tools: Optional list of available tools

439

- chat_template_kwargs: Additional template arguments

440

- mm_processor_kwargs: Multimodal processor arguments

441

442

Returns:

443

Preprocessed TokensPrompt ready for generation

444

"""

445

```

446

447

### Model Registry and Discovery

448

449

Model registry system for discovering supported model architectures, checking model capabilities, and managing model metadata.

450

451

```python { .api }

452

class ModelRegistry:

453

@staticmethod

454

def get_supported_archs() -> list[str]:

455

"""Get list of supported model architectures."""

456

457

@staticmethod

458

def get_supported_models() -> list[str]:

459

"""Get list of all supported model names."""

460

461

@staticmethod

462

def get_model_info(model_arch: str) -> dict[str, Any]:

463

"""

464

Get detailed information about a model architecture.

465

466

Parameters:

467

- model_arch: Model architecture name

468

469

Returns:

470

Dictionary with model architecture details

471

"""

472

473

@staticmethod

474

def is_text_generation_model(model_arch: str) -> bool:

475

"""Check if model supports text generation."""

476

477

@staticmethod

478

def is_embedding_model(model_arch: str) -> bool:

479

"""Check if model supports embeddings."""

480

481

@staticmethod

482

def is_multimodal_model(model_arch: str) -> bool:

483

"""Check if model supports multimodal inputs."""

484

```

485

486

### Distributed Computing Utilities

487

488

Ray-based distributed computing initialization and management for multi-node inference deployments.

489

490

```python { .api }

491

def initialize_ray_cluster(

492

parallel_config: ParallelConfig,

493

engine_use_ray: bool = False,

494

ray_address: Optional[str] = None

495

) -> None:

496

"""

497

Initialize Ray cluster for distributed inference.

498

499

Parameters:

500

- parallel_config: Parallelism configuration

501

- engine_use_ray: Whether engine uses Ray

502

- ray_address: Ray cluster address

503

"""

504

```

505

506

### Command Line Interface

507

508

CLI entry point for vLLM server and utilities, providing OpenAI-compatible API server and benchmarking tools.

509

510

```bash

511

# Start OpenAI-compatible API server

512

vllm serve microsoft/DialoGPT-medium --host 0.0.0.0 --port 8000

513

514

# Run performance benchmarks

515

vllm benchmark --model microsoft/DialoGPT-medium --input-len 512 --output-len 128

516

```

517

518

### Version and Metadata

519

520

Package version information and backward compatibility utilities.

521

522

```python { .api }

523

__version__: str # Package version string

524

__version_tuple__: Tuple[int, int, int] # Version as tuple

525

526

def bc_linter_skip(func):

527

"""Skip backward compatibility linting for function."""

528

529

def bc_linter_include(func):

530

"""Include function in backward compatibility linting."""

531

```

532

533

## Types

534

535

```python { .api }

536

PromptType = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt]

537

SingletonPrompt = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt]

538

Sequence = Union[list, tuple]

539

PoolingTask = Literal["encode", "embed", "classify", "reward", "score"]

540

541

class TextPrompt:

542

prompt: str

543

multi_modal_data: Optional[MultiModalDataDict] = None

544

545

class TokensPrompt:

546

prompt_token_ids: list[int]

547

multi_modal_data: Optional[MultiModalDataDict] = None

548

549

class EmbedsPrompt:

550

embedding: list[float]

551

multi_modal_data: Optional[MultiModalDataDict] = None

552

553

class RequestOutput:

554

request_id: str

555

prompt: Optional[str]

556

prompt_token_ids: list[int]

557

outputs: list[CompletionOutput]

558

finished: bool

559

metrics: Optional[RequestMetrics] = None

560

lora_request: Optional[LoRARequest] = None

561

562

class CompletionOutput:

563

index: int

564

text: str

565

token_ids: list[int]

566

cumulative_logprob: Optional[float]

567

logprobs: Optional[SampleLogprobs]

568

finish_reason: Optional[str] = None

569

stop_reason: Union[int, str, None] = None

570

lora_request: Optional[LoRARequest] = None

571

572

class PoolingRequestOutput:

573

id: str

574

outputs: PoolingOutput

575

prompt_token_ids: list[int]

576

finished: bool

577

578

class EmbeddingRequestOutput:

579

id: str

580

outputs: EmbeddingOutput

581

prompt_token_ids: list[int]

582

finished: bool

583

584

class ClassificationRequestOutput:

585

id: str

586

outputs: ClassificationOutput

587

prompt_token_ids: list[int]

588

finished: bool

589

590

class ScoringRequestOutput:

591

id: str

592

outputs: ScoringOutput

593

prompt_token_ids: list[int]

594

finished: bool

595

596

class EmbeddingOutput:

597

embedding: list[float]

598

599

class ClassificationOutput:

600

probs: list[float]

601

602

class ScoringOutput:

603

score: float

604

605

class BeamSearchOutput:

606

sequences: list[BeamSearchSequence]

607

finished: bool

608

609

class BeamSearchSequence:

610

text: str

611

token_ids: list[int]

612

cumulative_logprob: float

613

614

class DataPrompt(TypedDict):

615

data: Any

616

data_format: str

617

618

class EmbedsPrompt(TypedDict):

619

prompt_embeds: "torch.Tensor"

620

cache_salt: NotRequired[str]

621

622

class ExplicitEncoderDecoderPrompt(TypedDict):

623

encoder_prompt: Any

624

decoder_prompt: Optional[Any]

625

mm_processor_kwargs: NotRequired[dict[str, Any]]

626

627

# Enhanced TextPrompt with all fields

628

class TextPrompt(TypedDict):

629

prompt: str

630

multi_modal_data: Optional[MultiModalDataDict]

631

multi_modal_uuids: NotRequired["MultiModalUUIDDict"]

632

cache_salt: NotRequired[str]

633

634

# Enhanced TokensPrompt with all fields

635

class TokensPrompt(TypedDict):

636

prompt_token_ids: list[int]

637

prompt: NotRequired[str]

638

token_type_ids: NotRequired[list[int]]

639

multi_modal_data: Optional[MultiModalDataDict]

640

multi_modal_uuids: NotRequired["MultiModalUUIDDict"]

641

cache_salt: NotRequired[str]

642

643

# Core enums

644

class SamplingType(IntEnum):

645

GREEDY = 0

646

RANDOM = 1

647

RANDOM_SEED = 2

648

649

class RequestOutputKind(Enum):

650

CUMULATIVE = 0 # Return entire output so far

651

DELTA = 1 # Return only deltas

652

FINAL_ONLY = 2 # Do not return intermediate output

653

654

# Enhanced type aliases

655

PromptType = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt, ExplicitEncoderDecoderPrompt]

656

SingletonPrompt = Union[str, TextPrompt, TokensPrompt, EmbedsPrompt]

657

PoolingTask = Literal["encode", "embed", "classify", "reward", "score"]

658

ChatTemplateContentFormatOption = Literal["auto", "string", "openai"]

659

660

# Utility functions

661

def is_tokens_prompt(prompt: SingletonPrompt) -> "TypeIs[TokensPrompt]": ...

662

def is_embeds_prompt(prompt: SingletonPrompt) -> "TypeIs[EmbedsPrompt]": ...

663

664

# Version information

665

__version__: str

666

__version_tuple__: Tuple[int, int, int]

667

668

def bc_linter_skip(func):

669

"""Skip backward compatibility linting for function."""

670

671

def bc_linter_include(func):

672

"""Include function in backward compatibility linting."""

673

```