# Configuration

Comprehensive configuration system for vLLM engine initialization, model loading, distributed execution, and performance optimization. Controls everything from basic model selection to advanced deployment scenarios across multiple GPUs and nodes.

## Capabilities

### Engine Arguments

Primary configuration class for initializing LLM engines with extensive options for model, tokenizer, execution, and performance settings.

```python { .api }
class EngineArgs:
    # Model Configuration
    model: str  # HuggingFace model name or local path
    tokenizer: Optional[str] = None  # Tokenizer name/path (defaults to model)
    tokenizer_mode: str = "auto"  # "auto", "slow", or "fast"
    revision: Optional[str] = None  # Model revision/branch
    code_revision: Optional[str] = None  # Code revision for remote code
    tokenizer_revision: Optional[str] = None  # Tokenizer revision
    trust_remote_code: bool = False  # Execute remote code
    download_dir: Optional[str] = None  # Model download directory
    load_format: str = "auto"  # Model loading format
    config_format: str = "auto"  # Config loading format

    # Model Execution
    dtype: str = "auto"  # Model precision ("auto", "half", "float16", "bfloat16", "float32")
    kv_cache_dtype: str = "auto"  # KV cache data type
    quantization_param_path: Optional[str] = None  # Quantization parameters
    quantization: Optional[str] = None  # Quantization method ("awq", "gptq", "squeezellm", "fp8")

    # Memory and Performance
    gpu_memory_utilization: float = 0.9  # GPU memory usage fraction
    swap_space: int = 4  # CPU swap space in GiB
    cpu_offload_gb: float = 0  # CPU offload memory in GB
    max_model_len: Optional[int] = None  # Maximum sequence length
    max_num_batched_tokens: Optional[int] = None  # Maximum batch size in tokens
    max_num_seqs: int = 256  # Maximum concurrent sequences
    max_logprobs: int = 20  # Maximum logprobs to return

    # Parallelism and Distribution
    tensor_parallel_size: int = 1  # Tensor parallelism degree
    pipeline_parallel_size: int = 1  # Pipeline parallelism degree
    max_parallel_loading_workers: Optional[int] = None  # Model loading workers
    ray_workers_use_nsight: bool = False  # Enable Nsight profiling
    block_size: int = 16  # Attention block size
    enable_prefix_caching: bool = False  # Enable prefix caching
    disable_custom_all_reduce: bool = False  # Disable custom all-reduce

    # Advanced Options
    preemption_mode: Optional[str] = None  # Preemption strategy
    num_lookahead_slots: int = 0  # Speculative decoding slots
    seed: int = 0  # Random seed
    num_gpu_blocks_override: Optional[int] = None  # Override GPU block count
    max_seq_len_to_capture: int = 8192  # Maximum sequence length for CUDA graphs
    disable_sliding_window: bool = False  # Disable sliding window attention

    # Multimodal Support
    image_input_type: Optional[str] = None  # Image input format
    image_token_id: Optional[int] = None  # Image token ID
    image_input_shape: Optional[str] = None  # Image input dimensions
    image_feature_size: Optional[int] = None  # Image feature size

    # Scheduling
    scheduler_delay_factor: float = 0.0  # Scheduler delay factor
    enable_chunked_prefill: Optional[bool] = None  # Chunked prefill optimization
```
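
The `tokenizer` field above defaults to the model when left unset. A minimal sketch of that fallback, assuming it is resolved when the arguments object is constructed, is shown here. `SketchArgs` is a hypothetical stand-in that mirrors only two of the listed fields; it is not part of vLLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchArgs:
    """Hypothetical stand-in mirroring two EngineArgs fields."""

    model: str
    tokenizer: Optional[str] = None

    def __post_init__(self) -> None:
        # Fall back to the model name/path when no tokenizer is given
        if self.tokenizer is None:
            self.tokenizer = self.model


args = SketchArgs(model="microsoft/DialoGPT-medium")
print(args.tokenizer)
```

Passing an explicit `tokenizer` leaves it untouched, which is the behavior one would want when a model and its tokenizer live in different repositories.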

### Async Engine Arguments

Extended configuration for asynchronous inference engines with additional options for concurrent request handling and streaming.

```python { .api }
class AsyncEngineArgs(EngineArgs):
    # Inherits all EngineArgs options plus:
    worker_use_ray: bool = False  # Use Ray for distributed workers
    engine_use_ray: bool = False  # Use Ray for engine management
    disable_log_requests: bool = False  # Disable request logging
    max_log_len: Optional[int] = None  # Maximum log length
```
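
To illustrate what a log-length limit such as `max_log_len` implies, the sketch below shortens a prompt before it is written to a log. The helper name `truncate_for_log` is hypothetical, and treating `None` as "no limit" is an assumption based on the field's default.

```python
from typing import Optional


def truncate_for_log(prompt: str, max_log_len: Optional[int]) -> str:
    """Hypothetical helper: shorten a prompt before it is logged."""
    if max_log_len is None:
        return prompt  # Assumed meaning of None: no limit
    return prompt[:max_log_len]


print(truncate_for_log("Explain tensor parallelism in one sentence.", 7))
```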

### Model Configuration Options

Specialized configuration options for different model types and architectures.

```python { .api }
# Model Data Types
class ModelDType(str, Enum):
    AUTO = "auto"
    HALF = "half"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    FLOAT32 = "float32"

# Quantization Methods
class QuantizationMethods(str, Enum):
    AWQ = "awq"
    GPTQ = "gptq"
    SQUEEZELLM = "squeezellm"
    FP8 = "fp8"
    BITSANDBYTES = "bitsandbytes"

# Load Formats
class LoadFormats(str, Enum):
    AUTO = "auto"
    PT = "pt"
    SAFETENSORS = "safetensors"
    NPCACHE = "npcache"
    DUMMY = "dummy"
```
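
Because these enums subclass `str`, a user-supplied option string can be checked by constructing the enum from it. The sketch below repeats `ModelDType` from the listing so it runs on its own; `parse_dtype` is a hypothetical helper, not a vLLM function.

```python
from enum import Enum


class ModelDType(str, Enum):
    AUTO = "auto"
    HALF = "half"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    FLOAT32 = "float32"


def parse_dtype(value: str) -> ModelDType:
    """Hypothetical helper: map a user-supplied string to a ModelDType."""
    try:
        return ModelDType(value.lower())
    except ValueError:
        allowed = ", ".join(member.value for member in ModelDType)
        raise ValueError(
            f"Unknown dtype {value!r}; expected one of: {allowed}"
        ) from None


print(parse_dtype("bfloat16") is ModelDType.BFLOAT16)
```

The same pattern applies to `QuantizationMethods` and `LoadFormats`, and it rejects a typo early with a message that lists the valid choices.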

### Device and Platform Configuration

Configuration options for different hardware platforms and deployment environments.

```python { .api }
class Device(str, Enum):
    GPU = "gpu"
    CPU = "cpu"
    TPU = "tpu"
    XPU = "xpu"

class DeviceConfig:
    device: Device  # Target device type
    device_ids: Optional[List[int]] = None  # Specific device IDs
    placement_group: Optional[str] = None  # Ray placement group
```

## Usage Examples

### Basic GPU Configuration

```python
import dataclasses

from vllm import LLM, EngineArgs

# Simple single-GPU setup
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    max_model_len=2048,
)

# EngineArgs is a dataclass, so asdict() yields keyword arguments for LLM
llm = LLM(**dataclasses.asdict(args))
```

### Multi-GPU Distributed Setup

```python
import dataclasses

from vllm import LLM, EngineArgs

# Multi-GPU configuration
args = EngineArgs(
    model="microsoft/DialoGPT-large",
    tensor_parallel_size=4,  # Shard each layer across 4 GPUs
    pipeline_parallel_size=2,  # 2 pipeline stages (4 x 2 = 8 GPUs total)
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    trust_remote_code=True,
)

# EngineArgs is a dataclass, so asdict() yields keyword arguments for LLM
llm = LLM(**dataclasses.asdict(args))
```
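
The two parallelism degrees multiply: each pipeline stage is itself sharded across `tensor_parallel_size` GPUs. A small sketch of that arithmetic follows; `required_gpus` is a hypothetical helper name used only for illustration.

```python
def required_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Total GPUs = GPUs per pipeline stage x number of pipeline stages."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be >= 1")
    return tensor_parallel_size * pipeline_parallel_size


print(required_gpus(tensor_parallel_size=4, pipeline_parallel_size=2))  # → 8
```

A check like this before launch makes it easy to confirm that the cluster actually has as many GPUs as the configuration needs.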

### Quantized Model Configuration

```python
import dataclasses

from vllm import LLM, EngineArgs

# AWQ quantized model
args = EngineArgs(
    model="microsoft/DialoGPT-medium-awq",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)

# EngineArgs is a dataclass, so asdict() yields keyword arguments for LLM
llm = LLM(**dataclasses.asdict(args))
```
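
To see why quantization matters for memory, the rough estimate below compares weight storage at 16 bits and at 4 bits per weight. It is a back-of-the-envelope sketch: the 7-billion-parameter size is an illustrative assumption, 4-bit weights are assumed as typical for AWQ, and quantization metadata such as scales and zero points is ignored.

```python
def weight_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (ignores quantization metadata)."""
    return num_params * bits_per_weight / 8 / 2**30


# Illustrative 7-billion-parameter model (hypothetical size)
params = 7_000_000_000
fp16 = weight_gib(params, 16)
int4 = weight_gib(params, 4)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

The memory freed by smaller weights is what allows a larger `max_model_len` or more concurrent sequences on the same GPU.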

### Async Engine Configuration

```python
from vllm import AsyncLLMEngine, AsyncEngineArgs

# Async engine with Ray
async_args = AsyncEngineArgs(
    model="microsoft/DialoGPT-medium",
    worker_use_ray=True,
    engine_use_ray=True,
    tensor_parallel_size=2,
    max_num_seqs=128,
    gpu_memory_utilization=0.9,
)

engine = AsyncLLMEngine.from_engine_args(async_args)
```

### Memory-Optimized Configuration

```python
import dataclasses

from vllm import LLM, EngineArgs

# Optimize for memory efficiency
args = EngineArgs(
    model="microsoft/DialoGPT-small",
    gpu_memory_utilization=0.95,
    swap_space=8,  # 8 GiB CPU swap space
    cpu_offload_gb=2,  # Offload 2 GB to CPU
    max_num_batched_tokens=1024,  # Smaller batches
    enable_prefix_caching=True,  # Cache common prefixes
    block_size=8,  # Smaller attention blocks
)

# EngineArgs is a dataclass, so asdict() yields keyword arguments for LLM
llm = LLM(**dataclasses.asdict(args))
```
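
The interaction between `block_size` and the memory budget can be sketched with a simplified KV-cache estimate: each block stores keys and values for `block_size` tokens across all layers. The model dimensions and the 4 GiB budget below are illustrative assumptions, and the helper names are hypothetical; the real engine also accounts for weights, activations, and other overheads.

```python
def kv_block_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   block_size: int, dtype_bytes: int = 2) -> int:
    """Bytes for one KV-cache block: keys and values for block_size tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def num_kv_blocks(budget_bytes: int, block_bytes: int) -> int:
    """How many whole blocks fit into the memory left for the KV cache."""
    return budget_bytes // block_bytes


# Illustrative dimensions (assumed): 24 layers, 16 heads of size 64, fp16 cache
block_bytes = kv_block_bytes(num_layers=24, num_kv_heads=16, head_dim=64,
                             block_size=8)
blocks = num_kv_blocks(budget_bytes=4 * 2**30, block_bytes=block_bytes)
print(f"{block_bytes} bytes per block, {blocks} blocks, "
      f"{blocks * 8} cacheable tokens")
```

Halving `block_size` halves the bytes per block but leaves the total number of cacheable tokens roughly unchanged; the gain is finer-grained allocation, not more capacity.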

### Development and Debugging Configuration

```python
import dataclasses

from vllm import LLM, EngineArgs

# Development setup with reproducible, easy-to-debug settings
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    seed=42,  # Reproducible results
    disable_custom_all_reduce=True,  # Use standard operations
    max_logprobs=10,  # Detailed probability info
    trust_remote_code=True,  # For custom models
    revision="main",  # Specific model version
)

# EngineArgs is a dataclass, so asdict() yields keyword arguments for LLM
llm = LLM(**dataclasses.asdict(args))
```

## Configuration Validation

```python { .api }
def validate_config(args: EngineArgs) -> None:
    """
    Validate engine configuration parameters.

    Raises:
        ValueError: If configuration is invalid
        RuntimeError: If hardware requirements are not met
    """

def get_default_config_for_device(device: Device) -> EngineArgs:
    """
    Get recommended default configuration for target device.

    Parameters:
    - device: Target deployment device

    Returns:
        EngineArgs with device-optimized defaults
    """
```
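
The kind of check a validation step might perform can be sketched as follows. `MiniArgs` and `check_basic_constraints` are hypothetical names, and the rules shown are plausible assumptions for illustration, not vLLM's actual validation logic.

```python
from dataclasses import dataclass


@dataclass
class MiniArgs:
    """Hypothetical subset of EngineArgs fields used for illustration."""

    gpu_memory_utilization: float = 0.9
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    max_num_seqs: int = 256


def check_basic_constraints(args: MiniArgs) -> None:
    """Raise ValueError if any of the assumed range rules is violated."""
    if not 0.0 < args.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if args.tensor_parallel_size < 1 or args.pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be >= 1")
    if args.max_num_seqs < 1:
        raise ValueError("max_num_seqs must be >= 1")


check_basic_constraints(MiniArgs())  # Defaults pass silently
```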

## Environment Variables

vLLM respects numerous environment variables for configuration:

```python
# Key environment variables
VLLM_WORKER_MULTIPROC_METHOD  # Worker process method
VLLM_USE_MODELSCOPE           # Use ModelScope for model downloads
VLLM_TARGET_DEVICE            # Override target device
VLLM_GPU_MEMORY_UTILIZATION   # Default GPU memory usage
VLLM_HOST                     # Server host address
VLLM_PORT                     # Server port
VLLM_USE_RAY_COMPILED_DAG     # Use compiled Ray DAGs
```
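
Environment variables can also be set from Python, as long as this happens early in the process, before vLLM reads them. The sketch below assumes `"spawn"` is an accepted worker method and that `VLLM_USE_MODELSCOPE` takes a `"True"`/`"False"` string; both values are assumptions for illustration.

```python
import os

# Set early in the process, before vLLM reads its environment
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

# setdefault keeps any value already exported by the shell
os.environ.setdefault("VLLM_USE_MODELSCOPE", "False")

print(os.environ["VLLM_WORKER_MULTIPROC_METHOD"])
```

Using `setdefault` for optional overrides lets a deployment script supply fallbacks without clobbering values an operator has set explicitly.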