# Configuration

vLLM's configuration system covers engine initialization, model loading, distributed execution, and performance tuning, from basic model selection to deployments that span multiple GPUs and nodes.

## Capabilities

### Engine Arguments

`EngineArgs` is the primary configuration class for initializing LLM engines. It groups options for the model, tokenizer, execution, and performance.

```python { .api }
class EngineArgs:
    # Model Configuration
    model: str  # HuggingFace model name or local path
    tokenizer: Optional[str] = None  # Tokenizer name/path (defaults to model)
    tokenizer_mode: str = "auto"  # "auto", "slow", or "fast"
    revision: Optional[str] = None  # Model revision/branch
    code_revision: Optional[str] = None  # Code revision for remote code
    tokenizer_revision: Optional[str] = None  # Tokenizer revision
    trust_remote_code: bool = False  # Execute remote code
    download_dir: Optional[str] = None  # Model download directory
    load_format: str = "auto"  # Model loading format
    config_format: str = "auto"  # Config loading format

    # Model Execution
    dtype: str = "auto"  # Model precision ("auto", "half", "float16", "bfloat16", "float32")
    kv_cache_dtype: str = "auto"  # KV cache data type
    quantization_param_path: Optional[str] = None  # Quantization parameters
    quantization: Optional[str] = None  # Quantization method ("awq", "gptq", "squeezellm", "fp8")

    # Memory and Performance
    gpu_memory_utilization: float = 0.9  # GPU memory usage fraction
    swap_space: int = 4  # CPU swap space in GiB
    cpu_offload_gb: float = 0  # CPU offload memory in GB
    max_model_len: Optional[int] = None  # Maximum sequence length
    max_num_batched_tokens: Optional[int] = None  # Maximum batch size in tokens
    max_num_seqs: int = 256  # Maximum concurrent sequences
    max_logprobs: int = 20  # Maximum logprobs to return

    # Parallelism and Distribution
    tensor_parallel_size: int = 1  # Tensor parallelism degree
    pipeline_parallel_size: int = 1  # Pipeline parallelism degree
    max_parallel_loading_workers: Optional[int] = None  # Model loading workers
    ray_workers_use_nsight: bool = False  # Enable Nsight profiling
    block_size: int = 16  # Attention block size
    enable_prefix_caching: bool = False  # Enable prefix caching
    disable_custom_all_reduce: bool = False  # Disable custom all-reduce

    # Advanced Options
    preemption_mode: Optional[str] = None  # Preemption strategy
    num_lookahead_slots: int = 0  # Speculative decoding slots
    seed: int = 0  # Random seed
    num_gpu_blocks_override: Optional[int] = None  # Override GPU block count
    max_seq_len_to_capture: int = 8192  # Maximum sequence length for CUDA graphs
    disable_sliding_window: bool = False  # Disable sliding window attention

    # Multimodal Support
    image_input_type: Optional[str] = None  # Image input format
    image_token_id: Optional[int] = None  # Image token ID
    image_input_shape: Optional[str] = None  # Image input dimensions
    image_feature_size: Optional[int] = None  # Image feature size
    scheduler_delay_factor: float = 0.0  # Scheduler delay factor
    enable_chunked_prefill: Optional[bool] = None  # Chunked prefill optimization
```
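
With this many fields, it can help to see which settings actually differ from their defaults. The sketch below uses only the standard library and a hypothetical stand-in dataclass, `MiniEngineArgs`, that copies a few of the fields listed above. The helper `non_default_fields` and the model name are also illustrative and not part of vLLM.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class MiniEngineArgs:
    """Hypothetical stand-in mirroring a few EngineArgs fields listed above."""
    model: str
    dtype: str = "auto"
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None
    tensor_parallel_size: int = 1


def non_default_fields(args) -> dict:
    """Return only the fields whose value differs from the declared default."""
    changed = {}
    for f in fields(args):
        value = getattr(args, f.name)
        # Required fields (such as model) have no default, so they always count as changed.
        if value != f.default:
            changed[f.name] = value
    return changed


args = MiniEngineArgs(
    model="my-org/my-model",  # placeholder model name
    gpu_memory_utilization=0.8,
    max_model_len=2048,
)

# Full keyword-argument view of the configuration, defaults included.
kwargs = asdict(args)

# Compact view that is convenient for logging.
print(non_default_fields(args))
```

The same `fields()` and `asdict()` calls should apply to any dataclass-based configuration object, which is why the stand-in is enough to show the idea.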

### Async Engine Arguments

`AsyncEngineArgs` extends `EngineArgs` for asynchronous inference engines, adding options for Ray-based workers and request logging.

```python { .api }
class AsyncEngineArgs(EngineArgs):
    # Inherits all EngineArgs options plus:
    worker_use_ray: bool = False  # Use Ray for distributed workers
    engine_use_ray: bool = False  # Use Ray for engine management
    disable_log_requests: bool = False  # Disable request logging
    max_log_len: Optional[int] = None  # Maximum log length
```
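
The two logging options can be pictured as a small filter in front of the logger. The sketch below assumes that `max_log_len` caps the number of prompt characters written to the log and that `None` means no cap. The function `format_request_log` is hypothetical and is not a vLLM API.

```python
from typing import Optional


def format_request_log(
    request_id: str,
    prompt: str,
    disable_log_requests: bool = False,
    max_log_len: Optional[int] = None,
) -> Optional[str]:
    """Build the log line for a request, or return None when logging is disabled."""
    if disable_log_requests:
        return None
    if max_log_len is not None:
        # Keep only the first max_log_len characters of the prompt.
        prompt = prompt[:max_log_len]
    return f"Received request {request_id}: prompt={prompt!r}"
```

Truncating before formatting keeps long prompts from flooding the log while still recording that the request arrived.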

### Model Configuration Options

Enumerations of the accepted values for model data type, quantization method, and weight load format.

```python { .api }
# Model Data Types
class ModelDType(str, Enum):
    AUTO = "auto"
    HALF = "half"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    FLOAT32 = "float32"

# Quantization Methods
class QuantizationMethods(str, Enum):
    AWQ = "awq"
    GPTQ = "gptq"
    SQUEEZELLM = "squeezellm"
    FP8 = "fp8"
    BITSANDBYTES = "bitsandbytes"

# Load Formats
class LoadFormats(str, Enum):
    AUTO = "auto"
    PT = "pt"
    SAFETENSORS = "safetensors"
    NPCACHE = "npcache"
    DUMMY = "dummy"
```
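
Because these enumerations subclass `str`, a member compares equal to its plain string value and can be built directly from user input. The sketch below redefines `QuantizationMethods` locally so that it runs without vLLM installed. The helper `parse_quantization` is a hypothetical convenience and not part of the library.

```python
from enum import Enum
from typing import Optional


class QuantizationMethods(str, Enum):
    AWQ = "awq"
    GPTQ = "gptq"
    SQUEEZELLM = "squeezellm"
    FP8 = "fp8"
    BITSANDBYTES = "bitsandbytes"


def parse_quantization(value: Optional[str]) -> Optional[QuantizationMethods]:
    """Map a user-supplied string (or None) to an enum member with a readable error."""
    if value is None:
        return None
    try:
        # Lookup by value: QuantizationMethods("awq") returns QuantizationMethods.AWQ.
        return QuantizationMethods(value.lower())
    except ValueError:
        allowed = ", ".join(m.value for m in QuantizationMethods)
        raise ValueError(
            f"unknown quantization {value!r}; expected one of: {allowed}"
        ) from None
```

Validating early in this way turns a typo such as `"awk"` into an immediate, descriptive error instead of a failure deep inside model loading.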

### Device and Platform Configuration

Device selection options for different hardware platforms and deployment environments.

```python { .api }
class Device(str, Enum):
    GPU = "gpu"
    CPU = "cpu"
    TPU = "tpu"
    XPU = "xpu"

class DeviceConfig:
    device: Device  # Target device type
    device_ids: Optional[List[int]] = None  # Specific device IDs
    placement_group: Optional[str] = None  # Ray placement group
```

## Usage Examples

### Basic GPU Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Simple GPU setup
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    max_model_len=2048
)

# EngineArgs is a dataclass, so asdict() turns it into keyword arguments
llm = LLM(**asdict(args))
```

### Multi-GPU Distributed Setup

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Multi-GPU configuration
args = EngineArgs(
    model="microsoft/DialoGPT-large",
    tensor_parallel_size=4,  # Shard each layer across 4 GPUs
    pipeline_parallel_size=2,  # 2 pipeline stages (4 x 2 = 8 GPUs in total)
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    trust_remote_code=True
)

# EngineArgs is a dataclass, so asdict() turns it into keyword arguments
llm = LLM(**asdict(args))
```
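
Tensor and pipeline parallelism multiply: the configuration above needs 4 x 2 = 8 GPUs. The sketch below checks a layout before launch. It assumes that the number of attention heads must divide evenly by the tensor-parallel degree. The function names and the head count used in the example are illustrative.

```python
from typing import List


def required_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Total GPUs used by one engine: every pipeline stage holds a full tensor-parallel group."""
    return tensor_parallel_size * pipeline_parallel_size


def check_parallel_layout(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    available_gpus: int,
    num_attention_heads: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    needed = required_gpus(tensor_parallel_size, pipeline_parallel_size)
    if needed > available_gpus:
        problems.append(f"layout needs {needed} GPUs but only {available_gpus} are available")
    if num_attention_heads % tensor_parallel_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads cannot be split evenly "
            f"across {tensor_parallel_size} tensor-parallel ranks"
        )
    return problems


# Example: 4-way tensor parallel, 2 pipeline stages, 8 GPUs, 16 heads (illustrative numbers).
print(check_parallel_layout(4, 2, available_gpus=8, num_attention_heads=16))
```

Returning a list instead of raising on the first issue lets a launcher report every problem in one pass.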

### Quantized Model Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# AWQ quantized model
args = EngineArgs(
    model="microsoft/DialoGPT-medium-awq",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.95,
    max_model_len=8192
)

# EngineArgs is a dataclass, so asdict() turns it into keyword arguments
llm = LLM(**asdict(args))
```
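
A back-of-the-envelope estimate shows why quantization matters for memory: weight storage scales with bits per weight. The sketch below is rough arithmetic only. It ignores the scales, zero points, and any layers that a real quantized checkpoint keeps in higher precision, and the 7-billion-parameter figure is just an example.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB for a model with num_params parameters."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare 16-bit weights with 4-bit quantized weights for a 7B-parameter model.
fp16_gib = weight_memory_gib(7e9, 16)
int4_gib = weight_memory_gib(7e9, 4)
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {int4_gib:.2f} GiB")
```

The memory saved on weights becomes available to the KV cache, which is what lets a quantized model serve longer contexts or more concurrent sequences on the same GPU.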

### Async Engine Configuration

```python
from vllm import AsyncLLMEngine, AsyncEngineArgs

# Async engine with Ray
async_args = AsyncEngineArgs(
    model="microsoft/DialoGPT-medium",
    worker_use_ray=True,
    engine_use_ray=True,
    tensor_parallel_size=2,
    max_num_seqs=128,
    gpu_memory_utilization=0.9
)

engine = AsyncLLMEngine.from_engine_args(async_args)
```

### Memory-Optimized Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Optimize for memory efficiency
args = EngineArgs(
    model="microsoft/DialoGPT-small",
    gpu_memory_utilization=0.95,
    swap_space=8,  # 8 GiB of CPU swap space
    cpu_offload_gb=2,  # Offload 2 GB to CPU
    max_num_batched_tokens=1024,  # Smaller batches
    enable_prefix_caching=True,  # Cache common prefixes
    block_size=8  # Smaller attention blocks
)

# EngineArgs is a dataclass, so asdict() turns it into keyword arguments
llm = LLM(**asdict(args))
```
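
The interaction between `gpu_memory_utilization` and `block_size` can be approximated with simple arithmetic. The sketch below assumes a standard attention layout in which each token stores one key and one value vector per layer. It is an estimate, not how the engine measures memory at startup, and every number in the example is hypothetical.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token; the factor 2 covers the key and the value."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_num_gpu_blocks(
    total_gpu_gib: float,
    gpu_memory_utilization: float,
    weights_gib: float,
    overhead_gib: float,
    block_size: int,
    bytes_per_token: int,
) -> int:
    """Estimate how many KV cache blocks fit in the memory left after weights and overhead."""
    budget_gib = total_gpu_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    bytes_per_block = block_size * bytes_per_token
    return int(budget_gib * 2**30 // bytes_per_block)


# Hypothetical model: 24 layers, 16 KV heads, head size 64, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=16, head_dim=64)
blocks = estimate_num_gpu_blocks(
    total_gpu_gib=16,
    gpu_memory_utilization=0.9,
    weights_gib=1.0,
    overhead_gib=1.0,
    block_size=16,
    bytes_per_token=per_token,
)
print(per_token, blocks, blocks * 16)  # bytes per token, blocks, cacheable tokens
```

Halving `block_size` roughly doubles the block count without changing the total number of cacheable tokens. Smaller blocks reduce the space wasted in partially filled blocks rather than adding capacity.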

### Development and Debugging Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Development setup favoring reproducibility and easier debugging
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    seed=42,  # Reproducible results
    disable_custom_all_reduce=True,  # Use standard operations
    max_logprobs=10,  # Detailed probability info
    trust_remote_code=True,  # For custom models
    revision="main"  # Specific model version
)

# EngineArgs is a dataclass, so asdict() turns it into keyword arguments
llm = LLM(**asdict(args))
```

## Configuration Validation

```python { .api }
def validate_config(args: EngineArgs) -> None:
    """
    Validate engine configuration parameters.

    Raises:
        ValueError: If configuration is invalid
        RuntimeError: If hardware requirements are not met
    """

def get_default_config_for_device(device: Device) -> EngineArgs:
    """
    Get recommended default configuration for target device.

    Parameters:
    - device: Target deployment device

    Returns:
        EngineArgs with device-optimized defaults
    """
```
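
The signature above does not list the individual checks, so the following is only a sketch of what range validation might look like. `SketchArgs` is a hypothetical stand-in with a handful of the documented fields, and each rule in `validate_config_sketch` is an assumption rather than the library's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchArgs:
    """Hypothetical stand-in with a handful of the EngineArgs fields documented above."""
    model: str
    gpu_memory_utilization: float = 0.9
    swap_space: int = 4
    max_model_len: Optional[int] = None
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1


def validate_config_sketch(args: SketchArgs) -> None:
    """Raise ValueError on obviously invalid values; the rules here are assumptions."""
    if not args.model:
        raise ValueError("model must be a non-empty name or path")
    if not 0.0 < args.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if args.swap_space < 0:
        raise ValueError("swap_space must be non-negative")
    if args.max_model_len is not None and args.max_model_len <= 0:
        raise ValueError("max_model_len must be positive when set")
    for name in ("tensor_parallel_size", "pipeline_parallel_size"):
        if getattr(args, name) < 1:
            raise ValueError(f"{name} must be at least 1")
```

Checks that depend on hardware, such as whether enough GPUs are present, would map to the `RuntimeError` case in the docstring and are left out here.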

## Environment Variables

vLLM also reads configuration from environment variables, including:

```python
# Key environment variables
VLLM_WORKER_MULTIPROC_METHOD  # Worker process method
VLLM_USE_MODELSCOPE  # Use ModelScope for model downloads
VLLM_TARGET_DEVICE  # Override target device
VLLM_GPU_MEMORY_UTILIZATION  # Default GPU memory usage
VLLM_HOST  # Server host address
VLLM_PORT  # Server port
VLLM_USE_RAY_COMPILED_DAG  # Use compiled Ray DAGs
```
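
Environment variables arrive as strings, so a launcher script usually converts them before use. The sketch below reads three of the names listed above with the standard library. The function `read_vllm_env`, the accepted truthy spellings, and the use of `None` for unset values are all assumptions made for illustration.

```python
import os
from typing import Mapping


def read_vllm_env(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect a few vLLM-related settings from an environment mapping."""
    port = environ.get("VLLM_PORT")
    use_modelscope = environ.get("VLLM_USE_MODELSCOPE", "false")
    return {
        # Left as a string; None when the variable is unset.
        "worker_multiproc_method": environ.get("VLLM_WORKER_MULTIPROC_METHOD"),
        # Treat "1" and "true" (any letter case) as enabled.
        "use_modelscope": use_modelscope.strip().lower() in ("1", "true"),
        # Convert the port to an integer when present.
        "port": int(port) if port is not None else None,
    }


# Passing a plain dict keeps the example independent of the real process environment.
print(read_vllm_env({"VLLM_PORT": "8000", "VLLM_USE_MODELSCOPE": "True"}))
```

Accepting the mapping as a parameter makes the function easy to test without modifying `os.environ`.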