Tessl Tile for pypi/vllm@0.10.0

# Configuration

vLLM's configuration system covers engine initialization, model loading, distributed execution, and performance tuning, from basic model selection to deployments that span multiple GPUs and nodes.

## Capabilities

### Engine Arguments

The primary configuration class for initializing LLM engines, with options for model, tokenizer, execution, and performance settings.

```python { .api }
class EngineArgs:
    # Model Configuration
    model: str  # HuggingFace model name or local path
    tokenizer: Optional[str] = None  # Tokenizer name/path (defaults to model)
    tokenizer_mode: str = "auto"  # "auto", "slow", "mistral", or "custom"
    revision: Optional[str] = None  # Model revision/branch
    code_revision: Optional[str] = None  # Code revision for remote code
    tokenizer_revision: Optional[str] = None  # Tokenizer revision
    trust_remote_code: bool = False  # Execute remote code
    download_dir: Optional[str] = None  # Model download directory
    load_format: str = "auto"  # Model loading format
    config_format: str = "auto"  # Config loading format

    # Model Execution
    dtype: str = "auto"  # Model precision ("auto", "half", "float16", "bfloat16", "float32")
    kv_cache_dtype: str = "auto"  # KV cache data type
    quantization_param_path: Optional[str] = None  # Quantization parameters
    quantization: Optional[str] = None  # Quantization method ("awq", "gptq", "squeezellm", "fp8")

    # Memory and Performance
    gpu_memory_utilization: float = 0.9  # GPU memory usage fraction
    swap_space: int = 4  # CPU swap space in GiB
    cpu_offload_gb: float = 0  # CPU offload memory in GB
    max_model_len: Optional[int] = None  # Maximum sequence length
    max_num_batched_tokens: Optional[int] = None  # Maximum batch size in tokens
    max_num_seqs: int = 256  # Maximum concurrent sequences
    max_logprobs: int = 20  # Maximum logprobs to return

    # Parallelism and Distribution
    tensor_parallel_size: int = 1  # Tensor parallelism degree
    pipeline_parallel_size: int = 1  # Pipeline parallelism degree
    max_parallel_loading_workers: Optional[int] = None  # Model loading workers
    ray_workers_use_nsight: bool = False  # Enable Nsight profiling
    block_size: int = 16  # Attention block size
    enable_prefix_caching: bool = False  # Enable prefix caching
    disable_custom_all_reduce: bool = False  # Disable custom all-reduce

    # Advanced Options
    preemption_mode: Optional[str] = None  # Preemption strategy
    num_lookahead_slots: int = 0  # Speculative decoding slots
    seed: int = 0  # Random seed
    num_gpu_blocks_override: Optional[int] = None  # Override GPU block count
    max_seq_len_to_capture: int = 8192  # Maximum sequence length for CUDA graphs
    disable_sliding_window: bool = False  # Disable sliding window attention

    # Multimodal Support
    image_input_type: Optional[str] = None  # Image input format
    image_token_id: Optional[int] = None  # Image token ID
    image_input_shape: Optional[str] = None  # Image input dimensions
    image_feature_size: Optional[int] = None  # Image feature size

    # Scheduling
    scheduler_delay_factor: float = 0.0  # Scheduler delay factor
    enable_chunked_prefill: Optional[bool] = None  # Chunked prefill optimization
```
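
Because the fields above are plain attributes with defaults, a settings object of this shape can be built by overriding a few values and then expanded into keyword arguments. The sketch below uses a small stand-in dataclass (`MiniEngineArgs`, a hypothetical name, with three fields copied from the listing) rather than vLLM itself, so it runs without a GPU. It assumes the real class behaves like a standard Python dataclass.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class MiniEngineArgs:
    """Hypothetical stand-in with a few fields copied from the listing above."""

    model: str
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None


# Start from the defaults, overriding nothing but the required field.
base = MiniEngineArgs(model="microsoft/DialoGPT-medium")

# replace() returns a copy with selected fields overridden; `base` is untouched.
tuned = replace(base, gpu_memory_utilization=0.8, max_model_len=2048)

# asdict() converts the dataclass into a plain dict usable as **kwargs.
kwargs = asdict(tuned)
print(kwargs)
```

Keeping a shared `base` and deriving variants with `replace()` avoids repeating the common settings in every deployment profile.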

### Async Engine Arguments

Extended configuration for asynchronous inference engines with additional options for concurrent request handling and streaming.

```python { .api }
class AsyncEngineArgs(EngineArgs):
    # Inherits all EngineArgs options plus:
    worker_use_ray: bool = False  # Use Ray for distributed workers
    engine_use_ray: bool = False  # Use Ray for engine management
    disable_log_requests: bool = False  # Disable request logging
    max_log_len: Optional[int] = None  # Maximum log length
```

### Model Configuration Options

Specialized configuration options for different model types and architectures.

```python { .api }
# Model Data Types
class ModelDType(str, Enum):
    AUTO = "auto"
    HALF = "half"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    FLOAT32 = "float32"

# Quantization Methods
class QuantizationMethods(str, Enum):
    AWQ = "awq"
    GPTQ = "gptq"
    SQUEEZELLM = "squeezellm"
    FP8 = "fp8"
    BITSANDBYTES = "bitsandbytes"

# Load Formats
class LoadFormats(str, Enum):
    AUTO = "auto"
    PT = "pt"
    SAFETENSORS = "safetensors"
    NPCACHE = "npcache"
    DUMMY = "dummy"
```
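
String-valued enums like these are convenient for checking user input before it reaches the engine. The sketch below copies `QuantizationMethods` from the listing above and adds a hypothetical helper, `parse_quantization`, which is not part of vLLM. It only illustrates how a `str`-based `Enum` can be looked up by value and how an unknown value can be rejected with a readable message.

```python
from enum import Enum
from typing import Optional


class QuantizationMethods(str, Enum):
    AWQ = "awq"
    GPTQ = "gptq"
    SQUEEZELLM = "squeezellm"
    FP8 = "fp8"
    BITSANDBYTES = "bitsandbytes"


def parse_quantization(value: Optional[str]) -> Optional[QuantizationMethods]:
    """Hypothetical helper: map a user-supplied string to an enum member."""
    if value is None:
        return None  # No quantization requested.
    try:
        # Enum lookup by value; lowercasing makes the check case-insensitive.
        return QuantizationMethods(value.lower())
    except ValueError:
        allowed = ", ".join(m.value for m in QuantizationMethods)
        raise ValueError(
            f"unknown quantization {value!r}; expected one of: {allowed}"
        ) from None


method = parse_quantization("AWQ")
# A str-based Enum member compares equal to its plain string value.
print(method is QuantizationMethods.AWQ, method == "awq")
```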

### Device and Platform Configuration

Configuration options for different hardware platforms and deployment environments.

```python { .api }
class Device(str, Enum):
    GPU = "gpu"
    CPU = "cpu"
    TPU = "tpu"
    XPU = "xpu"

class DeviceConfig:
    device: Device  # Target device type
    device_ids: Optional[List[int]] = None  # Specific device IDs
    placement_group: Optional[str] = None  # Ray placement group
```

## Usage Examples

### Basic GPU Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Simple GPU setup
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    max_model_len=2048
)

# EngineArgs is a dataclass, so asdict() expands it into keyword arguments
llm = LLM(**asdict(args))
```

### Multi-GPU Distributed Setup

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Multi-GPU configuration (4 x 2 = 8 GPUs in total)
args = EngineArgs(
    model="microsoft/DialoGPT-large",
    tensor_parallel_size=4,  # 4-way tensor parallelism within each stage
    pipeline_parallel_size=2,  # 2 pipeline stages
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    trust_remote_code=True
)

# EngineArgs is a dataclass, so asdict() expands it into keyword arguments
llm = LLM(**asdict(args))
```
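
Tensor and pipeline parallelism multiply: each pipeline stage is itself split across `tensor_parallel_size` devices, so the setup needs the product of the two values in GPUs. The helper names below (`required_gpus`, `fits`) are hypothetical and exist only to make that arithmetic explicit.

```python
def required_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Total GPUs needed: every pipeline stage is sharded tensor-parallel."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * pipeline_parallel_size


def fits(tensor_parallel_size: int, pipeline_parallel_size: int,
         available_gpus: int) -> bool:
    """Check a parallel layout against the number of GPUs on hand."""
    return required_gpus(tensor_parallel_size, pipeline_parallel_size) <= available_gpus


# The layout with tensor_parallel_size=4 and pipeline_parallel_size=2:
print(required_gpus(4, 2))            # 8
print(fits(4, 2, available_gpus=4))   # False
```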

### Quantized Model Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# AWQ quantized model
args = EngineArgs(
    model="microsoft/DialoGPT-medium-awq",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.95,
    max_model_len=8192
)

# EngineArgs is a dataclass, so asdict() expands it into keyword arguments
llm = LLM(**asdict(args))
```

### Async Engine Configuration

```python
from vllm import AsyncLLMEngine, AsyncEngineArgs

# Async engine with Ray
async_args = AsyncEngineArgs(
    model="microsoft/DialoGPT-medium",
    worker_use_ray=True,
    engine_use_ray=True,
    tensor_parallel_size=2,
    max_num_seqs=128,
    gpu_memory_utilization=0.9
)

engine = AsyncLLMEngine.from_engine_args(async_args)
```

### Memory-Optimized Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Optimize for memory efficiency
args = EngineArgs(
    model="microsoft/DialoGPT-small",
    gpu_memory_utilization=0.95,
    swap_space=8,  # 8 GiB of CPU swap space
    cpu_offload_gb=2,  # Offload 2 GB to CPU
    max_num_batched_tokens=1024,  # Smaller batches
    enable_prefix_caching=True,  # Cache common prefixes
    block_size=8  # Smaller attention blocks
)

# EngineArgs is a dataclass, so asdict() expands it into keyword arguments
llm = LLM(**asdict(args))
```
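
To see how `block_size`, `gpu_memory_utilization`, and model size interact, it can help to do the arithmetic by hand. The sketch below is a back-of-the-envelope estimate, not vLLM's actual memory profiler. It assumes a standard transformer KV cache layout (one key and one value tensor per layer), ignores activation memory, and uses made-up model dimensions. The function names are hypothetical.

```python
def kv_block_bytes(block_size: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes for one KV cache block: keys + values, across all layers."""
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes


def estimate_num_blocks(total_gpu_bytes: int, gpu_memory_utilization: float,
                        weights_bytes: int, block_bytes: int) -> int:
    """Rough block count: the usable fraction of memory minus the weights.

    Activation and runtime overhead are ignored, so this is an upper bound.
    """
    budget = int(total_gpu_bytes * gpu_memory_utilization) - weights_bytes
    return max(budget, 0) // block_bytes


GiB = 1024 ** 3

# Illustrative (assumed) model: 24 layers, 16 KV heads, head_dim 64, fp16.
block = kv_block_bytes(block_size=16, num_layers=24, num_kv_heads=16, head_dim=64)
blocks = estimate_num_blocks(16 * GiB, 0.9, weights_bytes=2 * GiB, block_bytes=block)

# Each block holds `block_size` tokens, so this bounds the cached token count.
print(block, blocks, blocks * 16)
```

Halving `block_size` halves the bytes per block, so the same budget yields twice as many, smaller blocks. The total number of cacheable tokens stays roughly the same.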

### Development and Debugging Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

# Development setup for reproducible, inspectable runs
args = EngineArgs(
    model="microsoft/DialoGPT-medium",
    tensor_parallel_size=1,
    seed=42,  # Reproducible results
    disable_custom_all_reduce=True,  # Use standard operations
    max_logprobs=10,  # Detailed probability info
    trust_remote_code=True,  # For custom models
    revision="main"  # Pin the model revision (branch, tag, or commit)
)

# EngineArgs is a dataclass, so asdict() expands it into keyword arguments
llm = LLM(**asdict(args))
```

## Configuration Validation

```python { .api }
def validate_config(args: EngineArgs) -> None:
    """
    Validate engine configuration parameters.

    Raises:
        ValueError: If configuration is invalid
        RuntimeError: If hardware requirements are not met
    """

def get_default_config_for_device(device: Device) -> EngineArgs:
    """
    Get recommended default configuration for target device.

    Parameters:
    - device: Target deployment device

    Returns:
        EngineArgs with device-optimized defaults
    """
```
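
The kind of check described above can be approximated before any engine is created. The sketch below is a hypothetical pre-flight helper (`check_basic_settings`), not the library's validation logic. It applies only range checks that follow directly from the parameter descriptions in this document and returns the problems it finds instead of raising.

```python
from typing import List, Optional


def check_basic_settings(gpu_memory_utilization: float,
                         tensor_parallel_size: int,
                         pipeline_parallel_size: int,
                         max_model_len: Optional[int] = None) -> List[str]:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems: List[str] = []
    # The utilization setting is a fraction of GPU memory.
    if not 0.0 < gpu_memory_utilization <= 1.0:
        problems.append("gpu_memory_utilization must be in (0, 1]")
    # Parallelism degrees count devices, so they must be at least 1.
    if tensor_parallel_size < 1:
        problems.append("tensor_parallel_size must be >= 1")
    if pipeline_parallel_size < 1:
        problems.append("pipeline_parallel_size must be >= 1")
    # None means "derive from the model config"; otherwise it must be positive.
    if max_model_len is not None and max_model_len < 1:
        problems.append("max_model_len must be positive when set")
    return problems


print(check_basic_settings(0.9, 4, 2, max_model_len=4096))  # []
print(check_basic_settings(1.5, 0, 1))
```

Returning a list rather than raising on the first failure lets a caller report every misconfiguration in one pass.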

## Environment Variables

vLLM respects numerous environment variables for configuration:

```python
# Key environment variables
VLLM_WORKER_MULTIPROC_METHOD  # Worker process method
VLLM_USE_MODELSCOPE  # Use ModelScope for model downloads
VLLM_TARGET_DEVICE  # Override target device
VLLM_GPU_MEMORY_UTILIZATION  # Default GPU memory usage
VLLM_HOST  # Server host address
VLLM_PORT  # Server port
VLLM_USE_RAY_COMPILED_DAG  # Use compiled Ray DAGs
```
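
Environment variables arrive as strings, so a launcher script usually has to parse them before use. The sketch below reads three of the variable names listed above with the standard library. The parsing rules (which strings count as true, treating an unset port as `None`) are assumptions for illustration and may differ from how vLLM interprets these variables. `read_env_settings` is a hypothetical helper.

```python
import os
from typing import Any, Dict, Mapping, Optional


def read_env_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: parse a few vLLM-related variables into typed values."""
    env = os.environ if environ is None else environ
    port = env.get("VLLM_PORT")
    return {
        # Unset stays None so the caller can apply its own default.
        "port": int(port) if port is not None else None,
        # Assumed truthy spellings; anything else is treated as false.
        "use_modelscope": env.get("VLLM_USE_MODELSCOPE", "false").lower() in ("1", "true"),
        "worker_multiproc_method": env.get("VLLM_WORKER_MULTIPROC_METHOD"),
    }


# Passing an explicit mapping keeps the example independent of the real environment.
settings = read_env_settings({"VLLM_PORT": "8000", "VLLM_USE_MODELSCOPE": "True"})
print(settings)
```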
