# Text Generation

Primary text generation functionality in vLLM, providing high-throughput inference with continuous batching and memory optimization. Supports various prompt formats, sampling strategies, and advanced features such as guided decoding and structured output generation.

## Capabilities

### Generate Text

Main method for generating text from prompts using the LLM. Supports batch processing, per-prompt sampling parameters, and advanced features such as LoRA adapters and guided decoding.

```python { .api }
def generate(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    sampling_params: Optional[Union[SamplingParams, Sequence[SamplingParams]]] = None,
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,
    priority: Optional[List[int]] = None
) -> List[RequestOutput]:
    """
    Generate text from prompts using the language model.

    Parameters:
    - prompts: Single prompt or sequence of prompts (str, TextPrompt, TokensPrompt, or EmbedsPrompt)
    - sampling_params: Parameters controlling generation behavior (temperature, top_p, etc.)
    - use_tqdm: Whether to show a progress bar for batch processing (keyword-only)
    - lora_request: LoRA adapter request for fine-tuned model variants (keyword-only)
    - priority: Priority levels for requests when priority scheduling is enabled (keyword-only)

    Returns:
    List of RequestOutput objects containing generated text and metadata
    """
```

### Beam Search Generation

Generate text with beam search, systematically exploring multiple generation paths to find high-quality outputs.

```python { .api }
def beam_search(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    params: BeamSearchParams
) -> List[BeamSearchOutput]:
    """
    Generate text using the beam search algorithm.

    Parameters:
    - prompts: Input prompts for generation
    - params: Beam search parameters (beam_width, length_penalty, etc.)

    Returns:
    List of BeamSearchOutput objects with multiple candidate sequences
    """
```
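
Conceptually, the algorithm behind this API can be sketched in a few lines of self-contained Python (the "model" here is a hypothetical fixed transition table, not vLLM): each step expands every beam with its candidate next tokens and keeps only the `beam_width` sequences with the highest cumulative log probability.

```python
import math

def toy_next_token_logprobs(prefix):
    # Hypothetical toy "model": a fixed next-token distribution
    # conditioned only on the last token of the prefix.
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"a": 0.1, "b": 0.7, "</s>": 0.2},
        "b": {"a": 0.5, "b": 0.1, "</s>": 0.4},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}

def beam_search(beam_width=2, max_tokens=3):
    # Each beam is (tokens, cumulative_logprob), mirroring BeamSearchSequence.
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for tokens, score in beams:
            for tok, lp in toy_next_token_logprobs(tokens).items():
                if tok == "</s>":
                    finished.append((tokens, score + lp))
                else:
                    candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_width highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)  # unfinished beams also compete at the end
    return sorted(finished, key=lambda c: c[1], reverse=True)

best = beam_search()
print(best[0])  # highest-cumulative-logprob sequence
```

Completed sequences are set aside and compete with the surviving beams at the end, which is why `cumulative_logprob` appears on `BeamSearchSequence` in the Types section below.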

### Guided Decoding

Generate structured output that follows a specific pattern, such as a JSON schema, regular expression, list of choices, or context-free grammar.

```python { .api }
# Used through SamplingParams.guided_decoding
class GuidedDecodingParams:
    json: Optional[Union[str, dict]] = None
    regex: Optional[str] = None
    choice: Optional[list[str]] = None
    grammar: Optional[str] = None
    json_object: Optional[bool] = None
    backend: Optional[str] = None
    whitespace_pattern: Optional[str] = None
```
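
Conceptually, guided decoding backends work by masking: at every step, tokens that cannot extend the output toward a valid result are excluded before sampling. A character-level, self-contained sketch of the idea behind the `choice` constraint (hypothetical scorer, no vLLM):

```python
def constrained_greedy_decode(score_fn, choices):
    # Greedy decoding where only characters that keep the output a
    # prefix of some allowed choice may be emitted (the "mask").
    out = ""
    while out not in choices:
        valid = {c[len(out)] for c in choices
                 if c.startswith(out) and len(c) > len(out)}
        # Greedy pick among unmasked characters only.
        out += max(valid, key=score_fn)
    return out

# Hypothetical scorer that prefers later letters in the alphabet.
result = constrained_greedy_decode(lambda ch: ord(ch), ["yes", "no"])
print(result)  # "yes"
```

Real backends apply the same idea per token against a compiled schema, regex, or grammar automaton rather than per character.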

## Usage Examples

### Basic Text Generation

```python
from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(model="microsoft/DialoGPT-medium")

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

# Generate text
prompts = ["The future of AI is", "Once upon a time"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
```

### Guided JSON Generation

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="microsoft/DialoGPT-medium")

# Define JSON schema
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"}
    },
    "required": ["name", "age", "city"]
}

# Configure guided decoding
guided_params = GuidedDecodingParams(json=json_schema)
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=150,
    guided_decoding=guided_params
)

prompt = "Generate a person's information:"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)  # Valid JSON output
```

### Batch Generation with Different Parameters

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/DialoGPT-medium")

prompts = ["Creative story:", "Technical explanation:", "Casual conversation:"]

# One SamplingParams per prompt (list length must match the prompts)
sampling_params = [
    SamplingParams(temperature=1.2, top_p=0.9),   # Creative
    SamplingParams(temperature=0.3, top_p=0.95),  # Technical
    SamplingParams(temperature=0.8, top_p=0.9)    # Casual
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"{output.prompt} -> {output.outputs[0].text}")
```

### Using Pre-tokenized Input

```python
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="microsoft/DialoGPT-medium")

# Pass token IDs directly via TokensPrompt (useful for custom tokenization)
prompt = TokensPrompt(prompt_token_ids=[1, 2, 3, 4, 5])  # Your tokenized input
sampling_params = SamplingParams(temperature=0.8)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```

## Types

```python { .api }
class RequestOutput:
    request_id: str
    prompt: Optional[str]
    prompt_token_ids: list[int]
    prompt_logprobs: Optional[PromptLogprobs]
    outputs: list[CompletionOutput]
    finished: bool
    metrics: Optional[RequestMetrics]
    lora_request: Optional[LoRARequest]

class CompletionOutput:
    index: int
    text: str
    token_ids: list[int]
    cumulative_logprob: Optional[float]
    logprobs: Optional[SampleLogprobs]
    finish_reason: Optional[str]  # "stop", "length", "abort"
    stop_reason: Union[int, str, None]  # Specific stop token/string
    lora_request: Optional[LoRARequest]

class BeamSearchOutput:
    sequences: list[BeamSearchSequence]
    finished: bool

class BeamSearchSequence:
    text: str
    token_ids: list[int]
    cumulative_logprob: float
```
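
A small self-contained sketch showing how these types are typically consumed, e.g. picking the best completion from a request. The dataclasses below are mocks that mirror a subset of the fields above, not vLLM's actual classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompletionOutput:
    index: int
    text: str
    cumulative_logprob: Optional[float]
    finish_reason: Optional[str]  # "stop", "length", or "abort"

@dataclass
class RequestOutput:
    request_id: str
    prompt: Optional[str]
    outputs: list
    finished: bool = True

def best_completion(req):
    # Prefer completions that stopped naturally; rank by cumulative logprob.
    stopped = [c for c in req.outputs if c.finish_reason == "stop"] or req.outputs
    return max(stopped, key=lambda c: c.cumulative_logprob or float("-inf"))

req = RequestOutput(
    request_id="req-0",
    prompt="Hello",
    outputs=[
        CompletionOutput(0, " world", -1.2, "stop"),
        CompletionOutput(1, " there, how are", -0.9, "length"),
    ],
)
print(best_completion(req).text)  # " world"
```

The same selection logic applies to real `RequestOutput` objects returned by `llm.generate`, since they expose the same field names.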