
# Voice Pipeline

The Voice Pipeline provides a framework for building voice processing workflows with speech-to-text (STT) and text-to-speech (TTS) capabilities. It enables creating custom voice assistants with pluggable audio models.

## Overview

The Voice Pipeline provides:

- Modular STT and TTS components
- Custom model integration
- Audio processing pipelines
- Voice-based agent interactions

## Capabilities

### Voice Pipeline

Main pipeline class for voice processing.

```python { .api }
class VoicePipeline:
    """
    Pipeline for voice processing.

    Coordinates STT, agent processing, and TTS
    for complete voice interaction workflows.
    """
```

Usage example:

```python
from agents.voice import VoicePipeline, STTModel, TTSModel

# Create pipeline with STT and TTS
pipeline = VoicePipeline(
    stt_model=my_stt_model,
    tts_model=my_tts_model,
    agent=my_agent
)

# Process voice input
audio_output = await pipeline.process(audio_input)
```

### STT Model Interface

Interface for speech-to-text models.

```python { .api }
class STTModel:
    """
    Speech-to-text model interface.

    Implement this to integrate custom STT models.
    """

    async def transcribe(
        self,
        audio_data: bytes,
        **kwargs
    ) -> str:
        """
        Transcribe audio to text.

        Parameters:
        - audio_data: Raw audio bytes
        - **kwargs: Additional parameters

        Returns:
        - str: Transcribed text
        """
```

Implementation example:

```python
from agents.voice import STTModel

class MySTTModel(STTModel):
    """Custom STT implementation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Initialize your STT model

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Call your STT API or model
        result = await my_stt_api.transcribe(audio_data)
        return result.text

# Use custom STT
stt_model = MySTTModel("my-stt-v1")
```

### TTS Model Interface

Interface for text-to-speech models.

```python { .api }
class TTSModel:
    """
    Text-to-speech model interface.

    Implement this to integrate custom TTS models.
    """

    async def synthesize(
        self,
        text: str,
        **kwargs
    ) -> bytes:
        """
        Synthesize text to audio.

        Parameters:
        - text: Text to synthesize
        - **kwargs: Additional parameters (voice, rate, etc.)

        Returns:
        - bytes: Audio data
        """
```

Implementation example:

```python
from agents.voice import TTSModel

class MyTTSModel(TTSModel):
    """Custom TTS implementation."""

    def __init__(self, voice_id: str):
        self.voice_id = voice_id
        # Initialize your TTS model

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Call your TTS API or model
        audio = await my_tts_api.synthesize(
            text=text,
            voice=self.voice_id,
            **kwargs
        )
        return audio.data

# Use custom TTS
tts_model = MyTTSModel("voice-001")
```

## Complete Voice Workflow

Building a complete voice assistant:

```python
from agents import Agent, function_tool
from agents.voice import VoicePipeline, STTModel, TTSModel

# Define tools
@function_tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: Sunny, 72°F"

# Create agent
agent = Agent(
    name="Voice Assistant",
    instructions="You are a voice assistant. Keep responses concise.",
    tools=[get_weather]
)

# Create voice pipeline
pipeline = VoicePipeline(
    stt_model=MySTTModel("stt-model"),
    tts_model=MyTTSModel("voice-001"),
    agent=agent
)

# Process voice input
async def handle_voice_input(audio_input: bytes):
    """Process voice input and return voice output."""
    audio_output = await pipeline.process(audio_input)
    return audio_output
```

## OpenAI STT/TTS Integration

Using OpenAI's speech APIs:

```python
from openai import AsyncOpenAI
from agents.voice import STTModel, TTSModel

class OpenAISTT(STTModel):
    """OpenAI Whisper STT."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Use OpenAI Whisper; pass raw bytes as a (filename, content)
        # tuple so the API can infer the audio format
        response = await self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio_data)
        )
        return response.text

class OpenAITTS(TTSModel):
    """OpenAI TTS."""

    def __init__(self, voice: str = "alloy"):
        self.client = AsyncOpenAI()
        self.voice = voice

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Use OpenAI TTS
        response = await self.client.audio.speech.create(
            model="tts-1",
            voice=kwargs.get("voice", self.voice),
            input=text
        )
        return response.content

# Use OpenAI models
pipeline = VoicePipeline(
    stt_model=OpenAISTT(),
    tts_model=OpenAITTS(voice="nova"),
    agent=agent
)
```

## Audio Processing

Working with audio data:

```python
import io
from pydub import AudioSegment

async def process_audio_file(file_path: str):
    """Process an audio file through the voice pipeline."""

    # Load audio file
    audio = AudioSegment.from_file(file_path)

    # Convert to required format (e.g., wav)
    wav_buffer = io.BytesIO()
    audio.export(wav_buffer, format="wav")
    audio_data = wav_buffer.getvalue()

    # Process through pipeline
    output_audio = await pipeline.process(audio_data)

    # Save output
    output = AudioSegment.from_file(io.BytesIO(output_audio), format="wav")
    output.export("output.mp3", format="mp3")
```
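STT models are typically sensitive to sample rate and channel count, so it can help to validate the format before audio enters the pipeline. A stdlib-only sketch for WAV input (the 16 kHz mono target is an assumption; match it to whatever your STT model expects):

```python
import io
import wave

def wav_format(audio_data: bytes) -> tuple[int, int, int]:
    """Return (sample_rate, channels, sample_width_bytes) of WAV bytes."""
    with wave.open(io.BytesIO(audio_data), "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

def check_wav(audio_data: bytes, rate: int = 16000, channels: int = 1) -> None:
    """Raise ValueError if WAV bytes do not match the expected format."""
    got_rate, got_channels, _ = wav_format(audio_data)
    if (got_rate, got_channels) != (rate, channels):
        raise ValueError(
            f"expected {rate} Hz / {channels}ch, got {got_rate} Hz / {got_channels}ch"
        )
```

Calling `check_wav` at the start of a handler turns a silent transcription-quality problem into an immediate, debuggable error.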

## Streaming Audio

For real-time streaming, consider using the Realtime API instead:

```python
# For streaming audio, use the Realtime API,
# which provides better streaming support
from agents.realtime import RealtimeAgent, RealtimeRunner
```

## Voice Configuration

Configuring voice pipeline options:

```python
pipeline = VoicePipeline(
    stt_model=stt_model,
    tts_model=tts_model,
    agent=agent,
    stt_params={
        "language": "en",
        "temperature": 0.0
    },
    tts_params={
        "voice": "nova",
        "speed": 1.0
    }
)
```

## Best Practices

1. **Audio Format**: Use consistent audio formats (sample rate, channels, etc.)
2. **Model Selection**: Choose appropriate STT/TTS models for your use case
3. **Latency**: Minimize latency for better user experience
4. **Error Handling**: Handle audio processing errors gracefully
5. **Voice Selection**: Choose natural-sounding voices for TTS
6. **Concise Responses**: Keep agent responses brief for voice
7. **Testing**: Test with various audio inputs and accents
8. **Quality**: Monitor STT/TTS quality and adjust as needed
9. **Caching**: Cache TTS output for repeated phrases
10. **Streaming**: Use the Realtime API for streaming scenarios
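As a sketch of practice 9, a minimal in-memory cache wrapper around the `synthesize` interface shown earlier (the cache key and unbounded dict are illustrative; a production version would bound memory use):

```python
import hashlib

class CachedTTS:
    """Wraps a TTS model with an in-memory cache for repeated phrases."""

    def __init__(self, tts_model):
        self.tts_model = tts_model
        self._cache: dict[str, bytes] = {}

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Key on the text plus any synthesis options (voice, speed, ...)
        key = hashlib.sha256(repr((text, sorted(kwargs.items()))).encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = await self.tts_model.synthesize(text, **kwargs)
        return self._cache[key]
```

Wrapping the pipeline's TTS model, e.g. `tts_model=CachedTTS(MyTTSModel("voice-001"))`, means repeated phrases such as greetings are synthesized only once.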

## Installation

Voice features require additional dependencies:

```bash
pip install 'openai-agents[voice]'
```

## Examples Location

Complete voice pipeline examples are available in the repository:

- `examples/voice/` - Voice pipeline examples

Refer to these examples for complete implementation details.

## Note

The Voice Pipeline is for batch/file-based voice processing. For real-time voice interactions (phone calls, live conversations), use the Realtime API instead.

Choose the Voice Pipeline when you need:

- File-based audio processing
- Custom STT/TTS integration
- Batch voice processing
- Full control over the audio pipeline

Choose the Realtime API when you need:

- Real-time streaming audio
- Low-latency voice interactions
- Phone system integration
- Live conversational AI

For the complete API reference and implementation details, refer to the source code and examples in the repository.