# Voice Pipeline

The Voice Pipeline provides a framework for building voice processing workflows with speech-to-text (STT) and text-to-speech (TTS) capabilities. It enables creating custom voice assistants with pluggable audio models.

## Overview

The Voice Pipeline provides:

- Modular STT and TTS components
- Custom model integration
- Audio processing pipelines
- Voice-based agent interactions
## Capabilities

### Voice Pipeline

The main pipeline class for voice processing.

```python { .api }
class VoicePipeline:
    """
    Pipeline for voice processing.

    Coordinates STT, agent processing, and TTS
    for complete voice interaction workflows.
    """
```

Usage example:

```python
from agents.voice import VoicePipeline, STTModel, TTSModel

# Create a pipeline with STT and TTS models
pipeline = VoicePipeline(
    stt_model=my_stt_model,
    tts_model=my_tts_model,
    agent=my_agent
)

# Process voice input
audio_output = await pipeline.process(audio_input)
```

### STT Model Interface

Interface for speech-to-text models.

```python { .api }
class STTModel:
    """
    Speech-to-text model interface.

    Implement this to integrate custom STT models.
    """

    async def transcribe(
        self,
        audio_data: bytes,
        **kwargs
    ) -> str:
        """
        Transcribe audio to text.

        Parameters:
        - audio_data: Raw audio bytes
        - **kwargs: Additional parameters

        Returns:
        - str: Transcribed text
        """
```

Implementation example:

```python
from agents.voice import STTModel

class MySTTModel(STTModel):
    """Custom STT implementation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Initialize your STT model here

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Call your STT API or model
        result = await my_stt_api.transcribe(audio_data)
        return result.text

# Use the custom STT model
stt_model = MySTTModel("my-stt-v1")
```

### TTS Model Interface

Interface for text-to-speech models.

```python { .api }
class TTSModel:
    """
    Text-to-speech model interface.

    Implement this to integrate custom TTS models.
    """

    async def synthesize(
        self,
        text: str,
        **kwargs
    ) -> bytes:
        """
        Synthesize text to audio.

        Parameters:
        - text: Text to synthesize
        - **kwargs: Additional parameters (voice, rate, etc.)

        Returns:
        - bytes: Audio data
        """
```

Implementation example:

```python
from agents.voice import TTSModel

class MyTTSModel(TTSModel):
    """Custom TTS implementation."""

    def __init__(self, voice_id: str):
        self.voice_id = voice_id
        # Initialize your TTS model here

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Call your TTS API or model
        audio = await my_tts_api.synthesize(
            text=text,
            voice=self.voice_id,
            **kwargs
        )
        return audio.data

# Use the custom TTS model
tts_model = MyTTSModel("voice-001")
```

## Complete Voice Workflow

Building a complete voice assistant:

```python
from agents import Agent, function_tool
from agents.voice import VoicePipeline

# Define tools
@function_tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: Sunny, 72°F"

# Create the agent
agent = Agent(
    name="Voice Assistant",
    instructions="You are a voice assistant. Keep responses concise.",
    tools=[get_weather]
)

# Create the voice pipeline
pipeline = VoicePipeline(
    stt_model=MySTTModel("stt-model"),
    tts_model=MyTTSModel("voice-001"),
    agent=agent
)

# Process voice input
async def handle_voice_input(audio_input: bytes):
    """Process voice input and return voice output."""
    audio_output = await pipeline.process(audio_input)
    return audio_output
```
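
A workflow like this should not let transient STT/TTS failures crash the handler. As a hedged sketch, the helper below retries a process-style coroutine with exponential backoff; `process_with_retry` and its parameters are illustrative names, not part of the library API, and work with any coroutine that takes audio bytes.

```python
import asyncio

async def process_with_retry(process, audio_input: bytes,
                             retries: int = 2, base_delay: float = 0.5):
    """Await a process-style coroutine, retrying transient failures.

    `process` is any coroutine function taking audio bytes, e.g. a
    pipeline's process method. This is a sketch, not a library API.
    """
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return await process(audio_input)
        except Exception as exc:  # narrow this to your transport's error types
            last_exc = exc
            if attempt < retries:
                # Exponential backoff between attempts: 0.5s, 1s, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

In production, catch only the exceptions your STT/TTS transport actually raises, and consider returning a spoken fallback message instead of re-raising.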

## OpenAI STT/TTS Integration

Using OpenAI's speech APIs:

```python
from openai import AsyncOpenAI
from agents.voice import STTModel, TTSModel, VoicePipeline

class OpenAISTT(STTModel):
    """OpenAI Whisper STT."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Use OpenAI Whisper; the API expects a named file,
        # so pass a (filename, bytes) tuple rather than raw bytes.
        response = await self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio_data)
        )
        return response.text

class OpenAITTS(TTSModel):
    """OpenAI TTS."""

    def __init__(self, voice: str = "alloy"):
        self.client = AsyncOpenAI()
        self.voice = voice

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Use OpenAI TTS
        response = await self.client.audio.speech.create(
            model="tts-1",
            voice=kwargs.get("voice", self.voice),
            input=text
        )
        return response.content

# Use the OpenAI-backed models
pipeline = VoicePipeline(
    stt_model=OpenAISTT(),
    tts_model=OpenAITTS(voice="nova"),
    agent=agent
)
```

## Audio Processing

Working with audio data:

```python
import io

from pydub import AudioSegment

async def process_audio_file(file_path: str):
    """Process an audio file through the voice pipeline."""
    # Load the audio file
    audio = AudioSegment.from_file(file_path)

    # Convert to the required format (e.g., WAV)
    wav_buffer = io.BytesIO()
    audio.export(wav_buffer, format="wav")
    audio_data = wav_buffer.getvalue()

    # Process through the pipeline
    output_audio = await pipeline.process(audio_data)

    # Save the output
    output = AudioSegment.from_file(io.BytesIO(output_audio), format="wav")
    output.export("output.mp3", format="mp3")
```
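
If you would rather avoid the pydub dependency, raw PCM can be wrapped in a WAV container with the standard library alone. The helper below is an illustrative sketch (not part of the library); it assumes 16-bit little-endian PCM, which is a common input format for STT models.

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV container using only the stdlib."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)       # mono by default
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# 10 ms of 16 kHz mono silence -> a small but valid WAV payload
wav_bytes = pcm_to_wav(b"\x00\x00" * 160)
```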

## Streaming Audio

For real-time streaming, consider using the Realtime API instead:

```python
# For streaming audio, use the Realtime API
from agents.realtime import RealtimeAgent, RealtimeRunner

# The Realtime API provides better streaming support
```

## Voice Configuration

Configuring voice pipeline options:

```python
pipeline = VoicePipeline(
    stt_model=stt_model,
    tts_model=tts_model,
    agent=agent,
    stt_params={
        "language": "en",
        "temperature": 0.0
    },
    tts_params={
        "voice": "nova",
        "speed": 1.0
    }
)
```

## Best Practices

1. **Audio Format**: Use consistent audio formats (sample rate, channels, bit depth)
2. **Model Selection**: Choose STT/TTS models appropriate for your use case
3. **Latency**: Minimize end-to-end latency for a responsive user experience
4. **Error Handling**: Handle audio processing errors gracefully
5. **Voice Selection**: Choose natural-sounding voices for TTS
6. **Concise Responses**: Keep agent responses brief; long replies are tedious to listen to
7. **Testing**: Test with varied audio inputs, accents, and noise conditions
8. **Quality**: Monitor STT/TTS quality and adjust models or parameters as needed
9. **Caching**: Cache TTS output for frequently repeated phrases
10. **Streaming**: Use the Realtime API for streaming scenarios

## Installation

Voice features require additional dependencies:

```bash
pip install 'openai-agents[voice]'
```

## Examples Location

Complete voice pipeline examples are available in the repository:

- `examples/voice/` - Voice pipeline examples

Refer to these examples for complete implementation details.

## Note

The Voice Pipeline is for batch/file-based voice processing. For real-time voice interactions (phone calls, live conversations), use the Realtime API instead.

Choose the Voice Pipeline when you need:

- File-based audio processing
- Custom STT/TTS integration
- Batch voice processing
- Full control over the audio pipeline

Choose the Realtime API when you need:

- Real-time streaming audio
- Low-latency voice interactions
- Phone system integration
- Live conversational AI

For complete API reference and implementation details, refer to the source code and examples in the repository.