
tessl/pypi-openai-agents

Lightweight framework for building multi-agent workflows with LLMs, supporting handoffs, guardrails, tools, and 100+ LLM providers


docs/voice-pipeline.md

Voice Pipeline

The Voice Pipeline provides a framework for building voice processing workflows with speech-to-text (STT) and text-to-speech (TTS) capabilities. It enables creating custom voice assistants with pluggable audio models.

Overview

The Voice Pipeline provides:

  • Modular STT and TTS components
  • Custom model integration
  • Audio processing pipelines
  • Voice-based agent interactions
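At its core, the pipeline coordinates three steps: transcribe audio, run the agent on the text, and synthesize the reply. The sketch below shows that flow with dummy stand-ins (`EchoSTT`, `EchoAgent`, `EchoTTS` are illustrative names, not part of the library):

```python
import asyncio

# Minimal sketch of the flow a voice pipeline coordinates:
# audio in -> STT -> agent -> TTS -> audio out.

class EchoSTT:
    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # A real model would decode speech; here we pretend the bytes are UTF-8 text
        return audio_data.decode("utf-8")

class EchoAgent:
    async def run(self, text: str) -> str:
        # A real agent would call an LLM; here we just echo the request
        return f"You said: {text}"

class EchoTTS:
    async def synthesize(self, text: str, **kwargs) -> bytes:
        # A real model would render speech; here we return the text as bytes
        return text.encode("utf-8")

async def run_pipeline(audio_in: bytes) -> bytes:
    text = await EchoSTT().transcribe(audio_in)
    reply = await EchoAgent().run(text)
    return await EchoTTS().synthesize(reply)

audio_out = asyncio.run(run_pipeline(b"hello"))
print(audio_out)  # b'You said: hello'
```

Each stage is awaited in turn, which is why the model interfaces below are defined as async methods.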

Capabilities

Voice Pipeline

Main pipeline class for voice processing.

class VoicePipeline:
    """
    Pipeline for voice processing.

    Coordinates STT, agent processing, and TTS
    for complete voice interaction workflows.
    """

Usage example:

from agents.voice import VoicePipeline, STTModel, TTSModel

# Create pipeline with STT and TTS
pipeline = VoicePipeline(
    stt_model=my_stt_model,
    tts_model=my_tts_model,
    agent=my_agent
)

# Process voice input
audio_output = await pipeline.process(audio_input)

STT Model Interface

Interface for speech-to-text models.

class STTModel:
    """
    Speech-to-text model interface.

    Implement this to integrate custom STT models.
    """

    async def transcribe(
        self,
        audio_data: bytes,
        **kwargs
    ) -> str:
        """
        Transcribe audio to text.

        Parameters:
        - audio_data: Raw audio bytes
        - **kwargs: Additional parameters

        Returns:
        - str: Transcribed text
        """

Implementation example:

from agents.voice import STTModel

class MySTTModel(STTModel):
    """Custom STT implementation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Initialize your STT model

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Call your STT API or model
        result = await my_stt_api.transcribe(audio_data)
        return result.text

# Use custom STT
stt_model = MySTTModel("my-stt-v1")

TTS Model Interface

Interface for text-to-speech models.

class TTSModel:
    """
    Text-to-speech model interface.

    Implement this to integrate custom TTS models.
    """

    async def synthesize(
        self,
        text: str,
        **kwargs
    ) -> bytes:
        """
        Synthesize text to audio.

        Parameters:
        - text: Text to synthesize
        - **kwargs: Additional parameters (voice, rate, etc.)

        Returns:
        - bytes: Audio data
        """

Implementation example:

from agents.voice import TTSModel

class MyTTSModel(TTSModel):
    """Custom TTS implementation."""

    def __init__(self, voice_id: str):
        self.voice_id = voice_id
        # Initialize your TTS model

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Call your TTS API or model
        audio = await my_tts_api.synthesize(
            text=text,
            voice=self.voice_id,
            **kwargs
        )
        return audio.data

# Use custom TTS
tts_model = MyTTSModel("voice-001")

Complete Voice Workflow

Building a complete voice assistant:

from agents import Agent, function_tool
from agents.voice import VoicePipeline, STTModel, TTSModel

# Define tools
@function_tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: Sunny, 72°F"

# Create agent
agent = Agent(
    name="Voice Assistant",
    instructions="You are a voice assistant. Keep responses concise.",
    tools=[get_weather]
)

# Create voice pipeline
pipeline = VoicePipeline(
    stt_model=MySTTModel("stt-model"),
    tts_model=MyTTSModel("voice-001"),
    agent=agent
)

# Process voice input
async def handle_voice_input(audio_input: bytes):
    """Process voice input and return voice output."""
    audio_output = await pipeline.process(audio_input)
    return audio_output

OpenAI STT/TTS Integration

Using OpenAI's speech APIs:

from openai import AsyncOpenAI
from agents.voice import STTModel, TTSModel

class OpenAISTT(STTModel):
    """OpenAI Whisper STT."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Use OpenAI Whisper; the transcription endpoint expects a
        # named file, so wrap the raw bytes in a (filename, bytes) tuple
        response = await self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio_data)
        )
        return response.text

class OpenAITTS(TTSModel):
    """OpenAI TTS."""

    def __init__(self, voice: str = "alloy"):
        self.client = AsyncOpenAI()
        self.voice = voice

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Use OpenAI TTS
        response = await self.client.audio.speech.create(
            model="tts-1",
            voice=kwargs.get("voice", self.voice),
            input=text
        )
        return response.content

# Use OpenAI models
pipeline = VoicePipeline(
    stt_model=OpenAISTT(),
    tts_model=OpenAITTS(voice="nova"),
    agent=agent
)

Audio Processing

Working with audio data:

import io
from pydub import AudioSegment

async def process_audio_file(file_path: str):
    """Process audio file through voice pipeline."""

    # Load audio file
    audio = AudioSegment.from_file(file_path)

    # Convert to required format (e.g., wav)
    wav_buffer = io.BytesIO()
    audio.export(wav_buffer, format="wav")
    audio_data = wav_buffer.getvalue()

    # Process through pipeline
    output_audio = await pipeline.process(audio_data)

    # Save output
    output = AudioSegment.from_file(io.BytesIO(output_audio), format="wav")
    output.export("output.mp3", format="mp3")
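If your input is raw PCM samples rather than an encoded file, the standard-library `wave` module can package them as WAV bytes without pydub. A minimal sketch, assuming 16-bit mono audio at 16 kHz (common defaults for STT models, but check what your model expects):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# 0.5 seconds of silence at 16 kHz mono (8000 frames of 2 bytes each)
wav_bytes = pcm_to_wav(b"\x00\x00" * 8000)
```

The resulting bytes can be fed to the pipeline directly or written to disk as a `.wav` file.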

Streaming Audio

For real-time streaming, consider using the Realtime API instead:

# For streaming audio, use Realtime API
from agents.realtime import RealtimeAgent, RealtimeRunner

# Realtime API provides better streaming support

Voice Configuration

Configuring voice pipeline options:

pipeline = VoicePipeline(
    stt_model=stt_model,
    tts_model=tts_model,
    agent=agent,
    stt_params={
        "language": "en",
        "temperature": 0.0
    },
    tts_params={
        "voice": "nova",
        "speed": 1.0
    }
)

Best Practices

  1. Audio Format: Use consistent audio formats (sample rate, channels, etc.)
  2. Model Selection: Choose appropriate STT/TTS models for your use case
  3. Latency: Minimize latency for better user experience
  4. Error Handling: Handle audio processing errors gracefully
  5. Voice Selection: Choose natural-sounding voices for TTS
  6. Concise Responses: Keep agent responses brief for voice
  7. Testing: Test with various audio inputs and accents
  8. Quality: Monitor STT/TTS quality and adjust as needed
  9. Caching: Cache TTS output for repeated phrases
  10. Streaming: Use Realtime API for streaming scenarios
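Practice 9 (caching) can be implemented as a thin wrapper around any TTS model. A sketch, assuming the `synthesize` interface described above (`CachingTTS` and `CountingTTS` are illustrative names, not library classes):

```python
import asyncio

class CachingTTS:
    """Wrap a TTS model and memoize synthesized audio per (text, voice)."""

    def __init__(self, inner):
        self.inner = inner          # any object with an async synthesize() method
        self._cache: dict = {}

    async def synthesize(self, text: str, **kwargs) -> bytes:
        key = (text, kwargs.get("voice"))
        if key not in self._cache:
            self._cache[key] = await self.inner.synthesize(text, **kwargs)
        return self._cache[key]

class CountingTTS:
    """Dummy TTS that counts how often it is actually invoked."""

    def __init__(self):
        self.calls = 0

    async def synthesize(self, text: str, **kwargs) -> bytes:
        self.calls += 1
        return text.encode("utf-8")

async def demo() -> int:
    inner = CountingTTS()
    tts = CachingTTS(inner)
    await tts.synthesize("Hello!")
    await tts.synthesize("Hello!")  # second call is served from the cache
    return inner.calls

calls = asyncio.run(demo())
print(calls)  # 1
```

For greeting lines, confirmations, and other repeated phrases this avoids paying TTS latency and cost twice. A production version would bound the cache size (e.g. with an LRU policy).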

Installation

Voice features require additional dependencies:

pip install 'openai-agents[voice]'

Examples Location

Complete voice pipeline examples are available in the repository:

  • examples/voice/ - Voice pipeline examples

Refer to these examples for complete implementation details.

Note

The Voice Pipeline is for batch/file-based voice processing. For real-time voice interactions (phone calls, live conversations), use the Realtime API instead.

Choose Voice Pipeline when you need:

  • File-based audio processing
  • Custom STT/TTS integration
  • Batch voice processing
  • Full control over audio pipeline

Choose Realtime API when you need:

  • Real-time streaming audio
  • Low-latency voice interactions
  • Phone system integration
  • Live conversational AI

For complete API reference and implementation details, refer to the source code and examples in the repository.
