
tessl/pypi-openai-agents

Lightweight framework for building multi-agent workflows with LLMs, supporting handoffs, guardrails, tools, and 100+ LLM providers


docs/voice-pipeline.md

Voice Pipeline

The Voice Pipeline provides a framework for building voice processing workflows with speech-to-text (STT) and text-to-speech (TTS) capabilities. It enables creating custom voice assistants with pluggable audio models.

Overview

The Voice Pipeline provides:

  • Modular STT and TTS components
  • Custom model integration
  • Audio processing pipelines
  • Voice-based agent interactions
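At its core, the pipeline coordinates three steps: transcribe audio, run the agent on the text, and synthesize the reply. The sketch below shows that flow with dummy stand-ins (`EchoSTT`, `EchoAgent`, `EchoTTS` are illustrative names, not part of the library):

```python
import asyncio

# Minimal sketch of the flow a voice pipeline coordinates:
# audio in -> STT -> agent -> TTS -> audio out.

class EchoSTT:
    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # A real model would decode speech; here we pretend the bytes are UTF-8 text
        return audio_data.decode("utf-8")

class EchoAgent:
    async def run(self, text: str) -> str:
        # A real agent would call an LLM; here we just echo the request
        return f"You said: {text}"

class EchoTTS:
    async def synthesize(self, text: str, **kwargs) -> bytes:
        # A real model would render speech; here we return the text as bytes
        return text.encode("utf-8")

async def run_pipeline(audio_in: bytes) -> bytes:
    text = await EchoSTT().transcribe(audio_in)
    reply = await EchoAgent().run(text)
    return await EchoTTS().synthesize(reply)

audio_out = asyncio.run(run_pipeline(b"hello"))
print(audio_out)  # b'You said: hello'
```

Each stage is awaited in turn, which is why the model interfaces below are defined as async methods.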

Capabilities

Voice Pipeline

Main pipeline class for voice processing.

class VoicePipeline:
    """
    Pipeline for voice processing.

    Coordinates STT, agent processing, and TTS
    for complete voice interaction workflows.
    """

Usage example:

from agents.voice import VoicePipeline, STTModel, TTSModel

# Create pipeline with STT and TTS
pipeline = VoicePipeline(
    stt_model=my_stt_model,
    tts_model=my_tts_model,
    agent=my_agent
)

# Process voice input
audio_output = await pipeline.process(audio_input)

STT Model Interface

Interface for speech-to-text models.

class STTModel:
    """
    Speech-to-text model interface.

    Implement this to integrate custom STT models.
    """

    async def transcribe(
        self,
        audio_data: bytes,
        **kwargs
    ) -> str:
        """
        Transcribe audio to text.

        Parameters:
        - audio_data: Raw audio bytes
        - **kwargs: Additional parameters

        Returns:
        - str: Transcribed text
        """

Implementation example:

from agents.voice import STTModel

class MySTTModel(STTModel):
    """Custom STT implementation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Initialize your STT model

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Call your STT API or model
        result = await my_stt_api.transcribe(audio_data)
        return result.text

# Use custom STT
stt_model = MySTTModel("my-stt-v1")

TTS Model Interface

Interface for text-to-speech models.

class TTSModel:
    """
    Text-to-speech model interface.

    Implement this to integrate custom TTS models.
    """

    async def synthesize(
        self,
        text: str,
        **kwargs
    ) -> bytes:
        """
        Synthesize text to audio.

        Parameters:
        - text: Text to synthesize
        - **kwargs: Additional parameters (voice, rate, etc.)

        Returns:
        - bytes: Audio data
        """

Implementation example:

from agents.voice import TTSModel

class MyTTSModel(TTSModel):
    """Custom TTS implementation."""

    def __init__(self, voice_id: str):
        self.voice_id = voice_id
        # Initialize your TTS model

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Call your TTS API or model
        audio = await my_tts_api.synthesize(
            text=text,
            voice=self.voice_id,
            **kwargs
        )
        return audio.data

# Use custom TTS
tts_model = MyTTSModel("voice-001")

Complete Voice Workflow

Building a complete voice assistant:

from agents import Agent, function_tool
from agents.voice import VoicePipeline, STTModel, TTSModel

# Define tools
@function_tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: Sunny, 72°F"

# Create agent
agent = Agent(
    name="Voice Assistant",
    instructions="You are a voice assistant. Keep responses concise.",
    tools=[get_weather]
)

# Create voice pipeline
pipeline = VoicePipeline(
    stt_model=MySTTModel("stt-model"),
    tts_model=MyTTSModel("voice-001"),
    agent=agent
)

# Process voice input
async def handle_voice_input(audio_input: bytes):
    """Process voice input and return voice output."""
    audio_output = await pipeline.process(audio_input)
    return audio_output

OpenAI STT/TTS Integration

Using OpenAI's speech APIs:

from openai import AsyncOpenAI
from agents.voice import STTModel, TTSModel

class OpenAISTT(STTModel):
    """OpenAI Whisper STT."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def transcribe(self, audio_data: bytes, **kwargs) -> str:
        # Use OpenAI Whisper; the transcription endpoint expects a
        # named file, so wrap the raw bytes in a (filename, bytes) tuple
        response = await self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio_data)
        )
        return response.text

class OpenAITTS(TTSModel):
    """OpenAI TTS."""

    def __init__(self, voice: str = "alloy"):
        self.client = AsyncOpenAI()
        self.voice = voice

    async def synthesize(self, text: str, **kwargs) -> bytes:
        # Use OpenAI TTS
        response = await self.client.audio.speech.create(
            model="tts-1",
            voice=kwargs.get("voice", self.voice),
            input=text
        )
        return response.content

# Use OpenAI models
pipeline = VoicePipeline(
    stt_model=OpenAISTT(),
    tts_model=OpenAITTS(voice="nova"),
    agent=agent
)

Audio Processing

Working with audio data:

import io
from pydub import AudioSegment

async def process_audio_file(file_path: str):
    """Process audio file through voice pipeline."""

    # Load audio file
    audio = AudioSegment.from_file(file_path)

    # Convert to required format (e.g., wav)
    wav_buffer = io.BytesIO()
    audio.export(wav_buffer, format="wav")
    audio_data = wav_buffer.getvalue()

    # Process through pipeline
    output_audio = await pipeline.process(audio_data)

    # Save output
    output = AudioSegment.from_file(io.BytesIO(output_audio), format="wav")
    output.export("output.mp3", format="mp3")
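If your input is raw PCM samples rather than an encoded file, the standard-library `wave` module can package them as WAV bytes without pydub. A minimal sketch, assuming 16-bit mono audio at 16 kHz (common defaults for STT models, but check what your model expects):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# 0.5 seconds of silence at 16 kHz mono (8000 frames of 2 bytes each)
wav_bytes = pcm_to_wav(b"\x00\x00" * 8000)
```

The resulting bytes can be fed to the pipeline directly or written to disk as a `.wav` file.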

Streaming Audio

For real-time streaming, consider using the Realtime API instead:

# For streaming audio, use Realtime API
from agents.realtime import RealtimeAgent, RealtimeRunner

# Realtime API provides better streaming support

Voice Configuration

Configuring voice pipeline options:

pipeline = VoicePipeline(
    stt_model=stt_model,
    tts_model=tts_model,
    agent=agent,
    stt_params={
        "language": "en",
        "temperature": 0.0
    },
    tts_params={
        "voice": "nova",
        "speed": 1.0
    }
)

Best Practices

  1. Audio Format: Use consistent audio formats (sample rate, channels, etc.)
  2. Model Selection: Choose appropriate STT/TTS models for your use case
  3. Latency: Minimize latency for better user experience
  4. Error Handling: Handle audio processing errors gracefully
  5. Voice Selection: Choose natural-sounding voices for TTS
  6. Concise Responses: Keep agent responses brief for voice
  7. Testing: Test with various audio inputs and accents
  8. Quality: Monitor STT/TTS quality and adjust as needed
  9. Caching: Cache TTS output for repeated phrases
  10. Streaming: Use Realtime API for streaming scenarios
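Practice 9 (caching) can be implemented as a thin wrapper around any TTS model. A sketch, assuming the `synthesize` interface described above (`CachingTTS` and `CountingTTS` are illustrative names, not library classes):

```python
import asyncio

class CachingTTS:
    """Wrap a TTS model and memoize synthesized audio per (text, voice)."""

    def __init__(self, inner):
        self.inner = inner          # any object with an async synthesize() method
        self._cache: dict = {}

    async def synthesize(self, text: str, **kwargs) -> bytes:
        key = (text, kwargs.get("voice"))
        if key not in self._cache:
            self._cache[key] = await self.inner.synthesize(text, **kwargs)
        return self._cache[key]

class CountingTTS:
    """Dummy TTS that counts how often it is actually invoked."""

    def __init__(self):
        self.calls = 0

    async def synthesize(self, text: str, **kwargs) -> bytes:
        self.calls += 1
        return text.encode("utf-8")

async def demo() -> int:
    inner = CountingTTS()
    tts = CachingTTS(inner)
    await tts.synthesize("Hello!")
    await tts.synthesize("Hello!")  # second call is served from the cache
    return inner.calls

calls = asyncio.run(demo())
print(calls)  # 1
```

For greeting lines, confirmations, and other repeated phrases this avoids paying TTS latency and cost twice. A production version would bound the cache size (e.g. with an LRU policy).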

Installation

Voice features require additional dependencies:

pip install 'openai-agents[voice]'

Examples Location

Complete voice pipeline examples are available in the repository:

  • examples/voice/ - Voice pipeline examples

Refer to these examples for complete implementation details.

Note

The Voice Pipeline is for batch/file-based voice processing. For real-time voice interactions (phone calls, live conversations), use the Realtime API instead.

Choose Voice Pipeline when you need:

  • File-based audio processing
  • Custom STT/TTS integration
  • Batch voice processing
  • Full control over audio pipeline

Choose Realtime API when you need:

  • Real-time streaming audio
  • Low-latency voice interactions
  • Phone system integration
  • Live conversational AI

For complete API reference and implementation details, refer to the source code and examples in the repository.
