Audio Transcription

The audio transcription API converts audio files to text using Azure OpenAI's Whisper model. It supports multiple output formats including plain text, JSON, SRT, and VTT subtitles, with optional word and segment-level timestamps.

Imports

import org.springframework.ai.azure.openai.AzureOpenAiAudioTranscriptionModel;
import org.springframework.ai.azure.openai.AzureOpenAiAudioTranscriptionOptions;
import org.springframework.ai.azure.openai.metadata.AzureOpenAiAudioTranscriptionResponseMetadata;
import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.core.io.Resource;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.ClassPathResource;
import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.core.credential.AzureKeyCredential;
import com.azure.ai.openai.models.AudioTranscriptionFormat;
import com.azure.ai.openai.models.AudioTranscriptionTimestampGranularity;

AzureOpenAiAudioTranscriptionModel

The main class for audio transcription operations.

Thread Safety

Thread-Safe: AzureOpenAiAudioTranscriptionModel is fully thread-safe. A single instance can be shared across threads and handle multiple concurrent transcription requests.

Recommendation: Create one instance and reuse it across your application rather than creating new instances for each request.

Construction

class AzureOpenAiAudioTranscriptionModel implements TranscriptionModel {
    AzureOpenAiAudioTranscriptionModel(
        OpenAIClient openAIClient,
        AzureOpenAiAudioTranscriptionOptions options
    );
}

Parameters:

  • openAIClient: Azure OpenAI client instance (required, non-null, throws NullPointerException if null)
  • options: Audio transcription options (required, non-null, throws NullPointerException if null)

Example:

OpenAIClient openAIClient = new OpenAIClientBuilder()
    .credential(new AzureKeyCredential(apiKey))
    .endpoint(endpoint)
    .buildClient();

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .deploymentName("whisper")
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.JSON)
        .build();

AzureOpenAiAudioTranscriptionModel transcriptionModel =
    new AzureOpenAiAudioTranscriptionModel(openAIClient, options);

Core Methods

Transcribe Audio (Simple)

String call(Resource audioResource);

Transcribe an audio file and return the text as a string.

Parameters:

  • audioResource: The audio file resource (non-null, throws NullPointerException if null)

Returns: String containing the transcription text (never null, may be empty for silent audio)

Throws:

  • HttpResponseException: HTTP errors from Azure API (400, 401, 403, 429, 500)
  • ResourceNotFoundException: Deployment not found (404)
  • NonTransientAiException: Permanent failures (invalid file format, file too large)
  • TransientAiException: Temporary failures (rate limits, timeouts)
  • NullPointerException: If audioResource is null
  • IllegalArgumentException: If file doesn't exist or is not readable

Supported Audio Formats:

  • mp3, mp4, mpeg, mpga, m4a, wav, webm

File Size Limit: 25 MB maximum

Example:

Resource audioFile = new FileSystemResource("interview.mp3");
String transcription = transcriptionModel.call(audioFile);
System.out.println("Transcription: " + transcription);

Example with Different File Types:

// MP3 file
Resource mp3File = new FileSystemResource("audio.mp3");
String mp3Transcription = transcriptionModel.call(mp3File);

// WAV file
Resource wavFile = new FileSystemResource("recording.wav");
String wavTranscription = transcriptionModel.call(wavFile);

// M4A file
Resource m4aFile = new FileSystemResource("voice-memo.m4a");
String m4aTranscription = transcriptionModel.call(m4aFile);

Error Handling:

// FileTooLargeException, UnsupportedFormatException, RateLimitException, and
// InvalidFileException are application-defined exception types
try {
    String transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 400) {
        if (e.getMessage().contains("file size")) {
            throw new FileTooLargeException("Audio file exceeds 25MB limit", e);
        } else if (e.getMessage().contains("format")) {
            throw new UnsupportedFormatException("Unsupported audio format", e);
        }
    } else if (e.getResponse().getStatusCode() == 429) {
        throw new RateLimitException("Rate limit exceeded", e);
    }
    throw e;  // rethrow anything not mapped above
} catch (IllegalArgumentException e) {
    throw new InvalidFileException("Audio file not found or not readable", e);
}

Transcribe Audio (Full Response)

AudioTranscriptionResponse call(AudioTranscriptionPrompt prompt);

Transcribe audio and return a full response with metadata.

Parameters:

  • prompt: The transcription prompt containing audio and optional options (non-null, throws NullPointerException if null)

Returns: AudioTranscriptionResponse containing transcription and metadata (never null)

Throws:

  • Same exceptions as call(Resource) method

Example:

Resource audioFile = new FileSystemResource("podcast.mp3");

AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);

String text = response.getResult().getOutput();
System.out.println("Transcription: " + text);

Example with Options:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .temperature(0.0f)
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
        .build();

Resource audioFile = new FileSystemResource("lecture.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);

AzureOpenAiAudioTranscriptionOptions

Configuration class for audio transcription requests.

Construction

class AzureOpenAiAudioTranscriptionOptions implements AudioTranscriptionOptions {
    static Builder builder();
}

Constants

public static final String DEFAULT_AUDIO_TRANSCRIPTION_MODEL = "whisper";

The default model used for audio transcription.

Builder

class Builder {
    Builder model(String model);
    Builder deploymentName(String deploymentName);
    Builder language(String language);
    Builder prompt(String prompt);
    Builder responseFormat(TranscriptResponseFormat responseFormat);
    Builder temperature(Float temperature);
    Builder granularityType(List<GranularityType> granularityType);
    AzureOpenAiAudioTranscriptionOptions build();
}

Builder Methods:

  • All builder methods return this for fluent chaining (never null)
  • All parameters are optional (can be null)
  • build(): Returns non-null AzureOpenAiAudioTranscriptionOptions instance

Properties

Model / Deployment Name

String getModel();
void setModel(String model);
String getDeploymentName();
void setDeploymentName(String deploymentName);

Specifies which Whisper deployment to use.

Constraints:

  • Cannot be null or empty (throws IllegalArgumentException)
  • Must match an existing deployment in your Azure OpenAI resource
  • Default: "whisper"

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .deploymentName("whisper")
        .build();

Language

String getLanguage();
void setLanguage(String language);

The language of the audio in ISO-639-1 format. Providing the language improves accuracy and reduces latency.

Constraints:

  • Must be valid ISO-639-1 language code (2 characters)
  • Optional (model auto-detects if not specified)
  • Improves accuracy by 10-15% when specified correctly

Common Languages:

  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "it" - Italian
  • "pt" - Portuguese
  • "nl" - Dutch
  • "pl" - Polish
  • "ru" - Russian
  • "ja" - Japanese
  • "zh" - Chinese
  • "ko" - Korean
  • "ar" - Arabic
  • "hi" - Hindi

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .build();

Prompt

String getPrompt();
void setPrompt(String prompt);

Optional text to guide the model's style or continue a previous audio segment. Can include specific terminology or proper nouns.

Constraints:

  • Optional (can be null or empty)
  • Max length: approximately 224 tokens
  • Should be in same language as audio
  • Case-sensitive

Use Cases:

  • Provide context for technical terms
  • Specify proper nouns, names, or brands
  • Maintain consistency across multiple audio segments
  • Improve accuracy for domain-specific vocabulary

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .prompt("This is a technical discussion about Spring AI and Azure OpenAI.")
        .build();

Example - Technical Terms:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .prompt("Keywords: Kubernetes, microservices, Docker, CI/CD, Spring Boot, Maven")
        .language("en")
        .build();

Temperature

Float getTemperature();
void setTemperature(Float temperature);

Sampling temperature between 0 and 1. Lower values make output more focused and deterministic.

Constraints:

  • Range: 0.0 to 1.0 (throws IllegalArgumentException if out of range)
  • Default: 0.0 (most deterministic)
  • Type: Float (nullable)

Guidelines:

  • 0.0: Most deterministic, best for consistent transcriptions (recommended for most use cases)
  • 0.1-0.3: Slightly more variation, can help with difficult audio
  • 0.4-1.0: More random, rarely useful for transcription

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .temperature(0.0f)  // Most deterministic
        .build();

Response Format

TranscriptResponseFormat getResponseFormat();
void setResponseFormat(TranscriptResponseFormat responseFormat);

Format of the transcription output.

Constraints:

  • Default: JSON
  • granularityType only valid with VERBOSE_JSON format

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
        .build();

Granularity Type

List<GranularityType> getGranularityType();
void setGranularityType(List<GranularityType> granularityType);

Timestamp granularities for the transcription. Only applicable with VERBOSE_JSON format.

Constraints:

  • Only works with VERBOSE_JSON response format (ignored otherwise)
  • Can specify multiple granularity types
  • Optional (no timestamps if not specified)

Example:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
        .granularityType(List.of(
            AzureOpenAiAudioTranscriptionOptions.GranularityType.WORD,
            AzureOpenAiAudioTranscriptionOptions.GranularityType.SEGMENT
        ))
        .build();

Enums

WhisperModel

enum WhisperModel {
    WHISPER("whisper");

    String getValue();
}

TranscriptResponseFormat

enum TranscriptResponseFormat {
    JSON,
    TEXT,
    SRT,
    VERBOSE_JSON,
    VTT;

    AudioTranscriptionFormat getValue();
    Class<?> getResponseType();
}

Format Descriptions:

  • JSON: Simple JSON with text only - {"text": "transcription..."}
  • TEXT: Plain text string (no formatting)
  • SRT: SubRip subtitle format (timecoded subtitles)
  • VERBOSE_JSON: Detailed JSON with metadata, timestamps, and segments
  • VTT: WebVTT subtitle format (web-compatible timecoded subtitles)

Example:

// Plain text
TranscriptResponseFormat.TEXT

// JSON with text only
TranscriptResponseFormat.JSON

// Detailed JSON with timestamps
TranscriptResponseFormat.VERBOSE_JSON

// Subtitle formats
TranscriptResponseFormat.SRT
TranscriptResponseFormat.VTT

GranularityType

enum GranularityType {
    WORD,
    SEGMENT;

    AudioTranscriptionTimestampGranularity getValue();
}

Granularity Descriptions:

  • WORD: Word-level timestamps (precise timing for each word)
  • SEGMENT: Segment-level timestamps (timing for sentence/phrase segments)

Nested Types

StructuredResponse

record StructuredResponse(
    String language,
    Float duration,
    String text,
    List<Word> words,
    List<Segment> segments
) {}

Detailed response structure for VERBOSE_JSON format.

Fields:

  • language: Detected language code (ISO-639-1, e.g., "en")
  • duration: Audio duration in seconds (float)
  • text: Full transcription text (non-null, may be empty)
  • words: Word-level timestamps (null if not requested, empty list if no speech detected)
  • segments: Segment-level data (null if not requested, empty list if no speech detected)

Word

record Word(
    String word,
    Float start,
    Float end
) {}

Word-level timestamp information.

Fields:

  • word: The word text (non-null, includes punctuation)
  • start: Start time in seconds (float, >= 0.0)
  • end: End time in seconds (float, > start)

Example Usage:

for (Word word : structuredResponse.words()) {
    System.out.printf("%s [%.2f - %.2f]%n", 
        word.word(), word.start(), word.end());
}

Segment

record Segment(
    Integer id,
    Integer seek,
    Float start,
    Float end,
    String text,
    List<Integer> tokens,
    Float temperature,
    Float avgLogprob,
    Float compressionRatio,
    Float noSpeechProb
) {}

Segment-level detailed information.

Fields:

  • id: Segment identifier (sequential, starts at 0)
  • seek: Seek position in audio (internal Whisper parameter)
  • start: Start time in seconds (float, >= 0.0)
  • end: End time in seconds (float, > start)
  • text: Segment text (non-null, may be empty)
  • tokens: Token IDs (list of integers, never null)
  • temperature: Temperature used for this segment (0.0-1.0)
  • avgLogprob: Average log probability (negative float, higher = more confident)
  • compressionRatio: Compression ratio (positive float, higher = more compressed/repetitive)
  • noSpeechProb: No-speech probability (0.0-1.0, higher = likely silence/noise)

Quality Indicators:

  • avgLogprob > -0.5: High confidence
  • avgLogprob < -1.0: Low confidence, may need review
  • noSpeechProb > 0.8: Likely silence or noise, not actual speech
  • compressionRatio > 2.4: Possible repetition or hallucination
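
As a minimal sketch of how these thresholds might be applied (assuming segments is a List<Segment> obtained from parsed VERBOSE_JSON data), the following flags segments that may need manual review:

// Sketch only: flag segments worth reviewing using the thresholds above.
// Assumes `segments` is a List<Segment> from a parsed VERBOSE_JSON response.
List<Segment> needsReview = segments.stream()
    .filter(s -> s.avgLogprob() < -1.0f           // low confidence
              || s.noSpeechProb() > 0.8f          // likely silence or noise
              || s.compressionRatio() > 2.4f)     // possible repetition or hallucination
    .toList();

for (Segment s : needsReview) {
    System.out.printf("Review segment %d [%.1fs - %.1fs]: %s%n",
        s.id(), s.start(), s.end(), s.text());
}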

AzureOpenAiAudioTranscriptionResponseMetadata

Metadata for transcription responses.

class AzureOpenAiAudioTranscriptionResponseMetadata
        extends AudioTranscriptionResponseMetadata {

    static final AzureOpenAiAudioTranscriptionResponseMetadata NULL;

    static AzureOpenAiAudioTranscriptionResponseMetadata from(StructuredResponse result);
    static AzureOpenAiAudioTranscriptionResponseMetadata from(String result);
}

Static Members:

  • from(StructuredResponse): Create metadata from VERBOSE_JSON response (returns non-null)
  • from(String): Create metadata from text/JSON response (returns non-null)
  • NULL: Singleton instance representing no metadata

Usage Examples

Basic Transcription

OpenAIClient client = new OpenAIClientBuilder()
    .credential(new AzureKeyCredential(apiKey))
    .endpoint(endpoint)
    .buildClient();

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .deploymentName("whisper")
        .build();

AzureOpenAiAudioTranscriptionModel model =
    new AzureOpenAiAudioTranscriptionModel(client, options);

Resource audioFile = new FileSystemResource("meeting.mp3");
String transcription = model.call(audioFile);
System.out.println(transcription);

With Language Specification

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .deploymentName("whisper")
        .language("en")
        .build();

AzureOpenAiAudioTranscriptionModel model =
    new AzureOpenAiAudioTranscriptionModel(client, options);

Resource audioFile = new FileSystemResource("english-podcast.mp3");
String transcription = model.call(audioFile);

JSON Response Format

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.JSON)
        .build();

Resource audioFile = new FileSystemResource("audio.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);

String text = response.getResult().getOutput();

Verbose JSON with Timestamps

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
        .granularityType(List.of(
            AzureOpenAiAudioTranscriptionOptions.GranularityType.WORD,
            AzureOpenAiAudioTranscriptionOptions.GranularityType.SEGMENT
        ))
        .build();

Resource audioFile = new FileSystemResource("lecture.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);

// Access structured response
String fullText = response.getResult().getOutput();

// If you need the structured data, you'll need to parse the JSON payload yourself (see the sketch below)
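
One way to do that, shown only as a sketch: if the raw VERBOSE_JSON payload is available as a string (see the VERBOSE_JSON Format example below), it can be mapped with Jackson onto small local records. VerboseTranscription and VerboseWord are hypothetical helper types, not part of the library, and jackson-databind is assumed to be on the classpath:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper records matching the VERBOSE_JSON fields shown later in this page
@JsonIgnoreProperties(ignoreUnknown = true)
record VerboseWord(String word, float start, float end) {}

@JsonIgnoreProperties(ignoreUnknown = true)
record VerboseTranscription(String language, float duration, String text, List<VerboseWord> words) {}

// verboseJson holds the raw VERBOSE_JSON payload as a String
ObjectMapper mapper = new ObjectMapper();
VerboseTranscription parsed = mapper.readValue(verboseJson, VerboseTranscription.class);  // throws JsonProcessingException
System.out.println(parsed.language() + ", " + parsed.duration() + "s, " + parsed.words().size() + " words");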

SRT Subtitle Generation

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.SRT)
        .build();

Resource videoAudio = new FileSystemResource("video-audio.mp3");
String srtSubtitles = transcriptionModel.call(videoAudio);

// Save to .srt file
Files.writeString(Path.of("subtitles.srt"), srtSubtitles);

VTT Subtitle Generation

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VTT)
        .build();

Resource videoAudio = new FileSystemResource("video.mp3");
String vttSubtitles = transcriptionModel.call(videoAudio);

// Save to .vtt file
Files.writeString(Path.of("subtitles.vtt"), vttSubtitles);

With Context Prompt

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .prompt("This recording discusses Spring Framework, Azure OpenAI, and Kubernetes deployment.")
        .language("en")
        .build();

Resource audioFile = new FileSystemResource("technical-talk.mp3");
String transcription = transcriptionModel.call(audioFile);

Deterministic Transcription

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .temperature(0.0f)  // Most deterministic
        .language("en")
        .build();

Resource audioFile = new FileSystemResource("interview.mp3");
String transcription = transcriptionModel.call(audioFile);

Multi-Language Audio

// Spanish audio
AzureOpenAiAudioTranscriptionOptions spanishOptions =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("es")
        .build();

Resource spanishAudio = new FileSystemResource("spanish-podcast.mp3");
String spanishTranscription = transcriptionModel
    .call(new AudioTranscriptionPrompt(spanishAudio, spanishOptions))
    .getResult().getOutput();

// French audio
AzureOpenAiAudioTranscriptionOptions frenchOptions =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("fr")
        .build();

Resource frenchAudio = new FileSystemResource("french-interview.mp3");
String frenchTranscription = transcriptionModel
    .call(new AudioTranscriptionPrompt(frenchAudio, frenchOptions))
    .getResult().getOutput();

From Classpath Resource

// Load audio file from classpath
Resource audioFile = new ClassPathResource("audio/sample.mp3");
String transcription = transcriptionModel.call(audioFile);

Processing Multiple Files

List<String> audioFiles = List.of(
    "recording1.mp3",
    "recording2.mp3",
    "recording3.mp3"
);

for (String fileName : audioFiles) {
    Resource audioFile = new FileSystemResource(fileName);
    String transcription = transcriptionModel.call(audioFile);
    System.out.println("File: " + fileName);
    System.out.println("Transcription: " + transcription);
    System.out.println("---");
}

Accessing Response Metadata

AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);

String transcription = response.getResult().getOutput();

// Access metadata
if (response.getMetadata() instanceof AzureOpenAiAudioTranscriptionResponseMetadata metadata) {
    // Metadata available
}

Response Format Examples

TEXT Format

This is a sample transcription of the audio file. The model will return plain text without any additional metadata or formatting.

JSON Format

{
  "text": "This is a sample transcription of the audio file."
}

VERBOSE_JSON Format

{
  "language": "en",
  "duration": 12.5,
  "text": "This is a sample transcription.",
  "words": [
    {"word": "This", "start": 0.0, "end": 0.2},
    {"word": "is", "start": 0.2, "end": 0.3},
    {"word": "a", "start": 0.3, "end": 0.4}
  ],
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.0,
      "text": "This is a sample transcription.",
      "tokens": [123, 456, 789],
      "temperature": 0.0,
      "avg_logprob": -0.5,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    }
  ]
}

SRT Format

1
00:00:00,000 --> 00:00:05,000
This is a sample transcription.

2
00:00:05,000 --> 00:00:10,000
The model generates subtitles in SRT format.

VTT Format

WEBVTT

00:00:00.000 --> 00:00:05.000
This is a sample transcription.

00:00:05.000 --> 00:00:10.000
The model generates subtitles in WebVTT format.

Error Handling

Common Exceptions

// Azure SDK exceptions
com.azure.core.exception.HttpResponseException  // HTTP errors (400, 401, 403, 429, 500)
com.azure.core.exception.ResourceNotFoundException  // Deployment not found (404)

// Spring AI exceptions
org.springframework.ai.retry.NonTransientAiException  // Permanent failures
org.springframework.ai.retry.TransientAiException  // Temporary failures (retry-able)

// Java exceptions
java.lang.IllegalArgumentException  // Invalid file or parameters
java.lang.NullPointerException  // Null required parameters
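
A minimal sketch of how these exception families might be handled (retry only what is marked transient; the actual retry and logging strategy is application-specific):

try {
    String transcription = transcriptionModel.call(audioFile);
} catch (TransientAiException e) {
    // Temporary failure (rate limit, timeout) - safe to retry with backoff
    // (retry strategy left to the application)
} catch (NonTransientAiException e) {
    // Permanent failure (invalid format, file too large) - do not retry
    throw e;
} catch (HttpResponseException e) {
    // Raw Azure SDK error - inspect the status code before deciding how to react
    int status = e.getResponse().getStatusCode();
    throw e;
}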

Exception Scenarios

1. File Too Large (400):

String transcription;
try {
    transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 400 &&
        e.getMessage().contains("file size")) {
        // File exceeds 25MB - split into smaller chunks
        // (splitAudioFile is an application-provided helper)
        List<Resource> chunks = splitAudioFile(audioFile, 20 * 1024 * 1024);
        StringBuilder fullTranscription = new StringBuilder();
        for (Resource chunk : chunks) {
            fullTranscription.append(transcriptionModel.call(chunk)).append(" ");
        }
        transcription = fullTranscription.toString().trim();
    } else {
        throw e;
    }
}

2. Unsupported Format (400):

try {
    String transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 400 &&
        e.getMessage().contains("format")) {
        // UnsupportedFormatException is an application-defined exception type
        throw new UnsupportedFormatException(
            "Audio format not supported. Use: mp3, mp4, mpeg, mpga, m4a, wav, webm", e
        );
    }
    throw e;
}

3. Rate Limiting (429):

public String transcribeWithRetry(Resource audioFile) {
    int maxRetries = 3;
    int baseDelayMs = 1000;

    for (int attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return transcriptionModel.call(audioFile);
        } catch (HttpResponseException e) {
            if (e.getResponse().getStatusCode() == 429 && attempt < maxRetries - 1) {
                int delayMs = baseDelayMs * (1 << attempt);  // exponential backoff
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Interrupted while waiting to retry", ie);
                }
                continue;
            }
            throw e;
        }
    }
    throw new RuntimeException("Max retries exceeded");
}

4. File Not Found:

Resource audioFile = new FileSystemResource("nonexistent.mp3");
try {
    String transcription = transcriptionModel.call(audioFile);
} catch (IllegalArgumentException e) {
    // InvalidFileException is an application-defined exception type
    throw new InvalidFileException("Audio file not found: " + audioFile, e);
}

Validation Rules

Parameter Constraints Summary

Deployment Name:

  • Required: Yes (throws NullPointerException if null)
  • Default: "whisper"
  • Format: Non-empty string

Language:

  • Format: ISO-639-1 code (2 characters)
  • Optional (auto-detected if not specified)
  • Case-insensitive
  • Type: String (nullable)

Prompt:

  • Max length: ~224 tokens
  • Optional
  • Should be in same language as audio
  • Type: String (nullable)

Temperature:

  • Range: 0.0 to 1.0
  • Default: 0.0
  • Type: Float (nullable)

Response Format:

  • Values: TEXT, JSON, SRT, VTT, VERBOSE_JSON
  • Default: JSON
  • Type: TranscriptResponseFormat enum

Granularity Type:

  • Values: WORD, SEGMENT (can specify both)
  • Only valid with VERBOSE_JSON format
  • Type: List<GranularityType> (nullable)

Audio File:

  • Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
  • Max size: 25 MB
  • Must exist and be readable
  • Must be valid audio file
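
A sketch of a client-side pre-flight check against these constraints, using only the Spring Resource API (the extension set and the 25 MB limit are taken from this list):

// Sketch: validate an audio Resource locally before sending it to the API
void validateAudioResource(Resource audio) throws IOException {
    Set<String> supportedExtensions = Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");
    long maxSizeBytes = 25L * 1024 * 1024;

    if (!audio.exists() || !audio.isReadable()) {
        throw new IllegalArgumentException("Audio file does not exist or is not readable: " + audio);
    }
    String filename = audio.getFilename();
    String extension = (filename != null && filename.contains("."))
        ? filename.substring(filename.lastIndexOf('.') + 1).toLowerCase()
        : "";
    if (!supportedExtensions.contains(extension)) {
        throw new IllegalArgumentException("Unsupported audio format: " + extension);
    }
    if (audio.contentLength() > maxSizeBytes) {
        throw new IllegalArgumentException("Audio file exceeds the 25 MB limit");
    }
}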

Best Practices

Language Specification

Always specify the language when known for better accuracy:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")  // Always specify if known
        .build();

Benefits:

  • 10-15% accuracy improvement
  • Faster processing
  • Better handling of accents and dialects

Context Prompts

Use prompts to improve accuracy for technical terms:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .prompt("Technical terms: Kubernetes, microservices, Docker, CI/CD, Spring Boot")
        .build();

Effective Prompt Strategies:

  • List technical terms, acronyms, product names
  • Include proper nouns (people names, company names)
  • Use same language as audio
  • Keep under 224 tokens

Temperature Control

Use low temperature for consistent results:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .temperature(0.0f)  // Deterministic
        .build();

When to Adjust:

  • 0.0: Standard use case (recommended)
  • 0.1-0.2: Difficult audio with heavy accents
  • Higher values rarely useful

File Size Limits

The maximum file size is 25 MB. For larger files, split them into chunks:

public List<String> transcribeLargeFile(File audioFile) {
    if (audioFile.length() > 25 * 1024 * 1024) {
        // Split into 20MB chunks with overlap (splitAudioFile is an application-provided helper)
        List<Resource> chunks = splitAudioFile(audioFile, 20 * 1024 * 1024);
        List<String> transcriptions = new ArrayList<>();

        String previousPrompt = null;
        for (Resource chunk : chunks) {
            AzureOpenAiAudioTranscriptionOptions options =
                AzureOpenAiAudioTranscriptionOptions.builder()
                    .prompt(previousPrompt)  // Use previous text for continuity
                    .build();

            String transcription = transcriptionModel
                .call(new AudioTranscriptionPrompt(chunk, options))
                .getResult().getOutput();
            transcriptions.add(transcription);

            // Use last sentence as prompt for next chunk (getLastSentence is an application-provided helper)
            previousPrompt = getLastSentence(transcription);
        }
        return transcriptions;
    } else {
        return List.of(transcriptionModel.call(new FileSystemResource(audioFile)));
    }
}

Audio Quality Optimization

Before Transcription:

  1. Normalize audio levels
  2. Remove background noise if possible
  3. Use lossless or high-bitrate formats (wav, high-quality mp3)
  4. Ensure single speaker or clear speaker separation

Handling Poor Quality Audio:

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")  // Helps with poor audio
        .temperature(0.2f)  // Slightly higher for difficult audio
        .prompt("Context about the audio topic")  // Provides guidance
        .build();

Performance Considerations

Model Instance Reuse

Recommended:

// Create once at application startup
@Bean
public AzureOpenAiAudioTranscriptionModel transcriptionModel() {
    return new AzureOpenAiAudioTranscriptionModel(client, options);
}

// Inject and reuse
@Autowired
private AzureOpenAiAudioTranscriptionModel transcriptionModel;

Avoid:

// Don't create new instance per request
for (Resource audio : audioFiles) {
    AzureOpenAiAudioTranscriptionModel model = new AzureOpenAiAudioTranscriptionModel(...);
    model.call(audio);  // Inefficient
}

Parallel Processing

ExecutorService executor = Executors.newFixedThreadPool(5);
List<CompletableFuture<String>> futures = new ArrayList<>();

for (Resource audioFile : audioFiles) {
    CompletableFuture<String> future = CompletableFuture.supplyAsync(
        () -> transcriptionModel.call(audioFile),
        executor
    );
    futures.add(future);
}

// Wait for all transcriptions
List<String> transcriptions = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());

executor.shutdown();

Processing Time Estimates

Approximate processing times (varies by audio quality and length):

  • Real-time factor: ~0.1x to 0.3x (e.g., 10 minutes audio = 1-3 minutes processing)
  • Factors affecting speed:
    • Audio quality (higher quality = faster)
    • Audio duration
    • Response format (VERBOSE_JSON slower than TEXT)
    • Specified language (specified = faster than auto-detect)
    • Network latency to Azure region
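
To measure this for a specific workload, a simple sketch that times one call and computes the real-time factor (the audio duration is assumed to be known, for example from the duration field of a VERBOSE_JSON response):

// Sketch: measure wall-clock processing time and compute the real-time factor
long startNanos = System.nanoTime();
String transcription = transcriptionModel.call(audioFile);
double elapsedSeconds = (System.nanoTime() - startNanos) / 1_000_000_000.0;

double audioDurationSeconds = 600.0;  // assumption: 10 minutes of audio
double realTimeFactor = elapsedSeconds / audioDurationSeconds;
System.out.printf("Processed %.0fs of audio in %.1fs (real-time factor %.2f)%n",
    audioDurationSeconds, elapsedSeconds, realTimeFactor);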

Supported Audio Formats

Format   Extension       Quality           Recommended For
mp3      .mp3            Good              General purpose, good balance of size/quality
mp4      .mp4, .m4a      Good-Excellent    Mobile recordings, voice memos
wav      .wav            Excellent         Best quality, larger files
webm     .webm           Good              Web recordings, browser-based capture
mpeg     .mpeg, .mpga    Good              Legacy audio files

Recommendations:

  • Best quality: wav (uncompressed)
  • Best balance: mp3 at 128kbps or higher
  • Mobile: m4a (native iOS/Android format)
  • Web: webm (native browser format)

Common Use Cases

Meeting Transcription

Resource meetingRecording = new FileSystemResource("team-meeting.mp3");

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .temperature(0.0f)
        .responseFormat(TranscriptResponseFormat.TEXT)
        .build();

String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(meetingRecording, options))
    .getResult().getOutput();
// Process transcription for meeting notes

Podcast Transcription

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(TranscriptResponseFormat.TEXT)
        .language("en")
        .build();

Resource podcast = new FileSystemResource("episode-42.mp3");
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(podcast, options))
    .getResult().getOutput();
// Publish as blog post or show notes

Video Subtitle Generation

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(TranscriptResponseFormat.SRT)
        .language("en")
        .build();

Resource videoAudio = extractAudioFromVideo("video.mp4");  // application-provided helper
String subtitles = transcriptionModel
    .call(new AudioTranscriptionPrompt(videoAudio, options))
    .getResult().getOutput();
Files.writeString(Path.of("video.srt"), subtitles);

Voice Note Transcription

Resource voiceNote = new FileSystemResource("voice-memo.m4a");

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .temperature(0.0f)
        .build();

String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(voiceNote, options))
    .getResult().getOutput();
// Save to notes app or database

Call Center Recording Analysis

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
        .granularityType(List.of(GranularityType.SEGMENT))
        .language("en")
        .prompt("Customer service call with technical support terms")
        .build();

Resource callRecording = new FileSystemResource("call-12345.wav");
AudioTranscriptionResponse response = transcriptionModel.call(
    new AudioTranscriptionPrompt(callRecording, options)
);

// Parse segments for speaker diarization or sentiment analysis

Troubleshooting

Issue: Poor transcription accuracy

Symptoms: Incorrect words, missing content, gibberish

Solutions:

  1. Specify language explicitly
  2. Use context prompt with technical terms
  3. Improve audio quality (remove noise, normalize levels)
  4. Ensure audio is audible (not too quiet)
  5. Use temperature 0.0 for consistency
  6. Check noSpeechProb in segments (high values = poor audio)

Issue: Missing or incomplete transcription

Symptoms: Transcription shorter than expected

Solutions:

  1. Check noSpeechProb in segments (may be detecting silence)
  2. Verify audio file integrity
  3. Increase audio volume/gain
  4. Check for long silences (may be truncating)

Issue: Hallucinations (non-existent content)

Symptoms: Transcription includes content not in audio

Solutions:

  1. Use temperature 0.0
  2. Check compressionRatio (> 2.4 indicates possible hallucination)
  3. Remove background music if present
  4. Ensure audio contains actual speech
  5. Use context prompt to guide topic

Issue: Slow processing

Symptoms: Transcription takes longer than expected

Solutions:

  1. Specify language (skip auto-detection)
  2. Use simpler response format (TEXT instead of VERBOSE_JSON)
  3. Reduce file size/duration
  4. Check network latency to Azure region
  5. Process files in parallel (up to 5-10 concurrent)

Issue: Incorrect language detection

Symptoms: Wrong language transcribed

Solution: Always specify language explicitly

AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")  // Don't rely on auto-detection
        .build();

Default Values

  • Model: "whisper"
  • Response Format: JSON
  • Temperature: 0.0 (Whisper's default)
  • Language: Auto-detected if not specified
  • Prompt: null (no context)
  • Granularity Type: null (no timestamps)