Spring AI integration for Azure OpenAI services providing chat completion, text embeddings, image generation, and audio transcription with GPT, DALL-E, and Whisper models
The audio transcription API converts audio files to text using Azure OpenAI's Whisper model. It supports multiple output formats, including plain text, JSON, and SRT/VTT subtitles, with optional word- and segment-level timestamps.
import org.springframework.ai.azure.openai.AzureOpenAiAudioTranscriptionModel;
import org.springframework.ai.azure.openai.AzureOpenAiAudioTranscriptionOptions;
import org.springframework.ai.azure.openai.metadata.AzureOpenAiAudioTranscriptionResponseMetadata;
import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.core.io.Resource;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.ClassPathResource;
import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.core.credential.AzureKeyCredential;
import com.azure.ai.openai.models.AudioTranscriptionFormat;
import com.azure.ai.openai.models.AudioTranscriptionTimestampGranularity;
The main class for audio transcription operations.
Thread-Safe: AzureOpenAiAudioTranscriptionModel is fully thread-safe and can be safely used across multiple threads concurrently. A single instance can handle multiple concurrent transcription requests.
Recommendation: Create one instance and reuse it across your application rather than creating new instances for each request.
class AzureOpenAiAudioTranscriptionModel implements TranscriptionModel {
AzureOpenAiAudioTranscriptionModel(
OpenAIClient openAIClient,
AzureOpenAiAudioTranscriptionOptions options
);
}
Parameters:
- openAIClient: Azure OpenAI client instance (required, non-null; throws NullPointerException if null)
- options: Audio transcription options (required, non-null; throws NullPointerException if null)
Example:
OpenAIClient openAIClient = new OpenAIClientBuilder()
.credential(new AzureKeyCredential(apiKey))
.endpoint(endpoint)
.buildClient();
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.deploymentName("whisper")
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.JSON)
.build();
AzureOpenAiAudioTranscriptionModel transcriptionModel =
    new AzureOpenAiAudioTranscriptionModel(openAIClient, options);
String call(Resource audioResource);
Transcribe an audio file and return the text as a string.
Parameters:
- audioResource: The audio file resource (non-null; throws NullPointerException if null)
Returns: String containing the transcription text (never null, may be empty for silent audio)
Throws:
- HttpResponseException: HTTP errors from the Azure API (400, 401, 403, 429, 500)
- ResourceNotFoundException: Deployment not found (404)
- NonTransientAiException: Permanent failures (invalid file format, file too large)
- TransientAiException: Temporary failures (rate limits, timeouts)
- NullPointerException: If audioResource is null
- IllegalArgumentException: If the file doesn't exist or is not readable
Supported Audio Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
File Size Limit: 25 MB maximum
Example:
Resource audioFile = new FileSystemResource("interview.mp3");
String transcription = transcriptionModel.call(audioFile);
System.out.println("Transcription: " + transcription);Example with Different File Types:
// MP3 file
Resource mp3File = new FileSystemResource("audio.mp3");
String mp3Transcription = transcriptionModel.call(mp3File);
// WAV file
Resource wavFile = new FileSystemResource("recording.wav");
String wavTranscription = transcriptionModel.call(wavFile);
// M4A file
Resource m4aFile = new FileSystemResource("voice-memo.m4a");
String m4aTranscription = transcriptionModel.call(m4aFile);
Error Handling:
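Note: FileTooLargeException, UnsupportedFormatException, RateLimitException, and InvalidFileException in the error-handling snippets below are application-defined exception types, not classes shipped with Spring AI or the Azure SDK.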
try {
String transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
if (e.getResponse().getStatusCode() == 400) {
if (e.getMessage().contains("file size")) {
throw new FileTooLargeException("Audio file exceeds 25MB limit", e);
} else if (e.getMessage().contains("format")) {
throw new UnsupportedFormatException("Unsupported audio format", e);
}
} else if (e.getResponse().getStatusCode() == 429) {
throw new RateLimitException("Rate limit exceeded", e);
}
} catch (IllegalArgumentException e) {
throw new InvalidFileException("Audio file not found or not readable", e);
}
AudioTranscriptionResponse call(AudioTranscriptionPrompt prompt);
Transcribe audio and return a full response with metadata.
Parameters:
- prompt: The transcription prompt containing the audio and optional options (non-null; throws NullPointerException if null)
Returns: AudioTranscriptionResponse containing the transcription and metadata (never null)
Throws:
Same as the call(Resource) method.
Example:
Resource audioFile = new FileSystemResource("podcast.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
String text = response.getResult().getOutput();
System.out.println("Transcription: " + text);Example with Options:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.language("en")
.temperature(0.0f)
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.build();
Resource audioFile = new FileSystemResource("lecture.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
Configuration class for audio transcription requests.
class AzureOpenAiAudioTranscriptionOptions implements AudioTranscriptionOptions {
static Builder builder();
}
public static final String DEFAULT_AUDIO_TRANSCRIPTION_MODEL = "whisper";
The default model used for audio transcription.
class Builder {
Builder model(String model);
Builder deploymentName(String deploymentName);
Builder language(String language);
Builder prompt(String prompt);
Builder responseFormat(TranscriptResponseFormat responseFormat);
Builder temperature(Float temperature);
Builder granularityType(List<GranularityType> granularityType);
AzureOpenAiAudioTranscriptionOptions build();
}
Builder Methods:
- Each builder method returns this for fluent chaining (never null)
- build(): Returns a non-null AzureOpenAiAudioTranscriptionOptions instance
String getModel();
void setModel(String model);
String getDeploymentName();
void setDeploymentName(String deploymentName);
Specifies which Whisper deployment to use.
Constraints:
Example:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.deploymentName("whisper")
    .build();
String getLanguage();
void setLanguage(String language);
The language of the audio in ISO-639-1 format. Providing the language improves accuracy and latency.
Constraints:
Common Languages:
"en" - English"es" - Spanish"fr" - French"de" - German"it" - Italian"pt" - Portuguese"nl" - Dutch"pl" - Polish"ru" - Russian"ja" - Japanese"zh" - Chinese"ko" - Korean"ar" - Arabic"hi" - HindiExample:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.language("en")
    .build();
String getPrompt();
void setPrompt(String prompt);
Optional text to guide the model's style or continue a previous audio segment. Can include specific terminology or proper nouns.
Constraints:
Use Cases:
Example:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.prompt("This is a technical discussion about Spring AI and Azure OpenAI.")
    .build();
Example - Technical Terms:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.prompt("Keywords: Kubernetes, microservices, Docker, CI/CD, Spring Boot, Maven")
.language("en")
    .build();
Float getTemperature();
void setTemperature(Float temperature);
Sampling temperature between 0 and 1. Lower values make output more focused and deterministic.
Constraints:
Guidelines:
Example:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.temperature(0.0f) // Most deterministic
    .build();
TranscriptResponseFormat getResponseFormat();
void setResponseFormat(TranscriptResponseFormat responseFormat);
Format of the transcription output.
Constraints:
- granularityType is only valid with the VERBOSE_JSON format
Example:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.build();List<GranularityType> getGranularityType();
void setGranularityType(List<GranularityType> granularityType);Timestamp granularities for the transcription. Only applicable with VERBOSE_JSON format.
Constraints:
Example:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.granularityType(List.of(
AzureOpenAiAudioTranscriptionOptions.GranularityType.WORD,
AzureOpenAiAudioTranscriptionOptions.GranularityType.SEGMENT
))
    .build();
enum WhisperModel {
WHISPER("whisper");
String getValue();
}
enum TranscriptResponseFormat {
JSON,
TEXT,
SRT,
VERBOSE_JSON,
VTT;
AudioTranscriptionFormat getValue();
Class<?> getResponseType();
}
Format Descriptions:
- JSON: Simple JSON with text only - {"text": "transcription..."}
- TEXT: Plain text string (no formatting)
- SRT: SubRip subtitle format (timecoded subtitles)
- VERBOSE_JSON: Detailed JSON with metadata, timestamps, and segments
- VTT: WebVTT subtitle format (web-compatible timecoded subtitles)
Example:
// Plain text
TranscriptResponseFormat.TEXT
// JSON with text only
TranscriptResponseFormat.JSON
// Detailed JSON with timestamps
TranscriptResponseFormat.VERBOSE_JSON
// Subtitle formats
TranscriptResponseFormat.SRT
TranscriptResponseFormat.VTT
enum GranularityType {
WORD,
SEGMENT;
AudioTranscriptionTimestampGranularity getValue();
}
- WORD: Word-level timestamps (precise timing for each word)
- SEGMENT: Segment-level timestamps (timing for sentence/phrase segments)
record StructuredResponse(
String language,
Float duration,
String text,
List<Word> words,
List<Segment> segments
) {}
Detailed response structure for the VERBOSE_JSON format.
Fields:
- language: Detected language code (ISO-639-1, e.g., "en")
- duration: Audio duration in seconds (float)
- text: Full transcription text (non-null, may be empty)
- words: Word-level timestamps (null if not requested, empty list if no speech detected)
- segments: Segment-level data (null if not requested, empty list if no speech detected)
record Word(
String word,
Float start,
Float end
) {}
Word-level timestamp information.
Fields:
- word: The word text (non-null, includes punctuation)
- start: Start time in seconds (float, >= 0.0)
- end: End time in seconds (float, > start)
Example Usage:
for (Word word : structuredResponse.words()) {
System.out.printf("%s [%.2f - %.2f]%n",
word.word(), word.start(), word.end());
}
record Segment(
Integer id,
Integer seek,
Float start,
Float end,
String text,
List<Integer> tokens,
Float temperature,
Float avgLogprob,
Float compressionRatio,
Float noSpeechProb
) {}
Segment-level detailed information.
Fields:
- id: Segment identifier (sequential, starts at 0)
- seek: Seek position in audio (internal Whisper parameter)
- start: Start time in seconds (float, >= 0.0)
- end: End time in seconds (float, > start)
- text: Segment text (non-null, may be empty)
- tokens: Token IDs (list of integers, never null)
- temperature: Temperature used for this segment (0.0-1.0)
- avgLogprob: Average log probability (negative float, higher = more confident)
- compressionRatio: Compression ratio (positive float, higher = more compressed/repetitive)
- noSpeechProb: No-speech probability (0.0-1.0, higher = likely silence/noise)
Quality Indicators:
- avgLogprob > -0.5: High confidence
- avgLogprob < -1.0: Low confidence, may need review
- noSpeechProb > 0.8: Likely silence or noise, not actual speech
- compressionRatio > 2.4: Possible repetition or hallucination
Metadata for transcription responses.
class AzureOpenAiAudioTranscriptionResponseMetadata
extends AudioTranscriptionResponseMetadata {
static final AzureOpenAiAudioTranscriptionResponseMetadata NULL;
static AzureOpenAiAudioTranscriptionResponseMetadata from(StructuredResponse result);
static AzureOpenAiAudioTranscriptionResponseMetadata from(String result);
}
Static Methods:
- from(StructuredResponse): Create metadata from a VERBOSE_JSON response (returns non-null)
- from(String): Create metadata from a text/JSON response (returns non-null)
- NULL: Singleton instance representing no metadata
OpenAIClient client = new OpenAIClientBuilder()
.credential(new AzureKeyCredential(apiKey))
.endpoint(endpoint)
.buildClient();
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.deploymentName("whisper")
.build();
AzureOpenAiAudioTranscriptionModel model =
new AzureOpenAiAudioTranscriptionModel(client, options);
Resource audioFile = new FileSystemResource("meeting.mp3");
String transcription = model.call(audioFile);
System.out.println(transcription);
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.deploymentName("whisper")
.language("en")
.build();
AzureOpenAiAudioTranscriptionModel model =
new AzureOpenAiAudioTranscriptionModel(client, options);
Resource audioFile = new FileSystemResource("english-podcast.mp3");
String transcription = model.call(audioFile);
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.JSON)
.build();
Resource audioFile = new FileSystemResource("audio.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
String text = response.getResult().getOutput();
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.granularityType(List.of(
AzureOpenAiAudioTranscriptionOptions.GranularityType.WORD,
AzureOpenAiAudioTranscriptionOptions.GranularityType.SEGMENT
))
.build();
Resource audioFile = new FileSystemResource("lecture.mp3");
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
// Access structured response
String fullText = response.getResult().getOutput();
// If you need to parse the structured data, you'll need to handle the JSON response
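One possible sketch of that parsing step, assuming Jackson's ObjectMapper is on the classpath and verboseJson holds the raw VERBOSE_JSON string returned by the API:
// Minimal parsing sketch using com.fasterxml.jackson.databind.ObjectMapper / JsonNode.
// readTree throws JsonProcessingException; handle or propagate it in real code.
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(verboseJson);
String language = root.path("language").asText();
double duration = root.path("duration").asDouble();
for (JsonNode segment : root.path("segments")) {
    System.out.printf("[%.1fs - %.1fs] %s%n",
            segment.path("start").asDouble(),
            segment.path("end").asDouble(),
            segment.path("text").asText());
}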
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.SRT)
        .build();
Resource videoAudio = new FileSystemResource("video-audio.mp3");
String srtSubtitles = transcriptionModel
    .call(new AudioTranscriptionPrompt(videoAudio, options))
    .getResult().getOutput();
// Save to .srt file
Files.writeString(Path.of("subtitles.srt"), srtSubtitles);
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(AzureOpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VTT)
        .build();
Resource videoAudio = new FileSystemResource("video.mp3");
String vttSubtitles = transcriptionModel
    .call(new AudioTranscriptionPrompt(videoAudio, options))
    .getResult().getOutput();
// Save to .vtt file
Files.writeString(Path.of("subtitles.vtt"), vttSubtitles);
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .prompt("This recording discusses Spring Framework, Azure OpenAI, and Kubernetes deployment.")
        .language("en")
        .build();
Resource audioFile = new FileSystemResource("technical-talk.mp3");
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(audioFile, options))
    .getResult().getOutput();
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .temperature(0.0f) // Most deterministic
        .language("en")
        .build();
Resource audioFile = new FileSystemResource("interview.mp3");
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(audioFile, options))
    .getResult().getOutput();
// Spanish audio
AzureOpenAiAudioTranscriptionOptions spanishOptions =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("es")
        .build();
Resource spanishAudio = new FileSystemResource("spanish-podcast.mp3");
String spanishTranscription = transcriptionModel
    .call(new AudioTranscriptionPrompt(spanishAudio, spanishOptions))
    .getResult().getOutput();
// French audio
AzureOpenAiAudioTranscriptionOptions frenchOptions =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("fr")
        .build();
Resource frenchAudio = new FileSystemResource("french-interview.mp3");
String frenchTranscription = transcriptionModel
    .call(new AudioTranscriptionPrompt(frenchAudio, frenchOptions))
    .getResult().getOutput();
// Load audio file from classpath
Resource audioFile = new ClassPathResource("audio/sample.mp3");
String transcription = transcriptionModel.call(audioFile);List<String> audioFiles = List.of(
"recording1.mp3",
"recording2.mp3",
"recording3.mp3"
);
for (String fileName : audioFiles) {
Resource audioFile = new FileSystemResource(fileName);
String transcription = transcriptionModel.call(audioFile);
System.out.println("File: " + fileName);
System.out.println("Transcription: " + transcription);
System.out.println("---");
}
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
AudioTranscriptionResponse response = transcriptionModel.call(prompt);
String transcription = response.getResult().getOutput();
// Access metadata
if (response.getMetadata() instanceof AzureOpenAiAudioTranscriptionResponseMetadata metadata) {
// Metadata available
}
This is a sample transcription of the audio file. The model will return plain text without any additional metadata or formatting.
{
"text": "This is a sample transcription of the audio file."
}
{
"language": "en",
"duration": 12.5,
"text": "This is a sample transcription.",
"words": [
{"word": "This", "start": 0.0, "end": 0.2},
{"word": "is", "start": 0.2, "end": 0.3},
{"word": "a", "start": 0.3, "end": 0.4}
],
"segments": [
{
"id": 0,
"start": 0.0,
"end": 5.0,
"text": "This is a sample transcription.",
"tokens": [123, 456, 789],
"temperature": 0.0,
"avg_logprob": -0.5,
"compression_ratio": 1.2,
"no_speech_prob": 0.01
}
]
}
1
00:00:00,000 --> 00:00:05,000
This is a sample transcription.
2
00:00:05,000 --> 00:00:10,000
The model generates subtitles in SRT format.
WEBVTT
00:00:00.000 --> 00:00:05.000
This is a sample transcription.
00:00:05.000 --> 00:00:10.000
The model generates subtitles in WebVTT format.
// Azure SDK exceptions
com.azure.core.exception.HttpResponseException // HTTP errors (400, 401, 403, 429, 500)
com.azure.core.exception.ResourceNotFoundException // Deployment not found (404)
// Spring AI exceptions
org.springframework.ai.retry.NonTransientAiException // Permanent failures
org.springframework.ai.retry.TransientAiException // Temporary failures (retry-able)
// Java exceptions
java.lang.IllegalArgumentException // Invalid file or parameters
java.lang.NullPointerException // Null required parameters
1. File Too Large (400):
try {
transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
if (e.getResponse().getStatusCode() == 400 &&
e.getMessage().contains("file size")) {
// File exceeds 25MB - split into smaller chunks
List<Resource> chunks = splitAudioFile(audioFile, 20 * 1024 * 1024);
StringBuilder fullTranscription = new StringBuilder();
for (Resource chunk : chunks) {
fullTranscription.append(transcriptionModel.call(chunk)).append(" ");
}
}
}
2. Unsupported Format (400):
try {
transcription = transcriptionModel.call(audioFile);
} catch (HttpResponseException e) {
if (e.getResponse().getStatusCode() == 400 &&
e.getMessage().contains("format")) {
throw new UnsupportedFormatException(
"Audio format not supported. Use: mp3, mp4, mpeg, mpga, m4a, wav, webm", e
);
}
}
3. Rate Limiting (429):
public String transcribeWithRetry(Resource audioFile) {
    int maxRetries = 3;
    int baseDelayMs = 1000;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return transcriptionModel.call(audioFile);
        } catch (HttpResponseException e) {
            if (e.getResponse().getStatusCode() == 429 && attempt < maxRetries - 1) {
                int delayMs = baseDelayMs * (1 << attempt); // exponential backoff
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Interrupted while waiting to retry", ie);
                }
                continue;
            }
            throw e;
        }
    }
    throw new RuntimeException("Max retries exceeded");
}
4. File Not Found:
Resource audioFile = new FileSystemResource("nonexistent.mp3");
try {
    String transcription = transcriptionModel.call(audioFile);
} catch (IllegalArgumentException e) {
    throw new InvalidFileException("Audio file not found: " + audioFile, e);
}
Deployment Name:
Language:
Prompt:
Temperature:
Response Format:
Granularity Type:
Audio File:
Always specify the language when known for better accuracy:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.language("en") // Always specify if known
    .build();
Benefits:
Use prompts to improve accuracy for technical terms:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.prompt("Technical terms: Kubernetes, microservices, Docker, CI/CD, Spring Boot")
    .build();
Effective Prompt Strategies:
Use low temperature for consistent results:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.temperature(0.0f) // Deterministic
    .build();
When to Adjust:
The maximum file size is 25 MB. For larger files, split them into chunks:
public List<String> transcribeLargeFile(File audioFile) {
if (audioFile.length() > 25 * 1024 * 1024) {
// Split into 20MB chunks with overlap
List<Resource> chunks = splitAudioFile(audioFile, 20 * 1024 * 1024);
List<String> transcriptions = new ArrayList<>();
String previousPrompt = null;
for (Resource chunk : chunks) {
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.prompt(previousPrompt) // Use previous text for continuity
.build();
String transcription = transcriptionModel.call(chunk);
transcriptions.add(transcription);
// Use last sentence as prompt for next chunk
previousPrompt = getLastSentence(transcription);
}
return transcriptions;
} else {
return List.of(transcriptionModel.call(new FileSystemResource(audioFile)));
}
}Before Transcription:
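A minimal pre-flight validation sketch based on the constraints documented above (the file must exist, be readable, and stay under the 25 MB limit); validateAudioFile is an illustrative helper, not a library method:
void validateAudioFile(File file) {
    // Checks derived from the documented limits; adjust to your own needs.
    if (!file.exists() || !file.canRead()) {
        throw new IllegalArgumentException("Audio file not found or not readable: " + file);
    }
    if (file.length() > 25 * 1024 * 1024) {
        throw new IllegalArgumentException("Audio file exceeds the 25 MB limit: " + file);
    }
}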
Handling Poor Quality Audio:
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.language("en") // Helps with poor audio
.temperature(0.2f) // Slightly higher for difficult audio
.prompt("Context about the audio topic") // Provides guidance
    .build();
Recommended:
// Create once at application startup
@Bean
public AzureOpenAiAudioTranscriptionModel transcriptionModel() {
return new AzureOpenAiAudioTranscriptionModel(client, options);
}
// Inject and reuse
@Autowired
private AzureOpenAiAudioTranscriptionModel transcriptionModel;
Avoid:
// Don't create new instance per request
for (Resource audio : audioFiles) {
AzureOpenAiAudioTranscriptionModel model = new AzureOpenAiAudioTranscriptionModel(...);
model.call(audio); // Inefficient
}
ExecutorService executor = Executors.newFixedThreadPool(5);
List<CompletableFuture<String>> futures = new ArrayList<>();
for (Resource audioFile : audioFiles) {
CompletableFuture<String> future = CompletableFuture.supplyAsync(
() -> transcriptionModel.call(audioFile),
executor
);
futures.add(future);
}
// Wait for all transcriptions
List<String> transcriptions = futures.stream()
.map(CompletableFuture::join)
    .collect(Collectors.toList());
executor.shutdown();
Approximate processing times (varies by audio quality and length):
| Format | Extension | Quality | Recommended For |
|---|---|---|---|
| mp3 | .mp3 | Good | General purpose, good balance of size/quality |
| mp4 | .mp4, .m4a | Good-Excellent | Mobile recordings, voice memos |
| wav | .wav | Excellent | Best quality, larger files |
| webm | .webm | Good | Web recordings, browser-based capture |
| mpeg | .mpeg, .mpga | Good | Legacy audio files |
Recommendations:
Resource meetingRecording = new FileSystemResource("team-meeting.mp3");
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .temperature(0.0f)
        .responseFormat(TranscriptResponseFormat.TEXT)
        .build();
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(meetingRecording, options))
    .getResult().getOutput();
// Process transcription for meeting notes
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(TranscriptResponseFormat.TEXT)
        .language("en")
        .build();
Resource podcast = new FileSystemResource("episode-42.mp3");
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(podcast, options))
    .getResult().getOutput();
// Publish as blog post or show notes
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .responseFormat(TranscriptResponseFormat.SRT)
        .language("en")
        .build();
Resource videoAudio = extractAudioFromVideo("video.mp4"); // application-specific helper
String subtitles = transcriptionModel
    .call(new AudioTranscriptionPrompt(videoAudio, options))
    .getResult().getOutput();
Files.writeString(Path.of("video.srt"), subtitles);
Resource voiceNote = new FileSystemResource("voice-memo.m4a");
AzureOpenAiAudioTranscriptionOptions options =
    AzureOpenAiAudioTranscriptionOptions.builder()
        .language("en")
        .temperature(0.0f)
        .build();
String transcription = transcriptionModel
    .call(new AudioTranscriptionPrompt(voiceNote, options))
    .getResult().getOutput();
// Save to notes app or database
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
.granularityType(List.of(GranularityType.SEGMENT))
.language("en")
.prompt("Customer service call with technical support terms")
.build();
Resource callRecording = new FileSystemResource("call-12345.wav");
AudioTranscriptionResponse response = transcriptionModel.call(
new AudioTranscriptionPrompt(callRecording, options)
);
// Parse segments for speaker diarization or sentiment analysis
Symptoms: Incorrect words, missing content, gibberish
Solutions:
- Check noSpeechProb in segments (high values = poor audio)
Symptoms: Transcription shorter than expected
Solutions:
- Check noSpeechProb in segments (may be detecting silence)
Symptoms: Transcription includes content not in audio
Solutions:
compressionRatio (> 2.4 indicates possible hallucination)Symptoms: Transcription takes longer than expected
Solutions:
Symptoms: Wrong language transcribed
Solution: Always specify language explicitly
AzureOpenAiAudioTranscriptionOptions options =
AzureOpenAiAudioTranscriptionOptions.builder()
.language("en") // Don't rely on auto-detection
    .build();
tessl i tessl/maven-org-springframework-ai--spring-ai-azure-openai@1.1.1