tessl/maven-dev-langchain4j--langchain4j-open-ai

LangChain4j OpenAI Integration providing Java access to OpenAI APIs including chat models, embeddings, image generation, audio transcription, and moderation.


Audio Transcription Models

Audio transcription models convert spoken audio to written text using OpenAI's Whisper and GPT-4o audio models. They support multiple audio formats and languages, plus optional speaker diarization for identifying different speakers in the audio.

The transcription API is ideal for meeting transcripts, podcast notes, voice memos, and accessibility features. Advanced models provide enhanced accuracy and speaker identification capabilities.

Capabilities

OpenAiAudioTranscriptionModel

Experimental synchronous audio transcription model that converts audio files to text. Supports various audio formats and optional parameters for language and prompts.

@Experimental
public class OpenAiAudioTranscriptionModel implements AudioTranscriptionModel {
    public static Builder builder();

    // Core transcription method
    public AudioTranscriptionResponse transcribe(AudioTranscriptionRequest audioRequest);

    // Model information
    public ModelProvider provider();
}

Builder

Builder for configuring OpenAiAudioTranscriptionModel instances.

public static class Builder {
    // Core configuration
    public Builder apiKey(String apiKey);
    public Builder baseUrl(String baseUrl);
    public Builder organizationId(String organizationId);
    public Builder projectId(String projectId);
    public Builder modelName(String modelName);
    public Builder modelName(OpenAiAudioTranscriptionModelName modelName);

    // HTTP configuration
    public Builder httpClientBuilder(HttpClientBuilder httpClientBuilder);
    public Builder timeout(Duration timeout);
    public Builder maxRetries(Integer maxRetries);

    // Logging
    public Builder logRequests(Boolean logRequests);
    public Builder logResponses(Boolean logResponses);
    public Builder logger(Logger logger);

    // Build
    public OpenAiAudioTranscriptionModel build();
}

Basic Usage Example

import dev.langchain4j.model.openai.OpenAiAudioTranscriptionModel;
import dev.langchain4j.model.openai.OpenAiAudioTranscriptionModelName;
import dev.langchain4j.model.audio.AudioTranscriptionRequest;
import dev.langchain4j.model.audio.AudioTranscriptionResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

// Create transcription model
OpenAiAudioTranscriptionModel model = OpenAiAudioTranscriptionModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(OpenAiAudioTranscriptionModelName.WHISPER_1)
    .build();

// Load audio file
byte[] audioData = Files.readAllBytes(Paths.get("meeting.mp3"));

// Create transcription request
AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("meeting.mp3")
    .build();

// Transcribe audio
AudioTranscriptionResponse response = model.transcribe(request);
System.out.println("Transcription: " + response.text());

Advanced Configuration Example

import java.time.Duration;

// Create model with advanced settings
OpenAiAudioTranscriptionModel model = OpenAiAudioTranscriptionModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(OpenAiAudioTranscriptionModelName.GPT_4_O_TRANSCRIBE)
    .timeout(Duration.ofMinutes(5))  // Long audio files need more time
    .maxRetries(3)
    .logRequests(true)
    .logResponses(true)
    .build();

// Load audio
byte[] audioData = Files.readAllBytes(Paths.get("podcast.mp3"));

// Create request with language hint and prompt
AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("podcast.mp3")
    .language("en")  // English
    .prompt("This is a podcast about artificial intelligence and machine learning.")
    .temperature(0.2)  // Lower temperature for more consistent output
    .responseFormat("verbose_json")  // Get detailed response with timestamps
    .build();

AudioTranscriptionResponse response = model.transcribe(request);
System.out.println("Transcription:\n" + response.text());

Speaker Diarization Example

// Use diarization model to identify different speakers
OpenAiAudioTranscriptionModel diarizeModel = OpenAiAudioTranscriptionModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(OpenAiAudioTranscriptionModelName.GPT_4_O_TRANSCRIBE_DIARIZE)
    .build();

byte[] meetingAudio = Files.readAllBytes(Paths.get("team_meeting.mp3"));

AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(meetingAudio)
    .fileName("team_meeting.mp3")
    .responseFormat("verbose_json")
    .build();

AudioTranscriptionResponse response = diarizeModel.transcribe(request);

// Response includes speaker labels
System.out.println("Meeting transcript with speakers:");
System.out.println(response.text());

Multi-Language Support Example

// Transcribe audio in different languages
String[] languages = {"es", "fr", "de", "ja", "zh"};
String[] audioFiles = {
    "spanish.mp3", "french.mp3", "german.mp3",
    "japanese.mp3", "chinese.mp3"
};

OpenAiAudioTranscriptionModel model = OpenAiAudioTranscriptionModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(OpenAiAudioTranscriptionModelName.WHISPER_1)
    .build();

for (int i = 0; i < languages.length; i++) {
    byte[] audioData = Files.readAllBytes(Paths.get(audioFiles[i]));

    AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
        .audioData(audioData)
        .fileName(audioFiles[i])
        .language(languages[i])
        .build();

    AudioTranscriptionResponse response = model.transcribe(request);
    System.out.println(languages[i] + ": " + response.text());
}

Batch Processing Example

import java.io.File;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class AudioBatchTranscriber {
    private final OpenAiAudioTranscriptionModel model;

    public AudioBatchTranscriber(String apiKey) {
        this.model = OpenAiAudioTranscriptionModel.builder()
            .apiKey(apiKey)
            .modelName(OpenAiAudioTranscriptionModelName.WHISPER_1)
            .timeout(Duration.ofMinutes(5))
            .build();
    }

    public List<String> transcribeDirectory(String directoryPath) throws Exception {
        File directory = new File(directoryPath);
        File[] audioFiles = directory.listFiles(
            f -> f.getName().endsWith(".mp3") ||
                 f.getName().endsWith(".wav") ||
                 f.getName().endsWith(".m4a")
        );

        if (audioFiles == null) {
            throw new IllegalArgumentException("Not a directory: " + directoryPath);
        }

        List<String> transcriptions = new ArrayList<>();

        for (File file : audioFiles) {
            System.out.println("Transcribing: " + file.getName());

            byte[] audioData = Files.readAllBytes(file.toPath());

            AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audioData(audioData)
                .fileName(file.getName())
                .build();

            AudioTranscriptionResponse response = model.transcribe(request);
            transcriptions.add(response.text());

            // Rate limiting
            Thread.sleep(1000);
        }

        return transcriptions;
    }
}

// Usage
AudioBatchTranscriber transcriber = new AudioBatchTranscriber(apiKey);
List<String> results = transcriber.transcribeDirectory("./audio_files/");

Model Names

@Experimental
public enum OpenAiAudioTranscriptionModelName {
    WHISPER_1("whisper-1"),
    GPT_4_O_TRANSCRIBE("gpt-4o-transcribe"),
    GPT_4_O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe"),
    GPT_4_O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize");

    public String toString();
}

Model Comparison

| Model | Accuracy | Speed | Diarization | Cost |
|---|---|---|---|---|
| whisper-1 | Good | Fast | No | Low |
| gpt-4o-transcribe | Excellent | Medium | No | Medium |
| gpt-4o-mini-transcribe | Very Good | Fast | No | Low-Medium |
| gpt-4o-transcribe-diarize | Excellent | Slow | Yes | High |

Types

AudioTranscriptionRequest

public class AudioTranscriptionRequest {
    public static Builder builder();

    public byte[] audioData();
    public String fileName();
    public String language();
    public String prompt();
    public Double temperature();
    public String responseFormat();
}

public static class Builder {
    public Builder audioData(byte[] audioData);
    public Builder fileName(String fileName);
    public Builder language(String language);
    public Builder prompt(String prompt);
    public Builder temperature(Double temperature);
    public Builder responseFormat(String responseFormat);
    public AudioTranscriptionRequest build();
}

AudioTranscriptionResponse

public class AudioTranscriptionResponse {
    public String text();
    public String language();
    public Double duration();
    public List<Segment> segments();
}

AudioTranscriptionModel Interface

public interface AudioTranscriptionModel {
    AudioTranscriptionResponse transcribe(AudioTranscriptionRequest request);
}

Configuration Options

Audio Data

The raw audio file bytes:

  • Required field
  • Supports multiple formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
  • Maximum file size: 25 MB
  • Longer files may need to be split
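
The 25 MB limit can be checked before uploading to avoid a failed request. A minimal sketch (the `AudioSizeCheck` class and its constant are illustrative helpers, not part of the library):

```java
public class AudioSizeCheck {
    // The transcription endpoint rejects uploads larger than 25 MB
    static final long MAX_UPLOAD_BYTES = 25L * 1024 * 1024;

    // Returns true if the audio payload fits the upload limit
    public static boolean withinUploadLimit(long fileSizeBytes) {
        return fileSizeBytes <= MAX_UPLOAD_BYTES;
    }
}
```

Files that exceed the limit should be split into segments (see "Split Long Files" below) or re-encoded at a lower bit rate.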

File Name

Name of the audio file:

  • Required field
  • Used to determine audio format
  • Should include proper extension (.mp3, .wav, etc.)

Language

ISO-639-1 language code:

  • Optional but recommended
  • Improves accuracy and speed
  • Examples: "en" (English), "es" (Spanish), "fr" (French)
  • Whisper supports 50+ languages

Prompt

Optional text to guide transcription:

  • Helps with context, terminology, spelling
  • Maintains style and formatting
  • Should be in the same language as audio
  • Maximum: 224 tokens

Example prompts:

// Technical content
.prompt("This is a discussion about machine learning, neural networks, and deep learning algorithms.")

// Names and terminology
.prompt("The speakers are Dr. Smith and Prof. Johnson discussing quantum computing.")

// Formatting hints
.prompt("The transcript should include technical terms like API, SDK, and REST.")

Temperature

Controls randomness in transcription:

  • Range: 0.0 to 1.0
  • Default: 0.0
  • 0.0: Most deterministic and accurate
  • Higher values: More creative, less literal
  • Usually keep at 0.0 for accuracy

Response Format

Format of the transcription output:

  • "json": Simple JSON with text only (default)
  • "text": Plain text
  • "srt": SubRip subtitle format
  • "verbose_json": Detailed JSON with timestamps and metadata
  • "vtt": WebVTT subtitle format

Timeout

Maximum time to wait for transcription:

  • Default varies by client
  • Audio processing can take time
  • Rule of thumb: allow at least 10-30% of the audio duration, plus a buffer for upload
  • For 10-minute audio, a timeout of 2-5 minutes is a safe starting point
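
One way to turn that rule of thumb into a concrete value is to budget the upper end of the processing estimate plus a fixed cushion. A sketch (the `TimeoutEstimator` helper is illustrative, not part of LangChain4j):

```java
import java.time.Duration;

public class TimeoutEstimator {
    // Processing typically takes ~10-30% of the audio duration; budget the
    // upper end plus a fixed cushion for upload and queueing.
    public static Duration estimateTimeout(Duration audioDuration) {
        long cushionSeconds = 30;
        long processingSeconds = (long) Math.ceil(audioDuration.toSeconds() * 0.30);
        return Duration.ofSeconds(processingSeconds + cushionSeconds);
    }
}
```

The result can be passed straight to the builder's `timeout(...)` method.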

Max Retries

Number of retry attempts on failure:

  • Default: 2
  • Automatic retry on transient failures
  • Exponential backoff between retries
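
Exponential backoff means the delay roughly doubles with each retry, up to a cap. A sketch of that schedule (an illustrative calculation, not the library's actual retry internals):

```java
import java.time.Duration;

public class RetryBackoff {
    // Delay for the given attempt: base delay doubled per attempt, capped
    public static Duration delayForAttempt(int attempt, Duration base, Duration cap) {
        long millis = base.toMillis() * (1L << Math.min(attempt, 20));
        return millis >= cap.toMillis() ? cap : Duration.ofMillis(millis);
    }
}
```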

Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| MP3 | .mp3 | Most common, good compression |
| MP4 | .mp4, .m4a | Audio from video files |
| MPEG | .mpeg, .mpga | Standard audio format |
| WAV | .wav | Uncompressed, larger files |
| WebM | .webm | Modern web format |
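
Since the file name determines the detected format, it is worth validating the extension before building a request. A minimal sketch (the `AudioFormatCheck` helper is hypothetical, not part of the library):

```java
import java.util.Set;

public class AudioFormatCheck {
    // Extensions accepted by the transcription endpoint, per the table above
    static final Set<String> SUPPORTED =
        Set.of("mp3", "mp4", "m4a", "mpeg", "mpga", "wav", "webm");

    // Returns true if the file name carries a supported audio extension
    public static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        return SUPPORTED.contains(fileName.substring(dot + 1).toLowerCase());
    }
}
```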

Best Practices

Preparing Audio Files

Optimize File Size:

// Convert large WAV files to MP3 to reduce size and upload time
// Use external tools like ffmpeg before uploading

Split Long Files:

public List<AudioTranscriptionResponse> transcribeLongAudio(
    byte[] audioData,
    int segmentSizeMinutes
) throws Exception {
    // Split audio into segments (use audio processing library)
    List<byte[]> segments = splitAudio(audioData, segmentSizeMinutes);

    List<AudioTranscriptionResponse> transcriptions = new ArrayList<>();

    for (int i = 0; i < segments.size(); i++) {
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audioData(segments.get(i))
            .fileName("segment_" + i + ".mp3")
            .prompt(getContextFromPrevious(transcriptions))  // Maintain context
            .build();

        AudioTranscriptionResponse response = model.transcribe(request);
        transcriptions.add(response);
    }

    return transcriptions;
}

Improving Accuracy

Specify Language:

// Better accuracy and faster processing
AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("audio.mp3")
    .language("en")  // Specify when known
    .build();

Provide Context:

// Help with technical terms and proper nouns
AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("tech_talk.mp3")
    .prompt("Discussion about LangChain4j, OpenAI API, and Java programming.")
    .build();

Use High-Quality Audio:

  • Clear recording environment
  • Minimal background noise
  • Good microphone quality
  • Appropriate bit rate (128 kbps or higher)

Handling Different Audio Types

Podcasts:

AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("podcast.mp3")
    .responseFormat("verbose_json")  // Get timestamps for chapters
    .prompt("This is a podcast with intro music and multiple segments.")
    .build();

Meetings:

// Use diarization for speaker identification
OpenAiAudioTranscriptionModel model = OpenAiAudioTranscriptionModel.builder()
    .apiKey(apiKey)
    .modelName(OpenAiAudioTranscriptionModelName.GPT_4_O_TRANSCRIBE_DIARIZE)
    .build();

AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("meeting.mp3")
    .responseFormat("verbose_json")
    .build();

Lectures:

AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
    .audioData(audioData)
    .fileName("lecture.mp3")
    .prompt("University lecture on quantum physics by Professor Anderson.")
    .temperature(0.0)  // Maximum accuracy
    .build();

Post-Processing Transcriptions

public class TranscriptionProcessor {
    public String cleanTranscription(String rawText) {
        return rawText
            .replaceAll("\\s+", " ")  // Normalize whitespace
            .replaceAll("([.!?])([A-Z])", "$1 $2")  // Add space after punctuation
            .trim();
    }

    public String addTimestamps(AudioTranscriptionResponse response) {
        if (response.segments() == null) {
            return response.text();
        }

        StringBuilder formatted = new StringBuilder();
        for (Segment segment : response.segments()) {
            formatted.append(formatTime(segment.start()))
                     .append(" - ")
                     .append(formatTime(segment.end()))
                     .append(": ")
                     .append(segment.text())
                     .append("\n");
        }
        return formatted.toString();
    }

    private String formatTime(double seconds) {
        int hours = (int) (seconds / 3600);
        int minutes = (int) ((seconds % 3600) / 60);
        int secs = (int) (seconds % 60);
        return String.format("%02d:%02d:%02d", hours, minutes, secs);
    }
}

Common Use Cases

Meeting Transcription

Automatically transcribe meetings for notes and action items.

Podcast Transcription

Convert podcasts to text for show notes and SEO.

Voice Memos

Transcribe voice recordings into searchable text.

Accessibility

Provide captions and transcripts for audio content.

Content Analysis

Extract insights from customer calls or interviews.

Language Learning

Transcribe foreign language audio for study materials.

Legal/Medical Transcription

Convert recordings to text for documentation (review for accuracy).

Performance Considerations

Processing Time

  • Typically ~10-30% of audio duration
  • Example: 10-minute audio = ~1-3 minutes processing
  • Longer audio takes proportionally longer
  • Diarization adds ~50% more time

File Size Limits

  • Maximum: 25 MB per file
  • Split larger files into segments
  • Use compressed formats (MP3) over uncompressed (WAV)

Cost

Charged per second of audio:

  • whisper-1: Lower cost
  • gpt-4o variants: Higher cost
  • Diarization: Highest cost
  • Check current pricing for exact rates

Rate Limits

  • Varies by account tier
  • Consider delays between requests for batch processing
  • Monitor API rate limit headers

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-open-ai@1.11.0
