CtrlK
BlogDocsLog inGet started
Tessl Logo

groq-core-workflow-b

Execute Groq secondary workflows: audio transcription (Whisper), vision, text-to-speech, and batch model evaluation. Trigger with phrases like "groq whisper", "groq transcription", "groq audio", "groq vision", "groq TTS", "groq speech".

84

Quality

82%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

SecuritybySnyk

Advisory

Suggest reviewing before use

SKILL.md
Quality
Evals
Security

Groq Core Workflow B: Audio, Vision & Speech

Overview

Beyond chat completions, Groq provides ultra-fast audio transcription (Whisper at 216x real-time), multimodal vision (Llama 4 Scout/Maverick), and text-to-speech. These endpoints use the same groq-sdk client.

Prerequisites

  • groq-sdk installed, GROQ_API_KEY set
  • For audio: audio files in supported formats
  • For vision: image URLs or base64 images

Audio Models

Model IDLanguagesSpeedBest For
whisper-large-v3100+164x real-timeBest accuracy, multilingual
whisper-large-v3-turbo100+216x real-timeBest speed/accuracy balance

Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm

Instructions

Step 1: Audio Transcription (Whisper)

import Groq from "groq-sdk";
import fs from "fs";

const groq = new Groq();

// Transcribe audio file
async function transcribe(filePath: string): Promise<string> {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "json",        // or "text" or "verbose_json"
    language: "en",                  // Optional: ISO 639-1 code
  });

  return transcription.text;
}

// With timestamps (verbose mode)
async function transcribeWithTimestamps(filePath: string) {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "verbose_json",
    timestamp_granularities: ["segment"],
  });

  return transcription;
  // Returns segments with start/end times
}

Step 2: Audio Translation (to English)

// Translate any language audio to English text
async function translateAudio(filePath: string): Promise<string> {
  const translation = await groq.audio.translations.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3",
  });

  return translation.text;
}

Step 3: Vision (Image Understanding)

// Analyze images with Llama 4 Scout (up to 5 images per request)
async function analyzeImage(imageUrl: string, question: string) {
  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
    max_tokens: 1024,
  });

  return completion.choices[0].message.content;
}

// Multiple images
async function compareImages(urls: string[], prompt: string) {
  const imageContent = urls.map((url) => ({
    type: "image_url" as const,
    image_url: { url },
  }));

  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [{ type: "text", text: prompt }, ...imageContent],
    }],
    max_tokens: 2048,
  });

  return completion.choices[0].message.content;
}

// Base64 image input
async function analyzeBase64Image(base64Data: string) {
  return groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe this image in detail." },
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${base64Data}` },
        },
      ],
    }],
  });
}

Step 4: Text-to-Speech

// Generate speech from text
async function textToSpeech(text: string, outputPath: string) {
  const response = await groq.audio.speech.create({
    model: "playai-tts",          // or "playai-tts-arabic"
    input: text,
    voice: "Arista-PlayAI",      // See Groq docs for voice options
    response_format: "wav",       // wav, mp3, flac, opus, aac
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(outputPath, buffer);
  console.log(`Audio saved to ${outputPath}`);
}

Step 5: Python Audio Transcription

from groq import Groq

client = Groq()

# Transcribe
with open("audio.mp3", "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", file),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",
    )
    print(transcription.text)
    for segment in transcription.segments:
        print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

Step 6: Model Benchmarking

// Compare models on same prompt for speed vs quality
async function benchmarkModels(prompt: string) {
  const models = [
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "llama-3.3-70b-specdec",
  ];

  for (const model of models) {
    const start = performance.now();
    const result = await groq.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: 200,
    });
    const elapsed = performance.now() - start;
    const tps = result.usage!.completion_tokens / ((result.usage as any).completion_time || 1);

    console.log(
      `${model.padEnd(45)} | ${elapsed.toFixed(0)}ms | ${tps.toFixed(0)} tok/s | ${result.usage!.total_tokens} tokens`
    );
  }
}

Vision Model Limits

  • Maximum 5 images per request
  • Supported formats: JPEG, PNG, GIF, WebP
  • Images fetched from URL or embedded as base64
  • Vision models also support tool use, JSON mode, and streaming

Error Handling

ErrorCauseSolution
Invalid file formatUnsupported audio typeConvert to mp3/wav/flac first
File too largeAudio exceeds 25MBSplit into smaller chunks
model_not_foundVision model ID wrongUse full path: meta-llama/llama-4-scout-17b-16e-instruct
max_images_exceeded>5 images in requestReduce to 5 or fewer images
429 on WhisperAudio RPM limit hitQueue transcription requests

Resources

  • Groq Speech-to-Text
  • Groq Text-to-Speech
  • Groq Vision
  • Groq Models

Next Steps

For common errors and troubleshooting, see groq-common-errors.

Repository
jeremylongshore/claude-code-plugins-plus-skills
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.