Guide for implementing Google Gemini API audio capabilities - analyze audio with transcription, summarization, and understanding (up to 9.5 hours), plus generate speech with controllable TTS. Use when processing audio files, creating transcripts, analyzing speech/music/sounds, or generating natural speech from text.
Process audio with transcription, analysis, and understanding, plus generate natural speech using Google's Gemini API. Supports up to 9.5 hours of audio per request with multiple formats.
Use this skill when you need to:

- Transcribe speech from audio files
- Summarize or analyze audio content (speech, music, ambient sound)
- Answer questions about specific audio segments
- Generate natural speech from text
The skill automatically detects your GEMINI_API_KEY in this order:

1. Environment variable: `export GEMINI_API_KEY="your-key"`
2. `.claude/skills/gemini-audio/.env`
3. `./.env` (project root)

Get your API key: visit Google AI Studio.
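The detection order above can be sketched as follows. This is a hypothetical helper, not part of the skill's bundled scripts; the function name and `.env` parsing are assumptions:

```python
import os
from pathlib import Path

def find_api_key(env=os.environ,
                 search_paths=('.claude/skills/gemini-audio/.env', '.env')):
    """Return GEMINI_API_KEY from the environment first, then from .env files."""
    # Step 1: environment variable wins
    if env.get('GEMINI_API_KEY'):
        return env['GEMINI_API_KEY']
    # Steps 2-3: fall back to .env files in search order
    for path in search_paths:
        p = Path(path)
        if p.is_file():
            for line in p.read_text().splitlines():
                if line.startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip().strip('"')
    return None
```

The environment variable always takes precedence, so a project `.env` never overrides an explicitly exported key.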
Create a `.env` file with:

```
GEMINI_API_KEY=your_api_key_here
```

Install the required package:

```shell
pip install google-genai
```

Quick start:

```python
from google import genai
import os

# API key auto-detected from the environment
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload an audio file
myfile = client.files.upload(file='podcast.mp3')

# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)

# Summarize
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize the key points in 5 bullets.', myfile]
)
print(response.text)
```

Bundled scripts:

```shell
# Transcribe audio
python .claude/skills/gemini-audio/scripts/transcribe.py audio.mp3

# Summarize audio
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
  "Summarize key points"

# Analyze a specific segment (timestamps in MM:SS format)
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
  "What is discussed from 02:30 to 05:15?"

# Generate speech
python .claude/skills/gemini-audio/scripts/generate-speech.py \
  "Welcome to our podcast" \
  --output welcome.wav
```

Supported audio formats:

| Format | MIME Type | Best Use |
|---|---|---|
| WAV | audio/wav | Uncompressed, highest quality |
| MP3 | audio/mp3 | Compressed, widely compatible |
| AAC | audio/aac | Compressed, good quality |
| FLAC | audio/flac | Lossless compression |
| OGG Vorbis | audio/ogg | Open format |
| AIFF | audio/aiff | Apple format |
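When passing raw bytes to the API, the MIME type from the table must be supplied explicitly. A small lookup helper keeps that mapping in one place (hypothetical, not part of the skill's scripts; extensions are assumed from the table above):

```python
from pathlib import Path

# Extension-to-MIME mapping taken from the format table above
AUDIO_MIME_TYPES = {
    '.wav': 'audio/wav',
    '.mp3': 'audio/mp3',
    '.aac': 'audio/aac',
    '.flac': 'audio/flac',
    '.ogg': 'audio/ogg',
    '.aiff': 'audio/aiff',
}

def audio_mime_type(filename):
    """Look up the MIME type for a supported audio file extension."""
    ext = Path(filename).suffix.lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f'Unsupported audio format: {ext or filename}')
    return AUDIO_MIME_TYPES[ext]
```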
TTS models:

| Model | Quality | Speed | Cost/1M tokens |
|---|---|---|---|
| gemini-2.5-flash-native-audio-preview-09-2025 | High | Fast | $10 |
| gemini-2.5-pro TTS mode | Premium | Slower | $20 |
Basic speech generation (the audio comes back as raw 16-bit PCM inline data, following the documented response shape, so it must be wrapped in a WAV container before saving):

```python
import wave
from google.genai import types

response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents="Generate audio: Welcome to today's episode, in a warm, friendly tone.",
    config=types.GenerateContentConfig(response_modalities=['AUDIO']),
)

# Extract the raw PCM bytes and save them as a 24 kHz, 16-bit mono WAV
audio_bytes = response.candidates[0].content.parts[0].inline_data.data
with wave.open('output.wav', 'wb') as f:
    f.setparams((1, 2, 24000, 0, 'NONE', 'not compressed'))
    f.writeframes(audio_bytes)
```

Upload a file once and reuse it across multiple requests:
```python
myfile = client.files.upload(file='large-audio.mp3')

# Use the file multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
```

For small files, audio bytes can be passed inline instead:

```python
from google.genai import types

with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
```

More script examples:

```shell
python scripts/transcribe.py meeting.mp3 --include-timestamps
python scripts/analyze.py interview.wav "Extract main topics and key quotes"
python scripts/analyze.py discussion.mp3 "Identify speakers and extract dialogue"
python scripts/analyze.py podcast.mp3 "Summarize content from 10:30 to 15:45"
python scripts/analyze.py ambient.wav "Identify all sounds: voices, music, ambient"
```

Model selection:

- `gemini-2.5-flash` ($1/1M tokens) for most tasks
- `gemini-2.5-pro` ($3/1M tokens) for complex analysis

Audio input: 32 tokens per second of audio.
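The choice between the Files API and inline bytes can be automated by file size. The threshold below is an assumption (the commonly documented ~20 MB request-size limit for inline data); verify it against the current API documentation:

```python
import os

# Assumed request-size limit for inline audio data (~20 MB); an assumption,
# not confirmed by this guide - check the current Gemini API docs
MAX_INLINE_BYTES = 20 * 1024 * 1024

def should_upload(path):
    """Return True when a file is too large to send inline and should use the Files API."""
    return os.path.getsize(path) >= MAX_INLINE_BYTES
```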
Model pricing: `gemini-2.5-flash` at $1 per 1M tokens; `gemini-2.5-pro` at $3 per 1M tokens.

TTS pricing: $10 per 1M tokens (`gemini-2.5-flash-native-audio-preview-09-2025`) to $20 per 1M tokens (`gemini-2.5-pro` TTS mode); see the model table above.
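Given the 32 tokens/second audio rate, a back-of-the-envelope cost estimate is straightforward. A sketch using the rates quoted in this section (the helper itself is hypothetical):

```python
def estimate_audio_cost(duration_seconds, price_per_million=1.00, tokens_per_second=32):
    """Estimate input tokens and cost for an audio request.

    Defaults assume gemini-2.5-flash at $1 per 1M tokens and the
    32 tokens/second audio rate quoted above.
    """
    tokens = duration_seconds * tokens_per_second
    cost = tokens * price_per_million / 1_000_000
    return tokens, cost

# A 30-minute podcast with gemini-2.5-flash: 57,600 input tokens
tokens, cost = estimate_audio_cost(30 * 60)
```

For `gemini-2.5-pro`, pass `price_per_million=3.00`; output tokens for the generated text are billed separately.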
For detailed information, see:

- `references/api-reference.md` - Complete API specifications
- `references/code-examples.md` - Comprehensive code examples
- `references/tts-guide.md` - Text-to-speech implementation guide
- `references/best-practices.md` - Advanced optimization strategies

All scripts support the same 3-step API key detection (environment variable, skill `.env`, project `.env`).
Run any script with `--help` for detailed usage.