Summarize any video by analyzing both audio and visuals. Downloads via yt-dlp, extracts transcript (YouTube captions or Whisper), pulls scene-detected keyframes, and produces a multimodal summary with clickable timestamped YouTube links. Use this skill whenever the user wants to summarize a YouTube video, digest a talk or tutorial, get notes from a video, extract key points from a recording, or says things like "tl;dw", "summarize this video", "what's in this video", or pastes a YouTube URL and asks for a summary. Also triggers for non-YouTube URLs that yt-dlp supports.
Summarize a video by combining audio transcript with visual frame analysis. Produces a markdown summary with clickable timestamped links back to the source video.
Video URL (or local file)
0. Check dependencies (yt-dlp, ffmpeg)
1. Download video + metadata
2. Extract transcript (YouTube captions first, Whisper fallback)
3. Extract keyframes via scene detection
4. Segment into chapters or time-based chunks
5. Parallel subagents analyze chunks (frames + transcript)
6. Synthesize final summary
7. Write markdown file + print stdout preview
All yt-dlp, ffmpeg, and whisper operations go through the helper script bundled with this skill at scripts/video-digest.sh.
Locating the script:
if [[ -n "${CLAUDE_SKILL_DIR:-}" ]]; then
  DIGEST_SH="${CLAUDE_SKILL_DIR}/scripts/video-digest.sh"
else
  DIGEST_SH="$(command -v video-digest.sh 2>/dev/null || \
    find ~/.claude -path "*/video-digest/scripts/video-digest.sh" -type f 2>/dev/null | head -1)"
fi
if [[ -z "$DIGEST_SH" || ! -f "$DIGEST_SH" ]]; then
  echo "Error: video-digest.sh not found. Ask the user for the path." >&2
fi
Store the result in DIGEST_SH and use it for all subsequent commands.
Commands:
- deps: check and report dependency status
- download <url> [workdir]: download video + metadata + thumbnail
- transcript <workdir> [--force-whisper] [--lang LANG]: extract captions or transcribe
- frames <video> [threshold] [workdir]: scene-detect keyframes + contact sheets
- info <workdir>: parse metadata, report title/duration/chapters

Parse these from the user's message or ask if ambiguous:
| Flag | Default | Purpose |
|---|---|---|
| --depth | detailed | brief, detailed, or full |
| --force-whisper | off | Skip YouTube captions, transcribe with whisper-ctranslate2 |
| --scene-threshold | 0.3 | ffmpeg scene detection sensitivity (0.1-1.0) |
| --lang | en | Subtitle/transcription language code |
Locate the script and check dependencies:
"$DIGEST_SH" deps
If any required dependency is missing, stop and offer to install:
brew install yt-dlp ffmpeg
Do NOT proceed to Step 1 until both yt-dlp and ffmpeg are confirmed available. Re-run deps after installation to verify.
Optional: whisper-ctranslate2 (only needed with --force-whisper).
Auto-installed on first use via uv tool install whisper-ctranslate2.
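As a rough illustration of the kind of check the deps command performs, here is a minimal sketch. The bundled script's actual implementation is not shown in this document, and `missing_deps` is a hypothetical helper name used only for this example.

```shell
# Hypothetical helper, not part of video-digest.sh: print each named
# command that cannot be found on PATH, one per line.
missing_deps() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Usage: report anything missing before moving on to Step 1.
missing="$(missing_deps yt-dlp ffmpeg)"
if [ -n "$missing" ]; then
  echo "Missing required dependencies:" $missing >&2
fi
```

An empty result means every named tool was found, which is the condition for proceeding.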
Ask the user for the video URL if not already provided. Then download:
"$DIGEST_SH" download "<url>" "<workdir>"
The work directory defaults to ./video_digest_<video_id>/ in the current working directory. After download, report to the user: title, channel, duration, whether chapters are available, and file size.
Steps 2 and 3 are independent, so run them simultaneously using two Bash tool calls in the same message. This saves significant time, especially on longer videos where both operations are slow.
"$DIGEST_SH" transcript "<workdir>" [--force-whisper] [--lang en]
Default path: Extracts YouTube captions (manual preferred over auto-generated). Parses the VTT file into timestamped segments.
--force-whisper path: Extracts audio as WAV, transcribes with
whisper-ctranslate2 (faster-whisper backend). Produces timestamped SRT
which is converted to our format automatically.
Auto-fallback: If no YouTube captions exist and --force-whisper
was not specified, the script auto-engages Whisper and prints a
notice. Inform the user this is happening and that it takes longer.
The transcript is saved as <workdir>/transcript.txt with timestamps:
[00:00:05] Welcome to this talk about...
[00:00:12] Today we'll cover three topics...
"$DIGEST_SH" frames "<workdir>/<video_file>" [threshold] "<workdir>"
Default scene threshold is 0.3 (tunable by user). The script records each keyframe's timecode in <workdir>/frames/timecodes.txt.
Report: number of keyframes extracted, number of contact sheets.
If very few frames are extracted (< 5 for a video > 2 minutes),
suggest the user lower the threshold: "Only N frames detected.
Try --scene-threshold 0.2 for more granularity?"
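The low-frame-count rule above can be sketched as a small predicate. `too_few_frames` is a hypothetical name, and the frame count and duration are assumed to be already known as whole numbers (duration in seconds).

```shell
# Hypothetical helper: succeed (exit 0) when the frame count looks too low.
# $1 = number of keyframes extracted, $2 = video duration in seconds.
too_few_frames() {
  [ "$1" -lt 5 ] && [ "$2" -gt 120 ]
}

# Example values; in practice these would come from the frames and info steps.
frames=3
duration=600
if too_few_frames "$frames" "$duration"; then
  echo "Only $frames frames detected. Try --scene-threshold 0.2 for more granularity?"
fi
```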
Parse the metadata JSON for chapter information:
"$DIGEST_SH" info "<workdir>"
Short videos (< 10 minutes): Skip chunking entirely. Read the full transcript AND every contact sheet image (using the Read tool) in a single analysis pass (Step 5 without subagents). This avoids unnecessary overhead for content that fits in one context window.
Longer videos with chapters: Use chapters as segment boundaries.
Longer videos without chapters: Split into ~10-minute chunks, aligning boundaries to the nearest keyframe timecode.
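One way to compute chunk boundaries for a video without chapters is sketched below. It assumes keyframe times are available as whole seconds, one per line; the real layout of timecodes.txt is not specified in this document, so treat that as an assumption. `chunk_starts` is a hypothetical helper name.

```shell
# Hypothetical sketch. Reads keyframe times (seconds, one per line) on stdin
# and prints chunk start times: 0, then each nominal boundary (600, 1200, ...)
# snapped to the nearest keyframe. $1 = duration in seconds, $2 = chunk length
# in seconds (defaults to 600, i.e. ~10 minutes).
chunk_starts() {
  awk -v dur="$1" -v len="${2:-600}" '
    { t[NR] = $1 + 0 }                      # collect keyframe times
    END {
      print 0
      for (b = len; b < dur; b += len) {
        best = b                            # fall back to the nominal boundary
        bestd = -1
        for (i = 1; i <= NR; i++) {
          d = t[i] - b; if (d < 0) d = -d   # absolute distance to boundary
          if (bestd < 0 || d < bestd) { bestd = d; best = t[i] }
        }
        print best
      }
    }'
}

# A 25-minute video with five keyframes: prints 0, 580, 1215 (one per line).
printf '%s\n' 12 580 640 1215 1790 | chunk_starts 1500
```

The sketch does not merge a very short final chunk into its predecessor; that choice is left to the caller.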
For each chunk, prepare the transcript segment, contact sheet(s) covering that time window, and a title (chapter name or "Part N: MM:SS - MM:SS").
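Preparing a chunk's transcript segment amounts to filtering transcript.txt by timestamp. The sketch below relies on the `[HH:MM:SS]` line prefix shown earlier; `transcript_slice` is a hypothetical helper, and lines without a leading timestamp are simply skipped.

```shell
# Hypothetical helper: print transcript lines whose [HH:MM:SS] stamp falls in
# the half-open window [start, end), both given in seconds.
# $1 = transcript file, $2 = start seconds, $3 = end seconds.
transcript_slice() {
  awk -v start="$2" -v end="$3" '
    /^\[[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]/ {
      split(substr($0, 2, 8), p, ":")          # p = HH, MM, SS
      secs = p[1] * 3600 + p[2] * 60 + p[3]
      if (secs >= start && secs < end) print
    }' "$1"
}

# Demo on a throwaway file using the sample lines from this document.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
[00:00:05] Welcome to this talk about...
[00:00:12] Today we'll cover three topics...
EOF
transcript_slice "$tmp" 10 600   # prints only the 00:00:12 line
rm -f "$tmp"
```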
For short videos (< 10 min), read the transcript and all contact sheets yourself (Read tool) and produce the summary inline. Flag notable frames for the screenshot gallery. Skip to Step 6.
For longer videos, spawn one Agent per chunk, ALL IN PARALLEL. Each subagent receives both the transcript segment and contact sheet path(s). Never skip frames — on-screen text, graphics, and UI states add context absent from audio.
Mapping contact sheets to chunks: Use burned-in timestamps to determine which sheet(s) cover each chunk. A chunk may span two sheets — include both.
Spawn one Agent per chunk using the template in the subagent prompt. Fill in the placeholders with each chunk's title, time range, transcript segment, and contact sheet path(s).
After all agents complete, read their outputs. If any failed, re-run individually — the pipeline tolerates partial results but flag gaps to the user.
Combine all chunk summaries into a cohesive document. The video URL is needed for timestamp links — extract the video ID from the metadata JSON.
Build YouTube deep links using the format:
https://youtube.com/watch?v=<VIDEO_ID>&t=<SECONDS>
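Converting a display timestamp into the `t=<SECONDS>` parameter is simple arithmetic, sketched here with two hypothetical helpers (`ts_to_seconds` and `deep_link`). The placeholder ID `xxx` mirrors the preview example in this document.

```shell
# Hypothetical helpers. ts_to_seconds accepts MM:SS or HH:MM:SS.
ts_to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# $1 = video ID, $2 = timestamp (MM:SS or HH:MM:SS).
deep_link() {
  printf 'https://youtube.com/watch?v=%s&t=%s\n' "$1" "$(ts_to_seconds "$2")"
}

deep_link xxx 05:23   # prints https://youtube.com/watch?v=xxx&t=323
```

The conversion is done in awk rather than shell arithmetic because `$(( ))` treats zero-padded fields such as `08` as invalid octal.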
Prepare assets directory:
Create <workdir>/assets/ and populate it:
Thumbnail: Find the downloaded thumbnail in workdir (usually
<id>.webp or <id>.jpg from --write-thumbnail). Copy to
assets/<video_id>_thumbnail.jpg, converting if needed:
ffmpeg -y -loglevel error -i "<workdir>/<thumbnail>" "<workdir>/assets/<video_id>_thumbnail.jpg"
Screenshots: Collect notable frames flagged by subagents (or from your own analysis for short videos). Match their timestamps against frames/timecodes.txt (line N = frame_<N zero-padded to 4>.jpg) to find the closest frame file. Copy selected frames to assets/<video_id>_screenshot_01.jpg, <video_id>_screenshot_02.jpg, etc.
Only include frames with genuine visual importance (diagrams, slides, code, charts, UI states). Aim for 3-8 screenshots. If none are notable, omit the Screenshots section.
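The timestamp-to-frame lookup can be sketched as a nearest-neighbor search over timecodes.txt. This assumes each line begins with a time in seconds, which this document does not confirm; only the line-number-to-filename mapping comes from the text above. `nearest_frame` is a hypothetical helper.

```shell
# Hypothetical helper: print the frame file closest to a target time.
# $1 = target time in seconds, $2 = timecodes file (assumed: one time in
# seconds per line, where line N maps to frame_<N padded to 4 digits>.jpg).
nearest_frame() {
  awk -v target="$1" '
    {
      d = $1 - target; if (d < 0) d = -d       # absolute distance
      if (NR == 1 || d < best) { best = d; line = NR }
    }
    END { if (NR > 0) printf "frame_%04d.jpg\n", line }' "$2"
}

# Demo: the second line (580.2s) is closest to 600s.
tmp="$(mktemp)"
printf '%s\n' 12.5 580.2 640.0 > "$tmp"
nearest_frame 600 "$tmp"   # prints frame_0002.jpg
rm -f "$tmp"
```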
For local files (no yt-dlp download), omit the URL line, and omit the thumbnail if none was downloaded.
Structure the final markdown:
# Video Digest: <Title>
**Channel:** <uploader> | **Duration:** <duration> | **Date:** <date>\
**URL:** <source-url>

## tl;dw
<2-3 sentence overview — always present regardless of depth>
## Contents
- [Chapter Title](#section-anchor) ([MM:SS](youtube-deep-link))
- ...
## <Chapter Title>
<section summary with inline timestamp links>
...
## Key Moments
- [MM:SS](youtube-deep-link) — <description>
- ...
## Screenshots

*[MM:SS](youtube-deep-link) — <description>*
...
For brief depth: tl;dw + Contents with one-line descriptions only.
For detailed depth: full structure as above.
For full depth: full structure plus exhaustive notes per section.
Screenshots are included at all depth levels when notable frames exist.
Save the full summary to <workdir>/digest.md.
Print a condensed preview to stdout:
Video Digest: <Title>
Duration: MM:SS | Sections: N | Frames analyzed: N | Screenshots: N
tl;dw
<2-3 sentence overview>
Sections
- [00:00 - Introduction](https://youtube.com/watch?v=xxx&t=0)
- [05:23 - Setting up the project](https://youtube.com/watch?v=xxx&t=323)
- ...
Full summary: <workdir>/digest.md
Assets: <workdir>/assets/
SKILL.md to ~/.claude/commands/video-digest.md
scripts/video-digest.sh to ~/.local/bin/video-digest.sh and chmod +x it
command -v video-digest.sh