catalan-adobe/video-digest

Summarize any video by analyzing both audio and visuals. Downloads via yt-dlp, extracts transcript (YouTube captions or Whisper), pulls scene-detected keyframes, and produces a multimodal summary with clickable timestamped YouTube links. Use this skill whenever the user wants to summarize a YouTube video, digest a talk or tutorial, get notes from a video, extract key points from a recording, or says things like "tl;dw", "summarize this video", "what's in this video", or pastes a YouTube URL and asks for a summary. Also triggers for non-YouTube URLs that yt-dlp supports.


Subagent Prompt Template

Spawn one Agent per chunk using this template. Replace all <placeholder> values with actual data from the chunk being analyzed.

Agent(
  subagent_type="general-purpose",
  description="Analyze: <chapter_title>",
  prompt="You are analyzing a segment of a video for summarization.
This is a RESEARCH-ONLY task — do not write any code or edit files.

## Video Info
Title: <title>
Channel: <channel>
Segment: <start> - <end> (<chapter_title>)
Summary depth: <brief|detailed|full>

## Transcript
<timestamped transcript lines for this segment>

## Visual Frames
IMPORTANT: You MUST read the contact sheet image(s) listed below
using the Read tool before writing your summary. These are scene-
detected keyframes with burned-in timestamps from this segment.

Contact sheet(s) for this segment:
- <absolute_path_to_sheet_NNN.jpg>

After reading, note what is visible: slides, code, diagrams, UI
state, text overlays, presenter gestures, or scene changes. If
the visuals are mostly static (e.g., talking head), say so briefly
and focus on the transcript — but still read the image to confirm.

## Task
Produce a section summary combining what is said (transcript) with
what is shown (frames). Adapt to the requested depth:

- brief: 1-2 sentences capturing the main point
- detailed: key topics, sub-points, and notable visual elements
- full: comprehensive notes including specific details, quotes,
  code snippets, diagram descriptions, and all visual context

For ALL depth levels, include:
- The most important timestamp(s) worth jumping to
- Any visual elements that add context beyond the audio
  (slides, diagrams, code, demos, UI, whiteboard)

Format your output as:
### <chapter_title> [MM:SS]
<summary content>

**Key moments:**
- [MM:SS] <description>

**Notable frames** (ONLY if genuinely important visual content —
diagrams, key slides, code, demos, charts, significant UI state):
- [MM:SS] what makes this frame important

Be aggressive in filtering: most frames are NOT notable. A talking
head, generic title slide, or static screen is NOT notable. Only flag
frames where the visual IS the content a reader would want to see.

Save output to <workdir>/chunk_<N>_summary.txt"
)
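The placeholder substitution described above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the helper names (`render_prompt`, `mmss`) and the chunk fields shown are assumptions about how a caller might organize its data.

```python
def mmss(seconds: float) -> str:
    """Format a second count as MM:SS for timestamps (illustrative helper)."""
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"


def render_prompt(template: str, chunk: dict) -> str:
    """Fill every <key> slot in the template with the chunk's value for that key.

    `chunk` maps placeholder names (without angle brackets) to their values;
    the key names are hypothetical, chosen to mirror the template above.
    """
    out = template
    for key, value in chunk.items():
        out = out.replace(f"<{key}>", str(value))
    return out


# Example: fill one line of the template for a chunk spanning 0s-125s.
filled = render_prompt(
    "Segment: <start> - <end> (<chapter_title>)",
    {"start": mmss(0), "end": mmss(125), "chapter_title": "Intro"},
)
```

Plain `str.replace` is enough here because the template's placeholders are literal `<name>` tokens; any placeholder missing from the chunk dict is simply left in place, which makes unfilled slots easy to spot before spawning the agent.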
