
jbaruch/speaker-toolkit

Six-skill presentation system: ingest talks into a rhetoric vault, run interactive clarification, generate a speaker profile, create presentations that match your documented patterns, produce the deck illustrations + thumbnail visual layer, and publish talk pages to a Jekyll shownotes site. Includes a 102-entry Presentation Patterns taxonomy (91 observable, 11 unobservable go-live items) for scoring, brainstorming, and go-live preparation.


skills/vault-ingress/references/video-slide-extraction.md

Video Slide Extraction — Technical Reference

Extract slide images from conference talk videos when no PPTX or PDF is available. This is the fourth slide acquisition path — used when a talk has video_url but neither slides_url nor pptx_path.

Prerequisites

  • yt-dlp (video download)
  • ffmpeg (frame extraction)
  • Python packages: imagehash, Pillow (perceptual deduplication)

Install Python dependencies:

"{python_path}" -m pip install imagehash Pillow

When to Use

Set slide_source: "video_extracted" when:

  • Talk has video_url but no slides_url and no pptx_path
  • The video shows slides on screen (most conference recordings do)

Skip video extraction when:

  • PPTX or PDF is available (those are higher quality sources)
  • The video is audio-only, a panel/interview with no slides, or a pure live-coding demo
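The rule above reduces to a small predicate. A minimal sketch, assuming each talk record is a dict keyed by the field names used in this reference; the function name `should_extract_from_video` and the `has_slides_on_screen` flag are hypothetical and not part of the toolkit:

```python
def should_extract_from_video(talk: dict, has_slides_on_screen: bool = True) -> bool:
    """Return True when video extraction is the only usable slide source.

    `talk` is assumed to be a dict with optional video_url, slides_url and
    pptx_path keys. `has_slides_on_screen` is a hypothetical manual flag for
    audio-only, panel, or pure live-coding recordings.
    """
    if talk.get("slides_url") or talk.get("pptx_path"):
        return False  # PPTX or PDF sources are higher quality
    return bool(talk.get("video_url")) and has_slides_on_screen
```

When it returns True, set slide_source: "video_extracted" on the talk.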

Pipeline Overview

video → download (yt-dlp, 720p) → extract frames (ffmpeg, 1 per 2s)
      → crop to slide region → deduplicate (perceptual hash)
      → save unique slides → combine into PDF

Step 1: Download Video

Download at 720p — enough resolution to read slide text, small enough to be fast.

yt-dlp -f "bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720][ext=mp4]/best[height<=720]" \
  --merge-output-format mp4 \
  -o "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
  "https://www.youtube.com/watch?v={youtube_id}"

For talks where 720p is unavailable, the height<=720 filter makes yt-dlp fall back to the best stream at or below 720p.

Step 2: Extract Frames

Extract one frame every 2 seconds. This captures slide transitions without generating excessive frames (~1500 frames for a 50-min talk).

mkdir -p "{vault_root}/slides-rebuild/{youtube_id}/frames"
ffmpeg -i "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
  -vf "fps=0.5" -q:v 2 \
  "{vault_root}/slides-rebuild/{youtube_id}/frames/frame_%05d.jpg"
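As a sanity check on the frame count, and for callers that drive ffmpeg from Python, here is a minimal sketch. Both helper names are hypothetical; the ffmpeg arguments mirror the command above.

```python
import math
from pathlib import Path


def expected_frame_count(duration_seconds: float, fps: float = 0.5) -> int:
    """Approximate number of frames the fps filter emits for a video."""
    return math.ceil(duration_seconds * fps)


def build_ffmpeg_command(video_path: Path, frames_dir: Path, fps: float = 0.5) -> list[str]:
    """Build the argument list for subprocess.run(); nothing is executed here."""
    return [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}", "-q:v", "2",
        str(frames_dir / "frame_%05d.jpg"),
    ]
```

A 50-minute talk is 3000 seconds, so `expected_frame_count(3000)` gives the 1500 frames quoted above. The command list can be passed to `subprocess.run(..., check=True)` once the frames directory exists.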

Step 3: Detect Slide Region and Crop

Conference videos have varying layouts — slides may occupy the full frame, or share space with a speaker camera (PiP), conference branding bars, or lower-third titles. The script auto-detects the slide region.
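This reference does not spell out how the script locates the slide region. One possible interpretation, shown here as a sketch and not as the script's implementation, is to measure how much each pixel column changes over a short run of frames: a speaker camera moves constantly, while a slide stays still between transitions. The function names, the pure-Python frame representation (lists of grayscale rows), and the `max_variance` default are all assumptions.

```python
from statistics import pvariance


def column_motion(frames: list[list[list[int]]]) -> list[float]:
    """Mean temporal variance of each pixel column across grayscale frames.

    Each frame is a list of rows; each row is a list of 0-255 intensities.
    """
    height, width = len(frames[0]), len(frames[0][0])
    scores = []
    for x in range(width):
        total = sum(pvariance([frame[y][x] for frame in frames]) for y in range(height))
        scores.append(total / height)
    return scores


def static_column_span(frames, max_variance: float = 50.0) -> tuple[int, int]:
    """Return (start, end) of the widest run of low-motion columns, end exclusive."""
    scores = column_motion(frames) + [float("inf")]  # sentinel closes the last run
    best, start = (0, 0), None
    for x, score in enumerate(scores):
        if score <= max_variance:
            if start is None:
                start = x
        elif start is not None:
            if x - start > best[1] - best[0]:
                best = (start, x)
            start = None
    return best
```

The sampled frames should come from a window without a slide transition, otherwise the slide area itself shows high variance.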

Step 4: Deduplicate by Perceptual Hash

Adjacent frames showing the same slide produce near-identical perceptual hashes. Group consecutive similar frames and keep one representative per group.
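The grouping step can be sketched as follows. This is an illustration under stated assumptions, not the script's code: hashes are plain integers, a frame joins the current group when its Hamming distance to the group's first frame is at most `hash_threshold`, and the first frame of each group is the representative. With the imagehash package, a hash can be computed with, for example, `imagehash.phash(Image.open(path))`, and subtracting two hash objects yields their Hamming distance.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def pick_representatives(hashes: list[int], hash_threshold: int = 8) -> list[int]:
    """Return the index of the first frame in each run of similar frames.

    Assumption: a frame joins the current group when its distance to the
    group's first frame is <= hash_threshold; otherwise it starts a new group.
    """
    kept: list[int] = []
    for index, value in enumerate(hashes):
        if not kept or hamming_distance(hashes[kept[-1]], value) > hash_threshold:
            kept.append(index)
    return kept
```

Under this assumption, raising `hash_threshold` merges more frames into each group and lowering it keeps more pages. Check the direction against the script before tuning.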

Step 5: Combine into PDF

Assemble unique slides into a single PDF for analysis, matching the format of Google Drive PDFs used elsewhere in the vault.
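A minimal sketch of the assembly step, assuming the unique slides were saved as zero-padded JPEG files in one directory. The function names and the file layout are hypothetical. The PDF is written with Pillow's multi-page save (`save_all` plus `append_images`).

```python
from pathlib import Path


def ordered_slide_paths(slides_dir: Path) -> list[Path]:
    """Return slide JPEGs sorted by filename so pages follow talk order."""
    return sorted(slides_dir.glob("*.jpg"))


def combine_into_pdf(slides_dir: Path, pdf_path: Path) -> int:
    """Write all slides into one multi-page PDF and return the page count."""
    from PIL import Image  # imported here so the helper above works without Pillow

    paths = ordered_slide_paths(slides_dir)
    if not paths:
        raise ValueError(f"no slide images found in {slides_dir}")
    pages = [Image.open(p).convert("RGB") for p in paths]
    pages[0].save(pdf_path, "PDF", save_all=True, append_images=pages[1:])
    return len(pages)
```

Sorting by filename only preserves talk order when the frame numbers are zero-padded, as they are with the frame_%05d.jpg pattern from Step 2.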

Usage

Run scripts/video-slide-extraction.py for each video after downloading it:

# Download video at 720p
yt-dlp -f "bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720][ext=mp4]/best[height<=720]" \
  --merge-output-format mp4 \
  -o "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
  "https://www.youtube.com/watch?v={youtube_id}"

# Extract slides
python3 scripts/video-slide-extraction.py \
  "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
  "{vault_root}/slides-rebuild/{youtube_id}" \
  "{youtube_id}"

# Copy PDF to slides dir, then delete the video
cp "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.pdf" "{vault_root}/slides/{youtube_id}.pdf"
rm "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4"

For batch downloads: scripts/batch-download-videos.sh <vault_root> ID1 ID2 ...

Update the talk's DB entry: slide_source: "video_extracted", slides_local_path: "slides/{youtube_id}.pdf", structured_data.video_extraction: <script output>.
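The update can be sketched as a small merge. This assumes the DB entry is a dict and that structured_data is a nested dict whose other keys must survive the update; both function names are hypothetical.

```python
def build_db_update(youtube_id: str, extraction_output: dict) -> dict:
    """Build the fields to merge into a talk's DB entry after extraction."""
    return {
        "slide_source": "video_extracted",
        "slides_local_path": f"slides/{youtube_id}.pdf",
        "structured_data": {"video_extraction": extraction_output},
    }


def apply_update(entry: dict, update: dict) -> dict:
    """Return a copy of entry with the update merged in.

    Existing structured_data keys are preserved; the original entry is
    left unmodified.
    """
    merged = {**entry, **update}
    merged["structured_data"] = {
        **entry.get("structured_data", {}),
        **update["structured_data"],
    }
    return merged
```

Here `extraction_output` stands for whatever the extraction script reports; its exact keys are not defined in this reference.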

What This Produces

| Output | Location | Purpose |
| --- | --- | --- |
| Slide PDF | slides/{youtube_id}.pdf | Visual analysis (same as Google Drive PDFs) |
| Extraction metadata | structured_data.video_extraction | Frame counts, region detection, threshold |
| Intermediate frames | Deleted after PDF generation | Saves disk space |

Layout Detection Heuristics

Common conference video layouts and how the script handles them:

| Layout | Example | Slide Region |
| --- | --- | --- |
| Full-frame slides | Most Devoxx, JFokus | None (full frame) |
| Slides + speaker PiP (corner) | DevOpsDays, meetups | 70-85% left/center |
| Slides + speaker sidebar | QCon, some webinars | 60-75% left |
| Speaker + slides behind | TED-style keynotes | Variable, may fail |
| Split screen 50/50 | Co-presented live coding | 50% left or right |

The detect_slide_region() function handles the first three layouts automatically via variance analysis. For the speaker-in-front layout, detection is variable and may fail. For split-screen formats, a manual slide_region override may be needed; pass it as a parameter.
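This reference does not specify the format of slide_region. As an illustration only, the sketch below assumes it is a (left, top, right, bottom) tuple of fractions of the frame size and converts it to a pixel box in the (left, upper, right, lower) order that Pillow's `Image.crop()` accepts. The function name `region_to_box` is hypothetical.

```python
def region_to_box(
    slide_region: tuple[float, float, float, float],
    frame_size: tuple[int, int],
) -> tuple[int, int, int, int]:
    """Convert a fractional (left, top, right, bottom) region to a pixel box.

    Assumed format: each value is a fraction of the frame width or height.
    """
    left, top, right, bottom = slide_region
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError(f"invalid slide_region: {slide_region}")
    width, height = frame_size
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )
```

For a 50/50 split screen with slides on the left of a 1280x720 frame, `region_to_box((0.0, 0.0, 0.5, 1.0), (1280, 720))` yields the left half of the frame.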

Tuning the Hash Threshold

The hash_threshold parameter controls deduplication aggressiveness:

| Value | Behavior | Best For |
| --- | --- | --- |
| 4-6 | Aggressive: merges similar slides | Dense meme-heavy talks where each slide is visually distinct |
| 8-10 | Moderate: good default | Most conference talks (fullscreen slide recordings) |
| 12-16 | Loose: keeps more variation | Progressive-reveal-heavy talks (table rows appearing one-by-one) |
| 14-18 | Very loose | Wide-angle room recordings where speaker movement dominates |

For talks in the speaker's mode (a) polemic style with progressive reveals, use threshold 12. For demo-heavy or minimal-slide talks, use 8.

Wide-angle room recordings (meetups, DevOpsDays, early-era conference recordings) where the camera captures the full stage — speaker walking + slides projected behind — defeat the default dedup. Every frame looks different because the speaker moved. Options:

  1. Increase threshold to 14-18
  2. Manually specify slide_region to crop out the speaker and isolate the screen
  3. Accept the bloated PDF (800-1500 pages) and have the analysis subagent sample pages at intervals rather than read every page
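For the third option, evenly spaced sampling can be sketched like this. The function name and the `max_pages` budget are hypothetical; how the subagent actually selects pages is not specified here.

```python
def sample_page_indices(total_pages: int, max_pages: int) -> list[int]:
    """Pick up to max_pages zero-based page indices spread evenly across a PDF."""
    if total_pages <= 0 or max_pages <= 0:
        return []
    if total_pages <= max_pages:
        return list(range(total_pages))  # small PDFs are read in full
    step = total_pages / max_pages
    return [int(i * step) for i in range(max_pages)]
```

With a 1200-page PDF and a budget of 100 pages, this reads every twelfth page, starting at the first.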

Integration with the Skill Workflow

In Step 3 of the skill (per-talk subagent):

if slide_source == "video_extracted":
    1. Download video: yt-dlp -f "best[height<=720]" ...
    2. Run extract_slides_from_video()
    3. Copy PDF to slides/{youtube_id}.pdf
    4. Read the PDF for visual analysis (dimension 13)
    5. Delete the video file (keep only the PDF)
    6. Store extraction metadata in structured_data

The resulting PDF is analyzed exactly like a Google Drive PDF — the subagent reads it for slide design patterns (backgrounds, typography, shapes, memes, footer, etc.) using the same dimension 13 analysis.

Cleanup

After extraction is complete and the PDF is saved:

  • Delete the downloaded MP4 video (typically 100-500 MB)
  • Confirm the frames directory is gone (the script deletes it after generating the PDF)
  • Keep only the PDF in slides/{youtube_id}.pdf

For a full 83-talk batch, the video downloads would consume ~20-40 GB temporarily but only ~1-2 GB of PDFs remain after cleanup.
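The cleanup can be sketched as a guarded helper that refuses to delete anything until the PDF exists. The function name is hypothetical; the paths follow the slides-rebuild/{youtube_id} layout used above.

```python
import shutil
from pathlib import Path


def clean_up_extraction(work_dir: Path, youtube_id: str, saved_pdf: Path) -> list[Path]:
    """Delete the downloaded video and any leftover frames; return what was removed.

    Raises FileNotFoundError if the PDF has not been saved yet, so a failed
    extraction never loses the source video.
    """
    if not saved_pdf.is_file():
        raise FileNotFoundError(f"PDF not found, refusing to clean up: {saved_pdf}")
    removed = []
    video = work_dir / f"{youtube_id}.mp4"
    if video.exists():
        video.unlink()
        removed.append(video)
    frames_dir = work_dir / "frames"
    if frames_dir.is_dir():  # normally already removed by the script
        shutil.rmtree(frames_dir)
        removed.append(frames_dir)
    return removed
```

Running this after each talk, rather than once per batch, keeps peak disk usage close to the size of a single video.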

Limitations

  • Speaker overlay: If the speaker's face overlaps slides (green-screen overlay style), frame extraction still works but the perceptual hash may treat the same slide with different speaker positions as different slides. Increase hash_threshold to compensate.
  • Animated slides: Animations within a single slide produce multiple frames. The dedup catches most of these, but fast animations at exactly the 2-second boundary may produce duplicates. Not a significant issue in practice.
  • Progressive reveals: The speaker's talks frequently use progressive reveals (table rows appearing one-by-one). These are rhetorically distinct slides and should be kept as separate pages. Thresholds in the recommended 8-12 range handle this correctly because each reveal step looks sufficiently different.
  • Low-quality uploads: Some older conference videos are 360p or lower. Frame extraction still works but slide text may be unreadable. Flag these with video_quality: "low" in structured_data.
  • No audio sync: Frame timestamps are not correlated with transcript timestamps. The subagent must use content matching (reading slide text and matching to transcript passages) rather than time alignment.
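Content matching for the last limitation can be sketched with a simple token-overlap score. This assumes the slide text has already been read from the PDF page and the transcript has been split into passages; the function names and the choice of Jaccard similarity are illustrative, not the subagent's documented method.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of three or more characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= 3}


def best_passage(slide_text: str, passages: list[str]) -> tuple[int, float]:
    """Return (index, score) of the passage with the highest Jaccard overlap.

    Returns (-1, 0.0) when no passage shares any token with the slide.
    """
    slide_tokens = tokens(slide_text)
    best_index, best_score = -1, 0.0
    for index, passage in enumerate(passages):
        passage_tokens = tokens(passage)
        union = slide_tokens | passage_tokens
        score = len(slide_tokens & passage_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

Jaccard overlap penalizes long passages, so splitting the transcript into passages of similar length gives more comparable scores.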
