Two-skill presentation system: analyze your speaking style into a rhetoric knowledge vault, then create new presentations that match your documented patterns. Includes an 88-entry Presentation Patterns taxonomy for scoring, brainstorming, and go-live preparation.
Quality: 96% · Impact: 96% · 1.57x average score across 15 eval scenarios
Extract slide images from conference talk videos when no PPTX or PDF is available.
This is the fourth slide acquisition path — used when a talk has video_url but neither
slides_url nor pptx_path.
Requirements:
- yt-dlp (video download)
- ffmpeg (frame extraction)
- imagehash, Pillow (perceptual deduplication)

Install Python dependencies:

"{python_path}" -m pip install imagehash Pillow

Set slide_source: "video_extracted" when:
- video_url but no slides_url and no pptx_path

Skip video extraction when:
- slides_url or pptx_path is available (those higher-fidelity paths take precedence)
video → download (yt-dlp, 720p) → extract frames (ffmpeg, 1 per 2s)
→ crop to slide region → deduplicate (perceptual hash)
→ save unique slides → combine into PDF

Download at 720p — enough resolution to read slide text, small enough to be fast.
yt-dlp -f "bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720][ext=mp4]/best[height<=720]" \
--merge-output-format mp4 \
-o "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
"https://www.youtube.com/watch?v={youtube_id}"For talks where 720p is unavailable, yt-dlp will fall back to the best available.
Extract one frame every 2 seconds. This captures slide transitions without generating excessive frames (~1500 frames for a 50-min talk).
mkdir -p "{vault_root}/slides-rebuild/{youtube_id}/frames"
ffmpeg -i "{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4" \
-vf "fps=0.5" -q:v 2 \
"{vault_root}/slides-rebuild/{youtube_id}/frames/frame_%05d.jpg"Conference videos have varying layouts — slides may occupy the full frame, or share space with a speaker camera (PiP), conference branding bars, or lower-third titles. The script auto-detects the slide region.
Adjacent frames showing the same slide produce near-identical perceptual hashes. Group consecutive similar frames and keep one representative per group.
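The keep-first-of-each-run logic can be sketched with integer stand-ins for the perceptual hashes (the values and the helper below are hypothetical; the real script compares imagehash.phash results):

```python
def keep_representatives(hashes, threshold=8):
    """Keep the first frame of each run of near-identical hashes.

    `hashes` are integer stand-ins for perceptual hashes; distance()
    models imagehash's Hamming distance between two hashes.
    """
    def distance(a, b):
        return bin(a ^ b).count("1")  # Hamming distance on bit patterns

    kept = []
    prev = None
    for i, h in enumerate(hashes):
        # A frame starts a new slide when it differs enough from the
        # last kept frame
        if prev is None or distance(h, prev) > threshold:
            kept.append(i)
            prev = h
    return kept

# Three runs of near-identical hashes -> three representative frames
frames = [0b1111, 0b1110, 0b1111, 0xFF00, 0xFF01, 0x00FF]
print(keep_representatives(frames, threshold=3))  # → [0, 3, 5]
```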
Assemble unique slides into a single PDF for analysis, matching the format of Google Drive PDFs used elsewhere in the vault.
Run this for each video. It handles steps 2-5 after the video is downloaded.
import os
import sys
import glob
import json
from pathlib import Path

# Check dependencies
try:
    import imagehash
    from PIL import Image
except ImportError:
    print("ERROR: Install dependencies: pip install imagehash Pillow")
    sys.exit(1)
def extract_frames(video_path, frames_dir, fps=0.5):
    """Extract frames from video at specified fps."""
    os.makedirs(frames_dir, exist_ok=True)
    # -y and -loglevel are global options, so they go before the output file
    cmd = (
        f'ffmpeg -y -loglevel warning -i "{video_path}" -vf "fps={fps}" -q:v 2 '
        f'"{frames_dir}/frame_%05d.jpg"'
    )
    ret = os.system(cmd)
    if ret != 0:
        raise RuntimeError(f"ffmpeg failed with code {ret}")
    frames = sorted(glob.glob(f"{frames_dir}/frame_*.jpg"))
    print(f"  Extracted {len(frames)} frames")
    return frames
def detect_slide_region(frames, sample_size=10):
    """Auto-detect the slide region by analyzing variance across sample frames.

    Conference videos typically have a static border (conference branding,
    speaker PiP in a fixed corner) and a dynamic center (the slides).
    We find the bounding box of the high-variance region.

    Returns (left, upper, right, lower) as fractions of image dimensions,
    or None if slides appear to be full-frame.
    """
    import numpy as np

    if len(frames) < sample_size * 2:
        return None  # Too few frames, assume full-frame

    # Sample evenly spaced frame pairs
    step = max(1, len(frames) // sample_size)
    diffs = []
    for i in range(0, len(frames) - step, step):
        img1 = np.array(Image.open(frames[i]).convert('L').resize((320, 180)))
        img2 = np.array(Image.open(frames[i + step]).convert('L').resize((320, 180)))
        diff = np.abs(img1.astype(float) - img2.astype(float))
        diffs.append(diff)

    # Average difference map — high values = dynamic (slide content changes)
    avg_diff = np.mean(diffs, axis=0)

    # Threshold: regions changing more than the 60th percentile are "slide area"
    threshold = np.percentile(avg_diff, 60)
    mask = avg_diff > threshold

    # Find bounding box of the active region
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any() or not cols.any():
        return None  # No clear region detected
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]
    h, w = avg_diff.shape  # 180, 320

    # Convert to fractions with a small margin
    margin = 0.02
    region = (
        max(0, cmin / w - margin),
        max(0, rmin / h - margin),
        min(1, (cmax + 1) / w + margin),
        min(1, (rmax + 1) / h + margin),
    )

    # If the region covers >90% of the frame, it's effectively full-frame
    area = (region[2] - region[0]) * (region[3] - region[1])
    if area > 0.9:
        return None

    print(f"  Detected slide region: {region[0]:.0%}-{region[2]:.0%} horizontal, "
          f"{region[1]:.0%}-{region[3]:.0%} vertical ({area:.0%} of frame)")
    return region
def crop_frame(img, region):
    """Crop an image to the detected slide region."""
    if region is None:
        return img
    w, h = img.size
    box = (
        int(region[0] * w),
        int(region[1] * h),
        int(region[2] * w),
        int(region[3] * h),
    )
    return img.crop(box)
def deduplicate_frames(frames, slide_region=None, hash_threshold=8):
    """Deduplicate consecutive similar frames using perceptual hashing.

    A frame is kept as a new slide only when its hash distance from the
    previous kept frame exceeds hash_threshold, so lower values keep more
    slides and higher values merge more aggressively:
      - 4-6: sensitive, keeps more variation (use for progressive-reveal-heavy talks)
      - 8-12: moderate, good default for most talks
      - 14+: aggressive, may merge progressive reveals into one slide

    Returns a list of (frame_path, frame_index) for unique slides.
    """
    unique_slides = []
    prev_hash = None
    for i, frame_path in enumerate(frames):
        img = Image.open(frame_path)
        # Hash the CROPPED region (slide only, not speaker PiP)
        cropped = crop_frame(img, slide_region)
        h = imagehash.phash(cropped, hash_size=16)
        if prev_hash is None or abs(h - prev_hash) > hash_threshold:
            unique_slides.append((frame_path, i))
            prev_hash = h
    print(f"  Deduplicated: {len(frames)} frames -> {len(unique_slides)} unique slides")
    return unique_slides
def combine_to_pdf(unique_slides, output_pdf, slide_region=None):
    """Combine unique slide frames into a PDF.

    Saves FULL (uncropped) frames — the crop region was only used for
    hash comparison. The full frame preserves speaker PiP context, which
    can be useful for analyzing co-presentation dynamics.
    """
    images = []
    for frame_path, _ in unique_slides:
        img = Image.open(frame_path).convert('RGB')
        images.append(img)
    if not images:
        print("  WARNING: No unique slides found")
        return None
    images[0].save(output_pdf, save_all=True, append_images=images[1:])
    size_mb = os.path.getsize(output_pdf) / (1024 * 1024)
    print(f"  Saved PDF: {output_pdf} ({len(images)} pages, {size_mb:.1f} MB)")
    return output_pdf
def extract_slides_from_video(video_path, output_dir, youtube_id,
                              fps=0.5, hash_threshold=8, slide_region=None):
    """Full pipeline: frames -> detect region -> dedup -> PDF.

    Args:
        video_path: Path to downloaded MP4
        output_dir: Directory for intermediate files and output PDF
        youtube_id: YouTube video ID (used for naming)
        fps: Frames per second to extract (0.5 = 1 frame per 2 seconds)
        hash_threshold: Perceptual hash distance threshold for dedup (8-12 recommended)
        slide_region: Optional manual (left, upper, right, lower) fractional
            region override; when None, the region is auto-detected

    Returns:
        dict with extraction results for structured_data
    """
    frames_dir = os.path.join(output_dir, "frames")
    output_pdf = os.path.join(output_dir, f"{youtube_id}.pdf")
    print(f"Extracting slides from {youtube_id}...")

    # Step 2: Extract frames
    frames = extract_frames(video_path, frames_dir, fps=fps)
    if not frames:
        return {"error": "No frames extracted", "slide_count": 0}

    # Step 3: Detect slide region (unless a manual override was given)
    if slide_region is None:
        slide_region = detect_slide_region(frames)

    # Step 4: Deduplicate
    unique_slides = deduplicate_frames(frames, slide_region, hash_threshold)

    # Step 5: Combine into PDF
    pdf_path = combine_to_pdf(unique_slides, output_pdf, slide_region)

    # Cleanup: remove frame JPEGs to save space (keep the PDF)
    for f in frames:
        os.remove(f)
    try:
        os.rmdir(frames_dir)
    except OSError:
        pass

    result = {
        "slide_source": "video_extracted",
        "total_frames_extracted": len(frames),
        "unique_slides_count": len(unique_slides),
        "hash_threshold_used": hash_threshold,
        "slide_region_detected": slide_region is not None,
        "slide_region": slide_region,
        "output_pdf": pdf_path,
        "fps_used": fps,
    }
    print(f"  Done: {len(unique_slides)} unique slides extracted")
    return result
# Usage from the rhetoric-knowledge-vault skill:
#
# vault_root = "/path/to/rhetoric-knowledge-vault"
# youtube_id = "aUEyM59Ob2k"
# video_path = f"{vault_root}/slides-rebuild/{youtube_id}/{youtube_id}.mp4"
# output_dir = f"{vault_root}/slides-rebuild/{youtube_id}"
#
# result = extract_slides_from_video(video_path, output_dir, youtube_id)
#
# # Copy the PDF to the slides directory for analysis
# import shutil
# slides_pdf = f"{vault_root}/slides/{youtube_id}.pdf"
# shutil.copy2(result["output_pdf"], slides_pdf)
#
# # Update the talk entry in the tracking DB:
# talk["slide_source"] = "video_extracted"
# talk["slides_local_path"] = slides_pdf
# talk["structured_data"]["video_extraction"] = result| Output | Location | Purpose |
|---|---|---|
| Slide PDF | slides/{youtube_id}.pdf | Visual analysis (same as Google Drive PDFs) |
| Extraction metadata | structured_data.video_extraction | Frame counts, region detection, threshold |
| Intermediate frames | Deleted after PDF generation | Saves disk space |
Common conference video layouts and how the script handles them:
| Layout | Example | Slide Region |
|---|---|---|
| Full-frame slides | Most Devoxx, JFokus | None (full frame) |
| Slides + speaker PiP (corner) | DevOpsDays, meetups | 70-85% left/center |
| Slides + speaker sidebar | QCon, some webinars | 60-75% left |
| Speaker + slides behind | TED-style keynotes | Variable, may fail |
| Split screen 50/50 | Co-presented live coding | 50% left or right |
The detect_slide_region() function handles the first three automatically via
variance analysis. For split-screen formats, manual slide_region override may
be needed — pass it as a parameter.
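For the split-screen case, a hand-specified fractional region converts to a pixel crop box the same way crop_frame() does. A minimal sketch (the frame size and the 50/50 left-slides region below are illustrative):

```python
def region_to_box(region, width, height):
    """Convert a fractional (left, upper, right, lower) region to a
    pixel box suitable for PIL's Image.crop()."""
    left, upper, right, lower = region
    return (int(left * width), int(upper * height),
            int(right * width), int(lower * height))

# Slides occupy the left half of a 1280x720 frame
manual_region = (0.0, 0.0, 0.5, 1.0)
print(region_to_box(manual_region, 1280, 720))  # → (0, 0, 640, 720)
```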
The hash_threshold parameter controls deduplication aggressiveness. A frame counts as a new slide only when its hash distance from the previous kept slide exceeds the threshold, so higher values merge more:

| Value | Behavior | Best For |
|---|---|---|
| 4-6 | Sensitive: keeps more variation | Progressive-reveal-heavy talks (table rows appearing one-by-one) |
| 8-10 | Moderate: good default | Most conference talks |
| 12-16 | Aggressive: merges similar slides | Dense meme-heavy talks where each slide is visually distinct |

For talks in the speaker's mode (a) polemic style with progressive reveals, use a lower threshold (4-6) to keep each reveal step. For demo-heavy or minimal-slide talks, the default of 8 is fine.
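Because a frame only counts as new when its distance to the last kept slide exceeds the threshold, raising the threshold generally reduces the slide count. A sketch of this effect with plain integers standing in for perceptual hashes (the values below are hypothetical):

```python
def count_slides(hash_values, threshold):
    """Count kept slides using the keep-if-distance-exceeds-threshold
    rule, with plain integers standing in for perceptual hashes."""
    kept = 0
    prev = None
    for h in hash_values:
        if prev is None or abs(h - prev) > threshold:
            kept += 1
            prev = h
    return kept

# Small steps model progressive reveals, big jumps model slide changes
values = [0, 5, 10, 15, 100, 112, 200]
print([count_slides(values, t) for t in (4, 8, 14)])  # → [7, 5, 4]
```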
In Step 3 of the skill (per-talk subagent):
if slide_source == "video_extracted":
1. Download video: yt-dlp -f "best[height<=720]" ...
2. Run extract_slides_from_video()
3. Copy PDF to slides/{youtube_id}.pdf
4. Read the PDF for visual analysis (dimension 13)
5. Delete the video file (keep only the PDF)
6. Store extraction metadata in structured_data

The resulting PDF is analyzed exactly like a Google Drive PDF — the subagent reads it for slide design patterns (backgrounds, typography, shapes, memes, footer, etc.) using the same dimension 13 analysis.
After extraction is complete, the PDF is saved to:

slides/{youtube_id}.pdf

For a full 83-talk batch, the video downloads would consume ~20-40 GB temporarily, but only ~1-2 GB of PDFs remain after cleanup.

Record video_quality: "low" in structured_data.

Install with Tessl CLI:
npx tessl i jbaruch/speaker-toolkit@0.6.2