jbaruch/speaker-toolkit

Four-skill presentation system: ingest talks into a rhetoric vault, run interactive clarification, generate a speaker profile, then create new presentations that match your documented patterns. Includes an 88-entry Presentation Patterns taxonomy for scoring, brainstorming, and go-live preparation.

1.21x

Quality

93%

Does it follow best practices?

Impact

97%

1.21x

Average score across 30 eval scenarios

Securityby

Advisory

Suggest reviewing before use

Video Slide Extraction Quality Diagnostics

Name: jbaruch/speaker-toolkit
Rating: 96.6 (1 reviews)
Author: jbaruch

Problem/Feature Description

Six conference talk recordings have been processed through the video-slide-extraction pipeline. The extraction results vary widely — some are clean, others show signs of wide-angle recording dedup failure, missing transcripts, wrong languages, wrong speakers, or Whisper hallucination.

Analyze the extraction results and produce a structured diagnostics report for each case.

Setup

Download the fixed extraction results:

curl -sLO https://github.com/jbaruch/speaker-toolkit/raw/main/eval-resources/scenario-13/extraction_results.json

Task

Analyze extraction_results.json (6 recordings) and produce diagnostics_report.json containing a per-recording diagnostic entry. Each entry must include:

Recording type classification — classify each recording into one of at least 3 categories (e.g., fullscreen slides, picture-in-picture, wide-angle room) based on extraction metrics (frame-to-unique ratio, slide region detection)
Dedup quality assessment — for case_wide_angle (1200 frames → 900 "unique"), detect that the 1.33:1 ratio indicates dedup failure from speaker movement, and recommend:
- Increasing hash threshold to 14–18 (above the default 8)
- Using manual slide region cropping to isolate the projected screen
Transcript quality assessment for each case:
- case_no_transcript: flag as missing, recommend Whisper fallback
- case_wrong_language: detect Russian transcript when English was expected — but do NOT flag as an error if the talk was actually delivered in Russian (the speaker is bilingual)
- case_whisper_hallucination: detect the 0.45 repetition ratio and the repeated "Thank you for watching" loop as hallucination artifacts
- Track transcript source (youtube_auto_captions vs whisper_transcription) as a field
Speaker identity verification — for case_wrong_speaker, flag that detected speaker "James Gosling" doesn't match expected "Baruch Sadogursky" (possible wrong recording from a playlist)
Clean pass-through — for case_clean (50 frames → 45 unique, valid transcript, correct speaker), produce NO warnings. Classify as healthy.

Output Specification

Produce diagnostics_report.json with this structure per case:

recording_type: classification string
dedup_quality: assessment with optional recommended_threshold
transcript_quality: assessment with source, status, and optional warnings
speaker_match: boolean + details
recommendations: list of actionable strings (empty for clean recordings)

Also produce recommendations_log.txt — a human-readable summary of all flagged issues and recommendations, suitable for a production operator to review.

evals

scenario-1

scenario-2

scenario-3

scenario-4

scenario-5

scenario-6

scenario-7

scenario-8

scenario-9

scenario-10

scenario-11

scenario-12

scenario-13

scenario-14

scenario-15

scenario-16

scenario-17

scenario-18

scenario-19

scenario-20

scenario-21

scenario-22

scenario-23

scenario-24

scenario-25

scenario-26

scenario-27

scenario-28

scenario-29

scenario-30

criteria.json

task.md

rules

skills

README.md

tile.json