Evaluate solutions through multi-round debate between independent judges until consensus
Score: 48

Quality: 36%. Does it follow best practices?
Impact: Pending. No eval scenarios have been run.
Status: Risky. Do not use without reviewing.

Optimize this skill with Tessl:

npx tessl skill review --optimize ./plugins/sadd/skills/judge-with-debate/SKILL.md

Quality
Discovery
17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description captures a unique evaluation methodology (multi-round debate with judges reaching consensus) but fails to provide natural trigger terms users would actually use, lacks a 'Use when...' clause, and doesn't specify what kinds of solutions or outputs are involved. It reads more like an internal process description than a skill selection guide.
Suggestions
Add a 'Use when...' clause with natural trigger terms like 'evaluate my approach', 'compare solutions', 'critique this design', 'get multiple perspectives on', or 'review pros and cons' (see the example sketch after these suggestions).
Specify what types of solutions or artifacts this evaluates (e.g., code architectures, design proposals, technical approaches) and what output it produces (e.g., a consensus recommendation with reasoning).
Include user-facing language variations such as 'weigh tradeoffs', 'devil's advocate', 'stress-test an idea', or 'adversarial review' to improve trigger term coverage.
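Taken together, the revised frontmatter might read something like the sketch below. This is a hypothetical rewrite, not the skill's actual frontmatter; the skill name is taken from the path above, and the wording is illustrative only.

```yaml
# Hypothetical frontmatter sketch; wording is illustrative, not the skill's own
name: judge-with-debate
description: >
  Evaluate technical solutions (code architectures, design proposals,
  implementation approaches) through multi-round debate between independent
  judges until they reach consensus, producing a consensus recommendation
  with reasoning. Use when the user asks to evaluate an approach, compare
  solutions, critique a design, weigh tradeoffs, stress-test an idea, or
  wants an adversarial review with multiple perspectives.
```

A description in this shape names the artifact types, the output, and the phrases users actually say, addressing all three suggestions at once.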
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (solution evaluation) and a key mechanism (multi-round debate between independent judges until consensus), but doesn't list multiple concrete actions or specify what kinds of solutions, what the judges do, or what outputs are produced. | 2 / 3 |
| Completeness | It partially addresses 'what' (evaluate solutions through debate) but completely lacks a 'when' clause or any explicit trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' is also underspecified, so this scores a 1. | 1 / 3 |
| Trigger Term Quality | The terms 'multi-round debate', 'independent judges', and 'consensus' are technical/methodological jargon that users would rarely naturally say. Users are more likely to say things like 'evaluate my idea', 'compare options', 'critique this solution', or 'get multiple perspectives'. | 1 / 3 |
| Distinctiveness / Conflict Risk | The 'multi-round debate between independent judges' mechanism is somewhat distinctive, but 'evaluate solutions' is broad enough to overlap with general code review, analysis, or comparison skills. The lack of specificity about what types of solutions increases conflict risk. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
Implementation
55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill provides a thorough, actionable multi-agent debate evaluation framework with excellent workflow clarity and concrete prompt templates. However, it is extremely verbose with significant repetition—the same concepts are restated across the process description, orchestration instructions, best practices, and example sections. The monolithic structure with no progressive disclosure makes it a poor fit for token-efficient skill design.
Suggestions
Reduce content by at least 50%: eliminate the redundant 'Orchestration Instructions' section (Steps 1-7) which restates the Phases, merge 'Best Practices' into inline notes within the relevant phases, and remove the 'Context' section explaining why debate is useful.
Split into multiple files: move prompt templates to a PROMPTS.md file, the worked example to EXAMPLE.md, and keep SKILL.md as a concise overview with the ASCII diagram, phase summaries, and file references (see the layout sketch after these suggestions).
Remove explanatory content Claude already knows: the entire <context> section explaining benefits of multi-agent debate, explanations like 'Independence in initial analysis prevents groupthink', and 'Key principle' callouts are unnecessary padding.
Consolidate the initial judge and debate judge prompt templates into a single parameterized template with notes on what changes between phases, rather than repeating nearly identical templates.
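The suggested split might look like the following. The file names come from the suggestions above; the tree layout itself is only a sketch:

```
judge-with-debate/
├── SKILL.md     # concise overview: ASCII diagram, phase summaries, references to the files below
├── PROMPTS.md   # one parameterized judge prompt template, with notes on what changes per phase
└── EXAMPLE.md   # the worked example, moved out of the main file
```

This keeps the always-loaded SKILL.md short, while the agent pulls in the templates and the worked example only when a phase actually needs them.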
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~300+ lines. Massive amounts of repetition (the orchestration steps essentially restate the phases, the best practices restate what was already said, the example walks through the entire process again). Claude doesn't need explanations of what debate is, what consensus means, or why multiple perspectives reduce bias. The prompt templates are repeated with minor variations for each phase. | 1 / 3 |
| Actionability | Despite verbosity, the skill provides fully concrete, executable guidance: specific prompt templates with placeholders, exact file naming conventions, precise consensus thresholds (0.5 points overall, 1 point per criterion; sketched after this table), Task tool dispatch patterns, and a detailed worked example showing the full flow with realistic scores and debate arguments. | 3 / 3 |
| Workflow Clarity | The multi-step process is clearly sequenced with explicit phases (0 → 0.5 → 1 → 2 → 3), decision points (consensus check with specific numeric thresholds), feedback loops (debate rounds with re-validation), and a clear ASCII diagram. The orchestration instructions provide an unambiguous step-by-step algorithm with branching logic. | 3 / 3 |
| Progressive Disclosure | Monolithic wall of text with no references to external files. Everything is inline in a single massive document. The prompt templates, best practices, example usage, and orchestration logic could all be split into separate referenced files. No bundle files are provided, and the content doesn't attempt any structural separation. | 1 / 3 |
| Total | | 8 / 12 (Passed) |
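For concreteness, the consensus thresholds the table cites (0.5 points overall, 1 point per criterion) reduce to a simple numeric check. The sketch below is a hypothetical reading of that rule for two judges, not code from the skill itself; the function and parameter names are invented here.

```python
def reached_consensus(
    overall_a: float, overall_b: float,
    criteria_a: dict[str, float], criteria_b: dict[str, float],
) -> bool:
    """Consensus per the cited thresholds: overall scores within 0.5 points
    and every individual criterion within 1 point."""
    if abs(overall_a - overall_b) > 0.5:
        return False
    return all(abs(criteria_a[k] - criteria_b[k]) <= 1.0 for k in criteria_a)

# Example: overall 7.2 vs 7.6 (gap 0.4) with all criterion gaps <= 1 -> consensus;
# overall 6.8 vs 7.5 (gap 0.7) -> another debate round.
```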
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |