Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
43
54%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Quality
Discovery
14%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description only specifies when to use the skill but entirely omits what the skill does, making it fundamentally incomplete. The trigger terms are reasonable but not comprehensive, and the extreme breadth ('any bug, test failure, or unexpected behavior') creates high conflict risk with other debugging or testing skills.
Suggestions
Add concrete actions describing what the skill does, e.g., 'Systematically diagnoses root causes by analyzing error messages, tracing code paths, and reproducing failures' or similar.
Expand trigger terms to include common variations like 'error', 'crash', 'broken', 'debugging', 'stack trace', 'exception', 'failing tests'.
Narrow the scope or add distinguishing details to reduce overlap with other debugging/testing skills, e.g., specifying a particular methodology or approach this skill uses.
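The trigger-term gap described in these suggestions can be checked mechanically. The following is a minimal sketch, not part of the reviewed skill or of the review tooling: it runs a plain case-insensitive substring match of the suggested terms against the current description. The function name `missing_trigger_terms` is hypothetical, and substring matching is a deliberately crude stand-in for however an agent actually matches skills.

```python
# Description text as quoted at the top of this review.
DESCRIPTION = (
    "Use when encountering any bug, test failure, or unexpected behavior, "
    "before proposing fixes"
)

# Trigger terms proposed in the suggestion list above.
SUGGESTED_TERMS = [
    "error",
    "crash",
    "broken",
    "debugging",
    "stack trace",
    "exception",
    "failing tests",
]


def missing_trigger_terms(description, terms):
    """Return the terms that do not occur in the description.

    Hypothetical helper: matching is a case-insensitive substring test,
    so 'bug' in the description does not count as 'debugging'.
    """
    lowered = description.lower()
    return [term for term in terms if term.lower() not in lowered]


if __name__ == "__main__":
    for term in missing_trigger_terms(DESCRIPTION, SUGGESTED_TERMS):
        print(f"missing: {term}")
```

Run against the current description, all seven suggested terms are reported as missing, which matches the reviewer's point that only 'bug', 'test failure', and 'unexpected behavior' are covered.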
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description does not list any concrete actions or capabilities. It says nothing about what the skill actually does: no verbs like 'analyze', 'diagnose', 'trace', etc. It only describes when to use it. | 1 / 3 |
| Completeness | The description answers 'when' (before proposing fixes for bugs/failures) but completely omits 'what': there is no indication of what the skill actually does. This is the inverse of the typical problem but equally incomplete. | 1 / 3 |
| Trigger Term Quality | It includes some natural trigger terms like 'bug', 'test failure', and 'unexpected behavior' that users might naturally say. However, it misses common variations like 'error', 'crash', 'broken', 'failing tests', 'debugging', 'stack trace', etc. | 2 / 3 |
| Distinctiveness / Conflict Risk | The description is extremely broad: 'any bug, test failure, or unexpected behavior' could overlap with virtually any debugging, testing, code review, or troubleshooting skill. Without specifying what it does, it's impossible to distinguish from other skills. | 1 / 3 |
| Total | | 5 / 12 Passed |
Implementation
47%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill has excellent workflow structure with clear phases, gates, and escalation paths, making it strong on workflow clarity. However, it is significantly over-verbose, repeating the same core principle (investigate before fixing) in numerous forms across red flags, rationalizations, excuses tables, and motivational framing. The actionability is moderate — the multi-component diagnostic example is good, but most content is philosophical guidance rather than executable techniques.
Suggestions
Cut content by 50%+: Remove the 'Common Rationalizations' table, 'Red Flags' list, and 'Human Partner's Signals' section — these all restate the same principle Claude can internalize from the four-phase process alone.
Add concrete executable examples to Phase 1 (e.g., specific git commands for checking recent changes, specific debugging commands) and Phase 3 (e.g., a concrete example of forming and testing a hypothesis with actual code).
Inline a brief version of the backward tracing technique from root-cause-tracing.md rather than just deferring to it, since it's central to Phase 1 step 5.
Remove the 'Real-World Impact' statistics section at the end — these are unverifiable claims that waste tokens without adding actionable guidance.
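As an illustration of the kind of concrete Phase 1 example the second suggestion asks for, here is a minimal sketch for checking recent changes with git. It is an assumption about what such an example could look like, not content taken from the reviewed skill. The helper names `recent_change_commands` and `run_all` are hypothetical; the git subcommands and flags themselves (`status --short`, `log --oneline -n`, `diff --stat`) are standard.

```python
import subprocess


def recent_change_commands(path=".", commits=10):
    """Build git commands that answer 'what changed recently?'.

    Hypothetical helper: returns argument lists only and runs nothing.
    """
    return [
        # Uncommitted and untracked files in the working tree.
        ["git", "status", "--short"],
        # The last `commits` commits that touched `path`.
        ["git", "log", "--oneline", "-n", str(commits), "--", path],
        # Summary of uncommitted changes under `path` relative to HEAD.
        ["git", "diff", "--stat", "HEAD", "--", path],
    ]


def run_all(commands, cwd="."):
    """Run each command in `cwd` and collect its output, keyed by command line.

    On a non-zero exit status the captured stderr is stored instead, so a
    failure (for example, not being inside a repository) is still visible.
    """
    outputs = {}
    for cmd in commands:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, check=False
        )
        key = " ".join(cmd)
        outputs[key] = result.stdout if result.returncode == 0 else result.stderr
    return outputs
```

Calling `run_all(recent_change_commands("src", 5))` from inside a repository would return the three outputs in one dictionary, giving the investigation a record of recent changes before any fix is proposed. Building the commands separately from running them keeps the list easy to inspect or extend.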
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~250+ lines. Extensive sections on rationalizations, red flags, 'your human partner's signals,' and motivational content ('Random fixes waste time') are things Claude already knows or doesn't need repeated in multiple forms. The same core message (don't guess, find root cause first) is restated in at least 5 different ways across multiple sections. The 'When to Use' section with its 'Don't skip when' and 'Use this ESPECIALLY when' subsections is largely unnecessary padding. | 1 / 3 |
| Actionability | The multi-component diagnostic example with bash commands is concrete and useful, and the four-phase process provides a clear framework. However, most guidance is procedural/philosophical rather than executable: it tells Claude what to think rather than providing specific commands or code patterns to run. The 'backward tracing' technique is deferred to another file without a quick inline example. | 2 / 3 |
| Workflow Clarity | The four-phase workflow is clearly sequenced with explicit gates between phases ('MUST complete each phase before proceeding'). Phase 4 includes validation checkpoints (create failing test, verify fix, stop-and-reassess after 3 failures), and there are clear feedback loops (hypothesis fails → return to Phase 1, 3+ fixes failed → question architecture). The escalation path is well-defined. | 3 / 3 |
| Progressive Disclosure | References to supporting files (root-cause-tracing.md, defense-in-depth.md, condition-based-waiting.md) and related skills are clearly signaled at the end. However, the main SKILL.md itself is monolithic: the rationalizations table, red flags, common excuses, and 'human partner signals' sections could be split out or removed entirely. No bundle files were provided to verify referenced paths exist. | 2 / 3 |
| Total | | 8 / 12 Passed |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
Reviewed