**Skill description:**

> Use when creating or editing skills, before deployment, to verify they work under pressure and resist rationalization - applies RED-GREEN-REFACTOR cycle to process documentation by running baseline without skill, writing to address failures, iterating to close loopholes
- Quality: 68% — Does it follow best practices?
- Impact: Pending — No eval scenarios have been run
- Validation: Passed — No known issues
Optimize this skill with Tessl: `npx tessl skill review --optimize ./plugins/customaize-agent/skills/test-skill/SKILL.md`

## Quality
### Discovery — 75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description has strong completeness, with an explicit 'Use when' clause and a distinctive niche that minimizes conflict risk. However, the actions could be described more concretely (what exactly does 'running baseline without skill' produce?), and the trigger terms lean toward internal jargon rather than natural user language, which could reduce discoverability.
#### Suggestions

- Replace jargon like 'resist rationalization' and 'close loopholes' with more natural user-facing terms like 'test skill reliability', 'validate skill behavior', or 'QA skills'.
- Add more concrete action descriptions — e.g., 'generates test scenarios, runs baseline comparisons, identifies failure cases, and iterates on skill content' instead of the abstract process description.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names a domain (skill testing/verification) and mentions some actions like 'running baseline without skill', 'writing to address failures', 'iterating to close loopholes', and 'RED-GREEN-REFACTOR cycle'. However, the actions are somewhat abstract and process-oriented rather than concrete discrete operations. | 2 / 3 |
| Completeness | The description explicitly answers both 'what' (applies RED-GREEN-REFACTOR cycle to process documentation by running baseline, writing to address failures, iterating) and 'when' ('Use when creating or editing skills, before deployment, to verify they work under pressure'). The 'Use when...' clause is present and specific. | 3 / 3 |
| Trigger Term Quality | Contains some relevant terms like 'skills', 'deployment', 'RED-GREEN-REFACTOR', and 'rationalization', but many of these are technical jargon. Users might naturally say 'test my skill' or 'verify skill works', but the description's trigger terms like 'resist rationalization' and 'close loopholes' are less natural user phrases. | 2 / 3 |
| Distinctiveness / Conflict Risk | This skill occupies a very specific niche — testing and verifying skills using a RED-GREEN-REFACTOR methodology before deployment. It is unlikely to conflict with other skills due to its unique combination of skill verification, rationalization resistance, and the specific testing cycle described. | 3 / 3 |
| **Total** | | **10 / 12 — Passed** |
### Implementation — 62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill excels at actionability and workflow clarity, providing concrete scenario templates, explicit before/after examples, and a well-structured RED-GREEN-REFACTOR cycle with validation checkpoints. However, it is severely verbose — the same concepts are restated multiple times across tables, examples, and summary sections, roughly doubling the necessary token count. The content would benefit significantly from aggressive deduplication and splitting detailed examples into referenced files.
#### Suggestions

- Eliminate redundant restatements of the RED-GREEN-REFACTOR cycle — the TDD mapping table, quick reference table, testing checklist, and 'Bottom Line' section all say the same thing. Keep one authoritative version.
- Move the detailed pressure scenario examples and pressure types table into a separate reference file (e.g., PRESSURE_SCENARIOS.md) and link to it, keeping only one concise example inline.
- Remove explanatory content Claude already knows — e.g., 'This is identical to TDD's write failing test first', 'Same cycle as code TDD, different test format', and the entire 'The Bottom Line' section, which restates the obvious.
- Consolidate the 'Common Mistakes' section into the relevant phase sections rather than repeating guidance in a separate block.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~300+ lines with significant repetition. The TDD mapping table, quick reference table, and multiple sections restate the same RED-GREEN-REFACTOR cycle. Pressure scenario examples are shown three times with minor variations. The 'Bottom Line' and 'Real-World Impact' sections add no new information. Much content explains concepts Claude already knows (what TDD is, what pressure testing means). | 1 / 3 |
| Actionability | The skill provides highly concrete, executable guidance: specific scenario templates with exact wording, before/after skill revision examples, explicit A/B/C choice formats, meta-testing prompts, and detailed rationalization table entries. The checklist format makes steps copy-paste actionable. | 3 / 3 |
| Workflow Clarity | The RED-GREEN-REFACTOR workflow is clearly sequenced with explicit validation checkpoints at each phase. The checklist includes verification steps (re-test, meta-test), feedback loops for error recovery (if agent still fails → revise and re-test), and clear success/failure criteria for each phase. The 'When Skill is Bulletproof' section provides explicit completion criteria. | 3 / 3 |
| Progressive Disclosure | References to external files exist (examples/CLAUDE_MD_TESTING.md, persuasion-principles.md, superpowers:test-driven-development), but no bundle files are provided to verify them. The skill itself is monolithic — the extensive pressure types table, common mistakes section, and multiple example scenarios could be split into separate reference files. The inline content is too long for what should be an overview. | 2 / 3 |
| **Total** | | **9 / 12 — Passed** |
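The rubric arithmetic behind both totals above can be sketched as follows, assuming each dimension is scored out of 3 points (the function and variable names here are illustrative, not part of the Tessl tooling):

```python
def rubric_total(scores):
    """Sum per-dimension scores and pair with the maximum (3 points per dimension)."""
    return sum(scores.values()), 3 * len(scores)

# Per-dimension scores taken from the Discovery and Implementation tables above.
discovery = {"Specificity": 2, "Completeness": 3,
             "Trigger Term Quality": 2, "Distinctiveness / Conflict Risk": 3}
implementation = {"Conciseness": 1, "Actionability": 3,
                  "Workflow Clarity": 3, "Progressive Disclosure": 2}

print(rubric_total(discovery))       # (10, 12)
print(rubric_total(implementation))  # (9, 12)
```

Note that the section percentages (75%, 62%) are reported separately by the tool and are not a direct fraction of these totals.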
### Validation — 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

11 / 11 checks passed. Validation for skill structure: no warnings or errors.