# Skill Review: testing-reality-checker

**Description:** Stops fantasy approvals, evidence-based certification - Default to "NEEDS WORK", requires overwhelming proof for production readiness

**Quality:** 21% (Does it follow best practices?)
**Impact:** Pending (No eval scenarios have been run)
**Validation:** Passed (No known issues)

Optimize this skill with Tessl:
`npx tessl skill review --optimize ./testing-reality-checker/skills/SKILL.md`

## Quality

### Discovery (7%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description reads more like an opinionated tagline or mission statement than a functional skill description. It fails to specify concrete actions, lacks natural trigger terms users would use, and provides no explicit guidance on when Claude should select this skill. The attitude-driven language ('fantasy approvals', 'overwhelming proof') obscures the actual functionality.
**Suggestions**

- Replace abstract language with concrete actions, e.g., 'Reviews code changes, validates test coverage, checks deployment criteria, and gates production approvals based on evidence.'
- Add an explicit 'Use when...' clause with natural trigger terms like 'code review', 'approve PR', 'deploy to production', 'release readiness check', 'QA gate'.
- Remove opinionated/editorial language ('fantasy approvals', 'overwhelming proof') and replace it with objective capability descriptions that help Claude match this skill to user requests.
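Taken together, these suggestions point toward a rewritten frontmatter description along the following lines. This is an illustrative sketch, not wording the review prescribes, and assumes the standard `name`/`description` frontmatter layout of a SKILL.md file:

```yaml
---
name: testing-reality-checker
description: >
  Reviews code changes, validates test coverage, checks deployment
  criteria, and gates production approvals based on evidence.
  Use when the user asks for a code review, PR approval, a deploy to
  production, a release readiness check, or a QA gate.
---
```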
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses vague, abstract language like 'fantasy approvals' and 'evidence-based certification' without listing concrete actions. It describes an attitude/philosophy rather than specific capabilities like 'reviews pull requests', 'checks test coverage', or 'validates deployment criteria'. | 1 / 3 |
| Completeness | The 'what' is vaguely implied (some kind of certification/approval process) but not clearly stated, and there is no 'when' clause or explicit trigger guidance at all. Missing a 'Use when...' clause caps this at 2, but the 'what' is also too weak, resulting in a 1. | 1 / 3 |
| Trigger Term Quality | The terms used ('fantasy approvals', 'overwhelming proof', 'production readiness') are not natural keywords a user would say. Users would more likely say 'code review', 'approve PR', 'deployment check', or 'release readiness'. The language is more like internal jargon or opinionated branding. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description has a somewhat distinctive tone and niche (strict approval/certification gating), which gives it some identity. However, without concrete triggers, it could overlap with any code review, QA, or deployment skill. | 2 / 3 |
| **Total** | | **5 / 12 Passed** |
### Implementation (35%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is extremely verbose, spending significant tokens on persona definition, motivational framing, and repeated emphasis on skepticism that Claude doesn't need. While it contains some actionable elements (bash commands, report template structure, automatic fail triggers), much of the guidance is qualitative and descriptive rather than providing concrete, measurable decision criteria. The content would benefit greatly from being split into a concise overview with references to detailed templates and methodology files.
**Suggestions**

- Cut the persona/identity/memory/communication-style sections entirely; they consume tokens without adding actionable guidance. Reduce them to a single sentence about defaulting to 'NEEDS WORK' status.
- Move the large report template and methodology sections to separate referenced files (e.g., REPORT_TEMPLATE.md, METHODOLOGY.md) and keep only the core workflow steps in SKILL.md.
- Add concrete, measurable pass/fail thresholds for each validation step (e.g., specific accessibility scores, exact responsive-breakpoint requirements, interaction success rates from test-results.json).
- Add an explicit feedback loop: what to do when validation fails at each step, how to communicate findings, and when to trigger re-assessment.
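As a sketch of what the proposed restructuring could look like, the trimmed SKILL.md below keeps only the workflow and delegates detail to referenced files. The file names REPORT_TEMPLATE.md and METHODOLOGY.md are the review's own examples; the step wording and thresholds are illustrative assumptions, not the skill's current content:

```markdown
# Testing Reality Checker

Default to a "NEEDS WORK" verdict; issue "PASS" only when every check below succeeds.

## Workflow
1. Reality check: run the STEP 1 bash commands; fail if any command errors.
2. QA cross-validation: compare claimed results against test-results.json; fail on any threshold miss.
3. End-to-end validation: load time <= 3 seconds, all scripted interactions succeed.

On failure at any step: stop, report the failing criterion using the report template, and re-run from step 1 after fixes.

See [REPORT_TEMPLATE.md](REPORT_TEMPLATE.md) for the report format and [METHODOLOGY.md](METHODOLOGY.md) for detailed methodology.
```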
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~200+ lines. Extensive sections on personality, identity, memory, communication style, learning patterns, and success metrics are unnecessary; Claude doesn't need motivational framing or persona backstory. The report template alone is massive and could be a separate reference file. Much content restates the same ideas (e.g., 'stop fantasy approvals' is repeated in multiple sections). | 1 / 3 |
| Actionability | The STEP 1 bash commands are concrete and executable, and the report template provides specific structure. However, much of the content is descriptive rather than instructive; markdown templates with bracketed placeholders like '[Honest description]' are not truly executable. The methodology sections describe what to analyze but rely on vague directives like 'review' and 'analyze' without concrete decision criteria. | 2 / 3 |
| Workflow Clarity | There is a 3-step process (Reality Check → QA Cross-Validation → End-to-End Validation) that provides sequence, but steps 2 and 3 lack concrete validation checkpoints or explicit pass/fail criteria. The 'AUTOMATIC FAIL' triggers provide some decision boundaries but are qualitative rather than measurable ('>3 second load times' is the only concrete threshold). No feedback loop for error recovery is defined. | 2 / 3 |
| Progressive Disclosure | There is a reference to `ai/agents/integration.md` at the end for detailed methodology, which is good. However, the massive report template and multiple methodology sections should be in separate reference files rather than inline. The skill tries to be both an overview and a complete reference, resulting in a monolithic document. | 2 / 3 |
| **Total** | | **7 / 12 Passed** |
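To make the workflow-clarity point concrete, here is a minimal sketch of how the skill's qualitative 'AUTOMATIC FAIL' triggers could become measurable criteria applied to a parsed test-results.json. The metric names and thresholds (other than the 3-second load time, which the skill already states) are assumptions for illustration, not fields Tessl or the skill actually defines:

```python
# Hypothetical thresholds; only max_load_time_seconds comes from the
# skill itself, the rest are illustrative assumptions.
THRESHOLDS = {
    "max_load_time_seconds": 3.0,
    "min_interaction_success_rate": 0.95,
    "min_accessibility_score": 90,
}

def evaluate(results: dict) -> tuple[str, list[str]]:
    """Return ('PASS' | 'NEEDS WORK', failures) for one results dict."""
    failures = []
    if results.get("load_time_seconds", float("inf")) > THRESHOLDS["max_load_time_seconds"]:
        failures.append("load time exceeds 3s")
    if results.get("interaction_success_rate", 0.0) < THRESHOLDS["min_interaction_success_rate"]:
        failures.append("interaction success rate below 95%")
    if results.get("accessibility_score", 0) < THRESHOLDS["min_accessibility_score"]:
        failures.append("accessibility score below 90")
    # Default to NEEDS WORK unless every criterion passes, matching
    # the skill's evidence-first stance.
    return ("PASS" if not failures else "NEEDS WORK", failures)

status, reasons = evaluate({
    "load_time_seconds": 2.1,
    "interaction_success_rate": 0.97,
    "accessibility_score": 94,
})
print(status)  # PASS
```

Each entry in the returned failure list doubles as the finding to report back, which is the kind of explicit feedback loop the suggestions call for.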
### Validation (90%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

**Validation for skill structure: 10 / 11 Passed**
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| **Total** | | **10 / 11 Passed** |
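The one warning is straightforward to clear: move any non-spec frontmatter key under `metadata`, as the check itself suggests. In the sketch below, `owner` is a hypothetical stand-in for whichever unknown key triggered the warning:

```yaml
---
name: testing-reality-checker
description: ...
metadata:
  owner: qa-team   # previously a top-level key unknown to the spec
---
```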