Name: hannaklim/ai-eval-report
Rating: 80 (1 review)
Author: hannaklim

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".

Quality

100%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Quality

Content

100%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

A well-structured, instruction-only skill body that is lean, highly actionable, and properly split across one-level-deep references. The workflow is clearly sequenced with an explicit quality review feedback loop, and all referenced bundle files exist.

Dimension	Reasoning	Score
Conciseness	The body is lean and assumes Claude's competence: it does not explain what AI or evals are conceptually, and every section is needed, matching the 'every token earns its place' anchor.	3 / 3
Actionability	Provides a concrete report template with exact sections, a worked input/output example, a copy-ready template asset, and specific checks ('Every number has a baseline, a comparison point, and a target'). Per scoring notes, absence of code is not penalized for instruction-only skills.	3 / 3
Workflow Clarity	A 6-step sequenced workflow with a copy-paste checklist, plus an explicit validate→fix→re-check feedback loop ('Fix every miss, then re-check. Only deliver when all checks pass') matches the top anchor for clear sequences with checkpoints.	3 / 3
Progressive Disclosure	Clear overview with one-level-deep, well-signaled references — golden-dataset.md, metrics.md, stakeholder-explainer.md, report-template.md (all verified to exist) — each linked at the point of need and summarized in a Reference files section.	3 / 3
	Total	12 / 12 Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

A high-quality description that pairs concrete capabilities with an explicit, naturally-worded 'Use when' clause and good trigger coverage. It clearly states both what the skill does and when to invoke it, with minimal risk of conflicting with other skills.

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions — 'designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries' — matching the highest anchor for specific capabilities.	3 / 3
Completeness	Clearly answers both 'what' (generates reports, designs datasets, defines metrics) and 'when' via an explicit 'Use when...' clause with multiple triggers, satisfying the top anchor.	3 / 3
Trigger Term Quality	Includes natural user phrasings a person would actually say, e.g. "how good is our AI", "accuracy report", "eval results", "benchmark the model", giving strong trigger coverage.	3 / 3
Distinctiveness Conflict Risk	Has a clear AI-eval-reporting niche with distinct triggers ('how good is our AI', 'benchmark the model') unlikely to fire for unrelated skills.	3 / 3
	Total	12 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Reviewed

3 days ago
