hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".


AI Quality Evaluation Report — [Product / System name]

Name: hannaklim/ai-eval-report
Rating: 80 (1 review)
Author: hannaklim

Period: [start – end] · Iteration: [N] · Dataset version: [vX] · Owner: [name]

Executive summary

[3 sentences max: current quality vs target, trend vs last iteration, the one decision needed from the reader.]

What we measure and why

Unit of evaluation: [single answer / summary / classification / agent run]
Definition of "correct": [criteria]
Validated by: [domain authority — names/roles, validation date]
Golden dataset: [N cases, version, coverage note]
Scoring method: [exact match / rubric / expert review / LLM judge] · Partial-credit policy: [strict / graded]
Key trade-off: [e.g. optimised for recall because a missed case costs more than a false alarm]
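
As an illustration of how the scoring method and partial-credit policy change the headline number, here is a minimal sketch in Python. The `GoldenCase` structure, the two scorer functions, and the sample cases are hypothetical and not part of the template; token overlap stands in for a graded rubric only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    """One golden-dataset entry paired with the system's output (hypothetical shape)."""
    case_id: str
    expected: str
    got: str


def exact_match(case: GoldenCase) -> float:
    """Strict policy: full credit only when the normalised strings are identical."""
    return 1.0 if case.expected.strip().lower() == case.got.strip().lower() else 0.0


def token_overlap(case: GoldenCase) -> float:
    """Graded policy (assumed stand-in for a rubric): share of expected tokens present."""
    expected = set(case.expected.lower().split())
    got = set(case.got.lower().split())
    return len(expected & got) / len(expected) if expected else 0.0


def accuracy(cases: list[GoldenCase], scorer: Callable[[GoldenCase], float]) -> float:
    """Mean score over the dataset; an empty dataset is an error, not 0%."""
    if not cases:
        raise ValueError("golden dataset is empty")
    return sum(scorer(c) for c in cases) / len(cases)


# Illustrative cases only, not real evaluation data.
cases = [
    GoldenCase("c1", "Paris", "paris"),
    GoldenCase("c2", "refund within 30 days", "refund within 14 days"),
]

strict = accuracy(cases, exact_match)
graded = accuracy(cases, token_overlap)
print(f"strict={strict:.3f} graded={graded:.3f}")
```

On these two cases the strict policy reports 50% and the graded one 87.5%, even though the second answer contains a materially wrong number. That gap is why the report should state the policy next to every accuracy figure.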

Results

Metric	Baseline	Previous	Current	Target	Trend
[Answer accuracy]	[70%]	[85%]	[91%]	[90%]	↑
[Faithfulness]
[Consistency]

Business translation: [metric change → operational effect → money or risk]
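
The Trend column and the target check can be derived mechanically from the table's numbers. The sketch below is one way to do it, using the placeholder figures from the example row; the function name, the dictionary keys, and the use of "→" for "no change" are assumptions, not something the template prescribes.

```python
def results_row(name: str, baseline: float, previous: float,
                current: float, target: float) -> dict:
    """Build one Results row from metric values expressed as fractions (0.91 = 91%)."""
    # Change versus the previous iteration, in percentage points.
    delta_pp = round((current - previous) * 100, 1)
    # Total change since the baseline, in percentage points.
    total_pp = round((current - baseline) * 100, 1)
    if delta_pp > 0:
        trend = "↑"
    elif delta_pp < 0:
        trend = "↓"
    else:
        trend = "→"  # assumed symbol for a flat result
    return {
        "metric": name,
        "delta_pp": delta_pp,
        "total_pp": total_pp,
        "trend": trend,
        "meets_target": current >= target,
    }


# Placeholder figures taken from the template's example row.
row = results_row("Answer accuracy", 0.70, 0.85, 0.91, 0.90)
print(row)
```

Reporting the change in percentage points rather than as a relative percentage keeps the table consistent with the "+12pp" style used for interventions.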

What changed this iteration

Intervention	Measured effect
[Source-data cleanup]	[+12pp accuracy]
[Retrieval tuning]	[+6pp accuracy]
[Prompt v4]	[+3pp accuracy]

Regressions

[Anything that got worse, even slightly. Write "None detected on dataset vX" if clean — never omit this section.]
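
An aggregate metric can improve while individual cases get worse, so a per-case comparison is a reasonable way to fill this section. The following is a minimal sketch that assumes each iteration's results are stored as a mapping from case ID to pass/fail; the function names and the output wording for regressed cases are hypothetical.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed in the previous iteration and fail now.

    Cases missing from `current` are ignored here; a dataset change should be
    reported separately as a coverage difference, not as a regression.
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and current.get(case_id) is False
    )


def regressions_section(previous: dict[str, bool], current: dict[str, bool],
                        dataset_version: str) -> str:
    """Render the section body, never leaving it empty."""
    regressed = find_regressions(previous, current)
    if not regressed:
        return f"None detected on dataset {dataset_version}"
    return "Regressed cases: " + ", ".join(regressed)


# Illustrative results only: accuracy is unchanged (2 of 3), yet c2 regressed.
previous = {"c1": True, "c2": True, "c3": False}
current = {"c1": True, "c2": False, "c3": True}

print(regressions_section(previous, current, "v3"))
```

In the example both iterations score 2 out of 3, so the Results table alone would show no change even though one previously correct case now fails.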

Gaps and risks

[What isn't measured yet]
[Dataset blind spots]
[Drift risks — source data, model version, usage patterns]

Recommendations

[Action] — Owner: [name] — Expected impact: [estimate]
[Action] — Owner: [name] — Expected impact: [estimate]

Appendix: failed cases this iteration

Case ID	Input summary	Expected	Got	Failure type
