Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".
Turns evaluation data — or the absence of it — into a decision-ready AI quality report. The core principle: in AI products, the eval framework is the product spec. You can't optimise what you can't measure, and stakeholders can't trust what they can't see measured.
Match the deliverable to where the project actually is:
When the request is broad ("evaluate our AI"), default to 1 + 2 together: a framework without a report has no teeth, and a report without a framework has no credibility.
Copy this checklist and check off items as you complete them:
Eval Report Progress:
- [ ] Step 1: Identify the unit of evaluation
- [ ] Step 2: Pin down the definition of "correct"
- [ ] Step 3: Collect inputs (dataset, scores, iteration history)
- [ ] Step 4: Select metrics and translate them to business impact
- [ ] Step 5: Draft the report from the template
- [ ] Step 6: Run the quality review loop

Step 1: Identify the unit of evaluation. What exactly is being scored: a single answer, a document summary, a multi-step agent run, or a classification? Reports that mix units produce numbers nobody can act on. If the user hasn't specified, ask before proceeding.
Step 2: Pin down the definition of "correct". Who decides an output is right: domain experts, the client, an automated comparison against reference answers, or an LLM judge? Name the authority explicitly in the report. If no definition exists, this becomes the first recommendation, not a footnote.
Step 3: Collect inputs. Gather whatever exists: benchmark scores, golden dataset, iteration history, incident examples, stakeholder complaints. Missing inputs are findings in themselves — record them in the Gaps section rather than silently working around them.
Step 4: Select metrics. Read references/metrics.md for metric selection by task type and for translating technical metrics into business language. Every technical number in the report needs a business consequence next to it.
Step 5: Draft the report. Use the template below. For a copy-ready file version, use assets/report-template.md.
Step 6: Run the quality review loop. Check the draft against the review checklist at the end of this file. Fix every miss, then re-check. Only deliver when all checks pass.
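Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `GoldenCase`, `exact_match`, and `score` are hypothetical names, the exact-match comparator merely stands in for whatever authority Step 2 names, and the three cases are invented sample data. Cases with no output are returned separately so they can be recorded as gaps rather than silently dropped.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    expected: str  # reference answer agreed by whoever defines "correct"


def exact_match(expected: str, actual: str) -> bool:
    # Assumed placeholder comparator: case- and whitespace-insensitive match.
    return expected.strip().lower() == actual.strip().lower()


def score(cases, outputs, is_correct=exact_match):
    """Return (accuracy, failed_ids, missing_ids) for one unit of evaluation."""
    missing = [c.case_id for c in cases if c.case_id not in outputs]
    scored = [c for c in cases if c.case_id in outputs]
    failed = [
        c.case_id for c in scored if not is_correct(c.expected, outputs[c.case_id])
    ]
    accuracy = (len(scored) - len(failed)) / len(scored) if scored else 0.0
    return accuracy, failed, missing


# Invented sample data, for illustration only.
cases = [
    GoldenCase("c1", "What is the refund window?", "30 days"),
    GoldenCase("c2", "Which plan includes SSO?", "Enterprise"),
    GoldenCase("c3", "Is phone support available?", "No"),
]
outputs = {"c1": "30 days", "c2": "Business"}  # c3 has no output yet

accuracy, failed, missing = score(cases, outputs)
print(f"accuracy={accuracy:.0%} failed={failed} missing={missing}")
```

Passing a different `is_correct` callable (an expert-label lookup, for instance) changes the definition of "correct" without touching the scoring loop.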
Use this template. Keep the section order — executives read top-down and stop early:
# AI Quality Evaluation Report — [Product / System]
**Period:** [dates] · **Iteration:** [N] · **Owner:** [name]
## Executive summary
[3 sentences max: current quality vs target, trend vs last iteration, the one decision needed.]
## What we measure and why
[Unit of evaluation, definition of "correct", who validates, dataset size and version.]
## Results
| Metric | Baseline | Previous | Current | Target | Trend |
|--------|----------|----------|---------|--------|-------|
## What changed this iteration
[Interventions made — data changes, prompt versions, model settings — each linked to its measured effect.]
## Regressions
[Anything that got worse, even slightly. Never omit this section — write "none detected" if clean.]
## Gaps and risks
[What isn't measured yet, dataset blind spots, drift risks.]
## Recommendations
[Numbered, each with owner and expected impact. Answers "what do we do Monday".]

Input: "We improved our RAG pipeline's accuracy from 70% to 91% over 6 weeks. Client wants a quality report."
Output (executive summary section):
## Executive summary
Answer accuracy on the 48-case golden dataset reached 91% (target: 90%), up from
70% at baseline six weeks ago. The largest gain came from source-data cleanup
(+12pp), followed by retrieval tuning (+6pp) and prompt iteration (+3pp).
Decision needed: approve human-in-the-loop review for the remaining ~9% of cases
before production sign-off.

Note what makes this work: the number has a baseline and a target, gains are attributed to specific interventions, and it ends with a decision, not a celebration.
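Before quoting attributed gains, it can help to confirm that they add up to the headline change. The sketch below uses the figures from the example summary; `attribution_check` and its `tolerance` parameter are hypothetical, introduced only for illustration.

```python
def attribution_check(baseline, current, gains, tolerance=0.5):
    """Compare attributed gains (percentage points) with the headline change."""
    headline = current - baseline
    attributed = sum(gains.values())
    # Consistent if any unexplained remainder is within the tolerance (in pp).
    return headline, attributed, abs(headline - attributed) <= tolerance


# Figures taken from the example executive summary.
gains = {"source-data cleanup": 12, "retrieval tuning": 6, "prompt iteration": 3}
headline, attributed, consistent = attribution_check(70, 91, gains)
print(f"headline +{headline}pp, attributed +{attributed}pp, consistent={consistent}")
# prints: headline +21pp, attributed +21pp, consistent=True
```

If the attributed gains do not match the headline change, the difference is worth stating in the report rather than rounding away.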
Before delivering, verify the draft against each check. If any check fails, revise and re-verify: