
hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".



AI Eval Report

Turns evaluation data — or the absence of it — into a decision-ready AI quality report. The core principle: in AI products, the eval framework is the product spec. You can't optimise what you can't measure, and stakeholders can't trust what they can't see measured.

Choose the deliverable

Match the deliverable to where the project actually is:

  1. No eval process exists yet → produce an Eval Framework Proposal (golden dataset design + metric definitions). Read references/golden-dataset.md before drafting.
  2. Benchmark data already exists → produce a Quality Evaluation Report using the template below.
  3. Stakeholders are confused by or losing trust in the AI → produce a Stakeholder Explainer. Read references/stakeholder-explainer.md before drafting.

When the request is broad ("evaluate our AI"), default to deliverables 1 and 2 together: a framework without a report has no teeth, and a report without a framework has no credibility.
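The selection rule above can be sketched as a small function. This is only an illustration: the enum, the function name `choose_deliverables`, and its boolean parameters are hypothetical and not part of the skill.

```python
from enum import Enum


class Deliverable(Enum):
    FRAMEWORK_PROPOSAL = "Eval Framework Proposal"
    QUALITY_REPORT = "Quality Evaluation Report"
    STAKEHOLDER_EXPLAINER = "Stakeholder Explainer"


def choose_deliverables(
    has_eval_process: bool,
    has_benchmark_data: bool,
    stakeholders_confused: bool,
    broad_request: bool = False,
) -> list[Deliverable]:
    """Map the project's situation to deliverables (hypothetical helper)."""
    if broad_request:
        # Broad requests default to deliverables 1 and 2 together.
        return [Deliverable.FRAMEWORK_PROPOSAL, Deliverable.QUALITY_REPORT]
    chosen = []
    if not has_eval_process:
        chosen.append(Deliverable.FRAMEWORK_PROPOSAL)
    if has_benchmark_data:
        chosen.append(Deliverable.QUALITY_REPORT)
    if stakeholders_confused:
        chosen.append(Deliverable.STAKEHOLDER_EXPLAINER)
    return chosen
```

An empty result means none of the three situations applies, which is a cue to ask the user what they need rather than guess.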

Workflow

Copy this checklist and check off items as you complete them:

Eval Report Progress:
- [ ] Step 1: Identify the unit of evaluation
- [ ] Step 2: Pin down the definition of "correct"
- [ ] Step 3: Collect inputs (dataset, scores, iteration history)
- [ ] Step 4: Select metrics and translate them to business impact
- [ ] Step 5: Draft the report from the template
- [ ] Step 6: Run the quality review loop

Step 1: Identify the unit of evaluation. What exactly is being scored — a single answer, a document summary, a multi-step agent run, a classification? Reports that mix units produce numbers nobody can act on. If the user hasn't specified, ask before proceeding.

Step 2: Pin down the definition of "correct". Who decides an output is right: domain experts, the client, an automated comparison against reference answers, or an LLM judge? Name the authority explicitly in the report. If no definition exists, this becomes the first recommendation, not a footnote.

Step 3: Collect inputs. Gather whatever exists: benchmark scores, golden dataset, iteration history, incident examples, stakeholder complaints. Missing inputs are findings in themselves — record them in the Gaps section rather than silently working around them.

Step 4: Select metrics. Read references/metrics.md for metric selection by task type and for translating technical metrics into business language. Every technical number in the report needs a business consequence next to it.

Step 5: Draft the report. Use the template below. For a copy-ready file version, use assets/report-template.md.

Step 6: Run the quality review loop. Check the draft against the checklist in the Quality review loop section at the end of this file. Fix every miss, then re-check. Only deliver when all checks pass.
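Steps 3 and 4 can be sketched in a few lines: record missing inputs as gaps, compute a metric, and attach a business consequence to it. This is a minimal sketch under assumptions. The input names mirror the list in Step 3, but the function names, the weekly volume, and the consequence wording are hypothetical, and plain accuracy stands in for whatever metric references/metrics.md recommends for the task.

```python
# Inputs the skill asks you to gather in Step 3.
EXPECTED_INPUTS = [
    "benchmark scores",
    "golden dataset",
    "iteration history",
    "incident examples",
    "stakeholder complaints",
]


def find_gaps(available: dict[str, object]) -> list[str]:
    """Return expected inputs that are missing or empty, for the Gaps section."""
    return [name for name in EXPECTED_INPUTS if not available.get(name)]


def accuracy(graded: list[bool]) -> float:
    """Share of cases judged correct by the named authority (Step 2)."""
    if not graded:
        raise ValueError("no graded cases: report this as a gap, not as 0%")
    return sum(graded) / len(graded)


def with_consequence(name: str, value: float, weekly_volume: int, consequence: str) -> str:
    """Pair a technical number with its business consequence (Step 4)."""
    affected = round((1 - value) * weekly_volume)
    return f"{name} {value:.0%}: roughly {affected} of {weekly_volume} weekly cases {consequence}."


# Illustrative data only: 9 of 10 cases correct, 200 cases per week assumed.
score = accuracy([True] * 9 + [False])
line = with_consequence("Answer accuracy", score, 200, "need manual correction")
# line == "Answer accuracy 90%: roughly 20 of 200 weekly cases need manual correction."
```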

Report structure

Use this template. Keep the section order — executives read top-down and stop early:

# AI Quality Evaluation Report — [Product / System]
**Period:** [dates] · **Iteration:** [N] · **Owner:** [name]

## Executive summary
[3 sentences max: current quality vs target, trend vs last iteration, the one decision needed.]

## What we measure and why
[Unit of evaluation, definition of "correct", who validates, dataset size and version.]

## Results
| Metric | Baseline | Previous | Current | Target | Trend |
|--------|----------|----------|---------|--------|-------|

## What changed this iteration
[Interventions made — data changes, prompt versions, model settings — each linked to its measured effect.]

## Regressions
[Anything that got worse, even slightly. Never omit this section — write "none detected" if clean.]

## Gaps and risks
[What isn't measured yet, dataset blind spots, drift risks.]

## Recommendations
[Numbered, each with owner and expected impact. Answers "what do we do Monday".]
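
The Results table in the template above can be filled from structured data so that the Trend column and the Regressions section are never written by hand. The sketch below assumes metrics expressed in percent where higher is better. The class, the function names, and the 0.5pp tolerance are hypothetical, and the numbers in the test data are illustrative.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    baseline: float
    previous: float
    current: float
    target: float


def trend(previous: float, current: float, tolerance: float = 0.5) -> str:
    """Label the change since the previous iteration (tolerance in pp is an assumption)."""
    delta = current - previous
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"


def results_row(m: MetricResult) -> str:
    """Render one row of the Results table in the template's column order."""
    delta = m.current - m.previous
    cells = [
        m.name,
        f"{m.baseline:.0f}%",
        f"{m.previous:.0f}%",
        f"{m.current:.0f}%",
        f"{m.target:.0f}%",
        f"{trend(m.previous, m.current)} ({delta:+.0f}pp)",
    ]
    return "| " + " | ".join(cells) + " |"


def regressions(results: list[MetricResult]) -> list[str]:
    """Names of metrics that got worse, even slightly, since the previous iteration."""
    return [m.name for m in results if m.current < m.previous]
```

Because `regressions` uses a strict comparison rather than the trend tolerance, a metric can show as "flat" in the table and still appear under Regressions, which matches the template's "even slightly" rule.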

Example

Input: "We improved our RAG pipeline's accuracy from 70% to 91% over 6 weeks. Client wants a quality report."

Output (executive summary section):

## Executive summary
Answer accuracy on the 48-case golden dataset reached 91% (target: 90%), up from
70% at baseline six weeks ago. The largest gain came from source-data cleanup
(+12pp), followed by retrieval tuning (+6pp) and prompt iteration (+3pp).
Decision needed: approve human-in-the-loop review for the remaining ~9% of cases
before production sign-off.

Note what makes this work: the number has a baseline and a target, gains are attributed to specific interventions, and it ends with a decision — not a celebration.
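
A quick arithmetic check keeps the attribution honest: the attributed gains should add up to the total change (here 12 + 6 + 3 = 21pp, matching 91 - 70). The helper below is a sketch using the numbers from the example. The function name is hypothetical, and it assumes gains were measured so that they can be added together, which may not hold when interventions interact.

```python
def attribution_summary(
    baseline: float, current: float, gains: dict[str, float]
) -> tuple[str, float]:
    """Return a largest-first gains phrase and the unexplained remainder in pp."""
    total_change = current - baseline
    unexplained = total_change - sum(gains.values())
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    phrase = ", ".join(f"{name} ({gain:+g}pp)" for name, gain in ranked)
    return phrase, unexplained


phrase, unexplained = attribution_summary(
    baseline=70,
    current=91,
    gains={"retrieval tuning": 6, "source-data cleanup": 12, "prompt iteration": 3},
)
# phrase == "source-data cleanup (+12pp), retrieval tuning (+6pp), prompt iteration (+3pp)"
# unexplained == 0
```

A non-zero remainder belongs in the report as an unattributed change rather than being folded into the nearest intervention.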

Quality review loop

Before delivering, verify the draft against each check. If any check fails, revise and re-verify:

  • Every number has a baseline, a comparison point, and a target — a lone "91%" means nothing.
  • Gains are attributed to specific interventions, so the reader learns what actually moved quality.
  • The Regressions section exists even when empty — silent regressions destroy trust faster than visible failures.
  • Each recommendation has an owner and expected impact.
  • Client-facing sections contain no unexplained jargon (see references/stakeholder-explainer.md for phrasing probabilistic behaviour).
  • The report answers "what do we do next", not just "how did we do".
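
The structural checks in this list can be partly automated before the human read-through. The sketch below assumes the draft is a markdown string using the level-2 headings from the report template. The function names are hypothetical, and it covers only section presence, emptiness, and order, not attribution, jargon, or the quality of recommendations.

```python
import re

# Section headings from the report template, in the required order.
REQUIRED_SECTIONS = [
    "Executive summary",
    "What we measure and why",
    "Results",
    "What changed this iteration",
    "Regressions",
    "Gaps and risks",
    "Recommendations",
]


def section_bodies(report: str) -> dict[str, str]:
    """Split a markdown report into {heading: body} for level-2 headings."""
    bodies: dict[str, str] = {}
    current = None
    for line in report.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            bodies[current] = ""
        elif current is not None:
            bodies[current] += line + "\n"
    return bodies


def review(report: str) -> list[str]:
    """Return a list of failed structural checks; an empty list means all passed."""
    failures = []
    bodies = section_bodies(report)
    for name in REQUIRED_SECTIONS:
        if name not in bodies:
            failures.append(f"missing section: {name}")
        elif not bodies[name].strip():
            failures.append(f"empty section: {name}")
    found = [heading for heading in bodies if heading in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in bodies]
    if found != expected:
        failures.append("sections out of order")
    return failures
```

Run `review(draft)` after each revision and deliver only when it returns an empty list and the remaining checks have been verified by hand.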
