hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".

Designing a Golden Dataset

Name: hannaklim/ai-eval-report
Rating: 80 (1 review)
Author: hannaklim

A golden dataset is the versioned set of inputs with validated reference answers that every iteration is scored against. It converts "the AI feels better" into "the AI scored 91% on v3 of the dataset".

Design principles

Representative coverage beats size. Start with 30–50 cases that mirror real usage distribution: common cases, known hard cases, and edge cases in roughly the proportion users hit them. A 500-case dataset of easy questions inflates scores and hides real weaknesses.
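One way to check representativeness is to compare the dataset's category mix against the share of each category in real usage. The sketch below assumes each case carries a category label and that usage shares are known; the category names, the shares, and the 10-point tolerance are illustrative assumptions, not values from this guide.

```python
from collections import Counter


def coverage_gaps(case_categories, usage_share, tolerance=0.10):
    """Return categories whose dataset share is off from the usage share by more than tolerance.

    Values are (actual_share, expected_share) pairs.
    """
    counts = Counter(case_categories)
    total = len(case_categories)
    gaps = {}
    for category, expected in usage_share.items():
        actual = counts.get(category, 0) / total
        if abs(actual - expected) > tolerance:
            gaps[category] = (round(actual, 2), expected)
    return gaps


# Hypothetical 40-case dataset that over-represents common (easy) cases.
cases = ["common"] * 32 + ["hard"] * 4 + ["edge"] * 4
# Assumed production mix; in practice this would come from usage logs.
usage = {"common": 0.60, "hard": 0.25, "edge": 0.15}

gaps = coverage_gaps(cases, usage)
print(gaps)
```

Here the common cases are over-represented and the hard cases under-represented, which is the pattern that inflates scores.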

Reference answers need a named authority. Each reference answer must be validated by someone accountable for correctness: domain experts (e.g. physicians for medical content, lawyers for legal content), the client's own specialists, or both. Record who validated what. An unvalidated reference answer is just another opinion, with no more authority than the output it is meant to judge.

Separate the three validation roles:

Role	Owns
Domain expert	Factual correctness of reference answers
Data scientist / engineer	Scoring consistency and methodology
Product manager	Process, sign-off loop, and dispute resolution
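Recording who validated what can be as simple as two extra fields on each case. The record below is a minimal sketch; the class name, field names, and sample values are hypothetical and only illustrate the idea of attaching a named validator and a date to every reference answer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    reference_answer: str
    validated_by: Optional[str] = None   # named domain authority (assumed field)
    validated_on: Optional[date] = None  # date of sign-off (assumed field)


def unvalidated(cases: List[GoldenCase]) -> List[str]:
    """Return ids of cases that lack a named validator or a validation date."""
    return [c.case_id for c in cases
            if not c.validated_by or c.validated_on is None]


cases = [
    GoldenCase("case-001", "hypothetical input A", "hypothetical reference A",
               validated_by="domain-expert-1", validated_on=date(2025, 1, 15)),
    GoldenCase("case-002", "hypothetical input B", "hypothetical reference B"),
]
print(unvalidated(cases))  # ['case-002']
```

A check like this could gate the freeze step, so that no dataset version ships with unvalidated references.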

Define the partial-credit policy upfront. Decide before scoring: does an incomplete answer count as a pass? In regulated domains (healthcare, finance, legal), default to strict — "mostly right" is wrong. In creative or exploratory products, graded scoring (0 / 0.5 / 1) is often more informative. Whichever policy applies, write it into the framework so scores stay comparable across iterations.
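To see how much the policy choice matters, the same per-case grades can be aggregated under both policies. This is a sketch that assumes graders record 0, 0.5, or 1 per case; the policy names "strict" and "graded" are labels chosen here for illustration.

```python
def apply_policy(grade: float, policy: str) -> float:
    """Map a raw grade (0, 0.5, or 1) to the credit awarded under a policy."""
    if grade not in (0.0, 0.5, 1.0):
        raise ValueError(f"unexpected grade: {grade}")
    if policy == "strict":
        # Partial answers count as failures.
        return 1.0 if grade == 1.0 else 0.0
    if policy == "graded":
        return grade
    raise ValueError(f"unknown policy: {policy}")


def dataset_score(grades, policy):
    """Average credit across all cases under the chosen policy."""
    return sum(apply_policy(g, policy) for g in grades) / len(grades)


# Hypothetical grades for four cases: two full passes, one partial, one fail.
grades = [1.0, 1.0, 0.5, 0.0]
strict_score = dataset_score(grades, "strict")
graded_score = dataset_score(grades, "graded")
print(strict_score, graded_score)  # 0.5 0.625
```

The same outputs score 50% or 62.5% depending on the policy, which is why the policy has to be fixed before iterations are compared.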

Version the dataset like code. Any change to cases or reference answers gets a new version number, and scores are only comparable within the same version. When reporting across versions, state the version change explicitly.
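The "only comparable within the same version" rule can be enforced in code rather than left to convention. The following sketch assumes each eval run records the dataset version it was scored on; the `EvalRun` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    system_label: str      # e.g. a release or prompt version (assumed field)
    dataset_version: str   # golden dataset version the run was scored on
    score: float           # fraction of credit earned, 0.0 to 1.0


def score_delta(before: EvalRun, after: EvalRun) -> float:
    """Return the score change, refusing to compare across dataset versions."""
    if before.dataset_version != after.dataset_version:
        raise ValueError(
            f"scores on {before.dataset_version} and {after.dataset_version} "
            "are not comparable; re-score both systems on one version"
        )
    return after.score - before.score


run_a = EvalRun("prompt-1", "v1", 0.75)
run_b = EvalRun("prompt-2", "v1", 0.875)
run_c = EvalRun("prompt-2", "v2", 0.80)

print(score_delta(run_a, run_b))  # 0.125
```

Calling `score_delta(run_a, run_c)` raises a `ValueError`, which forces the version change to be handled explicitly instead of silently reported as a quality change.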

Build sequence

1. Pull 30–50 real inputs from production usage, support tickets, or client workflows, not invented examples.
2. Draft reference answers.
3. Get each reference answer validated by the domain authority; log validator and date.
4. Agree the scoring method (exact match, rubric, expert review, LLM judge) and the partial-credit policy.
5. Freeze as v1. Score the current system to establish the baseline before changing anything.
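The final step, scoring the current system against the frozen dataset, might look like the sketch below. It assumes exact-match scoring (one of the methods listed above) and treats the system as any callable from input text to output text; `stub_system` and the sample cases are placeholders, not a real pipeline.

```python
from typing import Callable, Dict, List, Tuple

# (case_id, input_text, reference_answer); a hypothetical minimal case shape.
Case = Tuple[str, str, str]


def exact_match(output: str, reference: str) -> float:
    """Score 1.0 when output equals the reference, ignoring case and outer whitespace."""
    return 1.0 if output.strip().casefold() == reference.strip().casefold() else 0.0


def run_baseline(cases: List[Case], system: Callable[[str], str],
                 dataset_version: str = "v1") -> Dict:
    """Score every case once and record the dataset version with the result."""
    per_case = {cid: exact_match(system(text), ref) for cid, text, ref in cases}
    return {
        "dataset_version": dataset_version,
        "score": sum(per_case.values()) / len(per_case),
        "per_case": per_case,
    }


# Stand-in for the real system: returns canned outputs for known inputs.
canned = {"input-1": "Answer A ", "input-2": "a guess"}


def stub_system(text: str) -> str:
    return canned.get(text, "")


cases = [("c1", "input-1", "answer a"), ("c2", "input-2", "not found")]
baseline = run_baseline(cases, stub_system)
print(baseline["score"])  # 0.5
```

Storing the per-case results alongside the aggregate makes it possible to see later which cases improved or regressed, not just the headline number.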

Maintenance

Add new failure cases from production to the next dataset version — real failures are the most valuable cases you'll ever get.
Re-validate reference answers on a set cadence (quarterly is a sensible default) — domain knowledge drifts.
Retire cases the system passes trivially for several consecutive iterations, replacing them with harder ones, and note the swap in the version changelog.
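Two of these maintenance checks lend themselves to simple automation: flagging references that are overdue for re-validation and listing cases that have passed for several iterations in a row. The sketch below assumes a 90-day cadence as a stand-in for "quarterly" and a streak of three as a stand-in for "several"; both thresholds, and the data shapes, are assumptions.

```python
from datetime import date, timedelta


def retirement_candidates(history, streak=3):
    """history maps case_id -> list of per-iteration scores, oldest first.

    Returns cases with a full score in each of the last `streak` iterations.
    """
    return sorted(
        cid for cid, scores in history.items()
        if len(scores) >= streak and all(s == 1.0 for s in scores[-streak:])
    )


def due_for_revalidation(validated_on, today, cadence_days=90):
    """validated_on maps case_id -> date of last validation."""
    limit = timedelta(days=cadence_days)
    return sorted(cid for cid, d in validated_on.items() if today - d > limit)


# Hypothetical score history across iterations.
history = {
    "c1": [1.0, 1.0, 1.0],
    "c2": [1.0, 0.0, 1.0, 1.0],
    "c3": [0.0, 1.0, 1.0, 1.0],
}
print(retirement_candidates(history))  # ['c1', 'c3']

validated = {"c1": date(2024, 9, 1), "c2": date(2024, 11, 15)}
print(due_for_revalidation(validated, today=date(2025, 1, 1)))  # ['c1']
```

A passing streak only makes a case a candidate. Whether it passes trivially is still a judgment call for the people who own the dataset.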

Common failure modes

Dataset written by the same people who built the pipeline — blind spots transfer directly. Include at least one outside validator.
Only happy-path cases — score looks great, production disappoints. Deliberately include ambiguous inputs, malformed inputs, and questions with no answer in the source data (the correct output is "not found", and systems that guess instead should fail the case).
Moving-target references — reference answers quietly edited to match system output. Versioning and named validators prevent this.
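Quiet edits to reference answers can be made visible by storing a fingerprint of the references when a version is frozen and re-checking it before each scoring run. This is one possible mechanism, not something the guide prescribes; it is a sketch using a SHA-256 hash over a canonical JSON dump, with hypothetical case ids and answers.

```python
import hashlib
import json


def fingerprint(references: dict) -> str:
    """Hash a case_id -> reference_answer mapping in a key-order-independent way."""
    canonical = json.dumps(references, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_cases(frozen: dict, current: dict) -> list:
    """List case ids whose reference was added, removed, or edited since the freeze."""
    return sorted(
        cid for cid in frozen.keys() | current.keys()
        if frozen.get(cid) != current.get(cid)
    )


# References as frozen for a dataset version (hypothetical content).
frozen = {"c1": "reference A", "c2": "not found"}
# The same mapping after someone edited c2 to match what the system outputs.
current = {"c1": "reference A", "c2": "a plausible guess"}

if fingerprint(frozen) != fingerprint(current):
    print("references changed since freeze:", changed_cases(frozen, current))
```

If the fingerprint differs and no new dataset version was declared, the edit should either be reverted or released as a new version with a named validator.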