hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".

Designing a Golden Dataset

A golden dataset is the versioned set of inputs with validated reference answers that every iteration is scored against. It converts "the AI feels better" into "the AI scored 91% on v3 of the dataset".

Design principles

Representative coverage beats size. Start with 30–50 cases that mirror real usage distribution: common cases, known hard cases, and edge cases in roughly the proportion users hit them. A 500-case dataset of easy questions inflates scores and hides real weaknesses.
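
One way to turn "roughly the proportion users hit them" into a case plan is to split the case budget by observed usage. The sketch below is illustrative only: the category names and counts are hypothetical, and `allocate_cases` is a helper invented for this example, not part of any library.

```python
def allocate_cases(usage_counts: dict[str, int], total_cases: int) -> dict[str, int]:
    """Split total_cases across categories in proportion to observed usage.

    Uses the largest-remainder method so the allocation always sums to total_cases.
    """
    total_usage = sum(usage_counts.values())
    if total_usage <= 0:
        raise ValueError("usage_counts must contain at least one positive count")
    # Integer arithmetic avoids floating-point rounding surprises.
    allocation = {name: count * total_cases // total_usage for name, count in usage_counts.items()}
    remainders = {name: count * total_cases % total_usage for name, count in usage_counts.items()}
    leftover = total_cases - sum(allocation.values())
    # Hand the remaining slots to the categories with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical usage counts from a sample of production traffic.
plan = allocate_cases({"common": 600, "hard": 300, "edge": 100}, total_cases=40)
print(plan)  # {'common': 24, 'hard': 12, 'edge': 4}
```

A rare category can round down to zero cases under pure proportionality, so known hard and edge cases may still need to be added by hand.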

Reference answers need a named authority. Each reference answer must be validated by someone accountable for correctness: domain experts (e.g. physicians for medical content, lawyers for legal content), the client's own specialists, or both. Record who validated what. An unvalidated reference answer is just another opinion about AI quality.
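
Recording who validated what can be as simple as two extra fields on each case. The following is a minimal sketch, assuming cases are held as Python objects; the `GoldenCase` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    reference_answer: str
    validated_by: Optional[str] = None   # named person accountable for correctness
    validated_on: Optional[date] = None  # date of sign-off


def unvalidated(cases: list[GoldenCase]) -> list[str]:
    """Return the ids of cases that lack a named validator or a sign-off date."""
    return [c.case_id for c in cases if not c.validated_by or c.validated_on is None]


# Hypothetical cases: the first is signed off, the second is still a draft.
cases = [
    GoldenCase("c1", "example input 1", "example reference 1", "Dr. A (hypothetical)", date(2024, 1, 15)),
    GoldenCase("c2", "example input 2", "example reference 2"),
]
print(unvalidated(cases))  # ['c2']
```

A check like this can gate the freeze step: a version with any unvalidated case is not ready to score against.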

Separate the three validation roles:

| Role | Owns |
| --- | --- |
| Domain expert | Factual correctness of reference answers |
| Data scientist / engineer | Scoring consistency and methodology |
| Product manager | Process, sign-off loop, and dispute resolution |

Define the partial-credit policy upfront. Decide before scoring: does an incomplete answer count as a pass? In regulated domains (healthcare, finance, legal), default to strict — "mostly right" is wrong. In creative or exploratory products, graded scoring (0 / 0.5 / 1) is often more informative. Whichever policy applies, write it into the framework so scores stay comparable across iterations.
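
The effect of the policy choice is easy to see when the same grades are aggregated both ways. This is a sketch under the 0 / 0.5 / 1 scale mentioned above; the `Policy` enum, the function names, and the grades are made up for illustration.

```python
from enum import Enum


class Policy(Enum):
    STRICT = "strict"  # anything short of fully correct scores 0
    GRADED = "graded"  # 0 / 0.5 / 1 are kept as awarded


ALLOWED_GRADES = (0.0, 0.5, 1.0)


def apply_policy(grade: float, policy: Policy) -> float:
    """Map a reviewer's grade to the score that counts under the chosen policy."""
    if grade not in ALLOWED_GRADES:
        raise ValueError(f"grade must be one of {ALLOWED_GRADES}, got {grade}")
    if policy is Policy.STRICT:
        return 1.0 if grade == 1.0 else 0.0
    return grade


def dataset_score(grades: list[float], policy: Policy) -> float:
    """Average the per-case scores under one policy."""
    if not grades:
        raise ValueError("no grades to score")
    return sum(apply_policy(g, policy) for g in grades) / len(grades)


grades = [1.0, 0.5, 1.0, 0.0, 0.5]  # hypothetical reviewer grades for five cases
print(dataset_score(grades, Policy.STRICT))  # 0.4
print(dataset_score(grades, Policy.GRADED))  # 0.6
```

The same five answers score 40% or 60% depending on the policy, which is why the policy has to be fixed before results are compared across iterations.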

Version the dataset like code. Any change to cases or reference answers gets a new version number, and scores are only comparable within the same version. When reporting across versions, state the version change explicitly.
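
The "only comparable within the same version" rule can be enforced in tooling rather than left to convention. A minimal sketch, assuming each result carries the dataset version it was scored on; `ScoreReport`, `score_delta`, and the scores shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreReport:
    system_label: str     # e.g. a release or prompt version
    dataset_version: str  # version of the golden dataset used for scoring
    score: float          # fraction of cases passed, 0.0 to 1.0


def score_delta(before: ScoreReport, after: ScoreReport) -> float:
    """Return the score change, refusing to compare across dataset versions."""
    if before.dataset_version != after.dataset_version:
        raise ValueError(
            f"scores are not comparable: {before.dataset_version} vs {after.dataset_version}"
        )
    return after.score - before.score


# Hypothetical results for two prompt versions on the same dataset version.
old = ScoreReport("prompt-a", "v3", 0.88)
new = ScoreReport("prompt-b", "v3", 0.91)
print(f"{score_delta(old, new):+.2f}")  # +0.03
```

Raising an error here is a deliberate design choice: a cross-version comparison then has to be made explicitly, with the version change stated in the report.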

Build sequence

  1. Pull 30–50 real inputs from production usage, support tickets, or client workflows — not invented examples.
  2. Draft reference answers.
  3. Get each reference answer validated by the domain authority; log validator and date.
  4. Agree the scoring method (exact match, rubric, expert review, LLM judge) and the partial-credit policy.
  5. Freeze as v1. Score the current system to establish the baseline before changing anything.
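
The final step, scoring the current system against the frozen v1, might look like the sketch below. It assumes exact match (after light normalization) was the agreed scoring method; `toy_system` is a stand-in for the real pipeline, and the cases are invented placeholders, not real production inputs.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a case."""
    return " ".join(text.casefold().split())


def exact_match(output: str, reference: str) -> bool:
    return normalize(output) == normalize(reference)


def baseline(cases: list[tuple[str, str]], system: Callable[[str], str]) -> float:
    """Return the fraction of (input, reference) cases the system passes."""
    if not cases:
        raise ValueError("dataset is empty")
    passed = sum(exact_match(system(inp), ref) for inp, ref in cases)
    return passed / len(cases)


def toy_system(inp: str) -> str:
    """Hypothetical stand-in for the system under test."""
    answers = {"2+2": "4", "capital of France": " paris "}
    return answers.get(inp, "unknown")


# Placeholder v1 dataset; the last case has no answer in the source data.
v1 = [("2+2", "4"), ("capital of France", "Paris"), ("question with no source answer", "not found")]
print(f"{baseline(v1, toy_system):.0%}")  # 67%
```

The baseline number is only meaningful if it is recorded before any prompt or model change is made.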

Maintenance

  • Add new failure cases from production to the next dataset version — real failures are the most valuable cases you'll ever get.
  • Re-validate reference answers on a set cadence (quarterly is a sensible default) — domain knowledge drifts.
  • Retire cases the system passes trivially for several consecutive iterations, replacing them with harder ones, and note the swap in the version changelog.
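
Finding retirement candidates can be automated from per-iteration results. A small sketch, assuming pass/fail history is kept per case; the threshold of three consecutive passes is an assumption standing in for "several", and the history is made up.

```python
def retirement_candidates(history: dict[str, list[bool]], streak: int = 3) -> list[str]:
    """Return ids of cases that passed in each of the last `streak` iterations."""
    if streak < 1:
        raise ValueError("streak must be at least 1")
    return [
        case_id
        for case_id, results in history.items()
        if len(results) >= streak and all(results[-streak:])
    ]


# Hypothetical pass/fail results, oldest iteration first.
history = {
    "c1": [True, True, True, True],
    "c2": [True, False, True, True],
    "c3": [True, True],  # too little history to judge
}
print(retirement_candidates(history))  # ['c1']
```

A passing streak only flags a candidate. Whether the case is passed trivially, and what harder case replaces it, remains a human decision recorded in the version changelog.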

Common failure modes

  • Dataset written by the same people who built the pipeline — blind spots transfer directly. Include at least one outside validator.
  • Only happy-path cases — score looks great, production disappoints. Deliberately include ambiguous inputs, malformed inputs, and questions with no answer in the source data (the correct output is "not found", and systems that guess instead should fail the case).
  • Moving-target references — reference answers quietly edited to match system output. Versioning and named validators prevent this.
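
Alongside versioning and named validators, a content fingerprint stored with each frozen version makes quiet edits to reference answers detectable. This is an illustrative sketch using Python's standard `hashlib` and `json` modules; the `fingerprint` helper and the case dictionaries are hypothetical.

```python
import hashlib
import json


def fingerprint(cases: list[dict]) -> str:
    """Return a SHA-256 digest of the dataset that is stable under reordering."""
    ordered = sorted(cases, key=lambda case: case["case_id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


frozen_v1 = [
    {"case_id": "c1", "input": "example input 1", "reference": "example reference 1"},
    {"case_id": "c2", "input": "example input 2", "reference": "not found"},
]
recorded = fingerprint(frozen_v1)  # store this next to the version number

# A quiet edit to one reference answer changes the digest.
edited = [dict(case) for case in frozen_v1]
edited[0]["reference"] = "reference rewritten to match system output"
print(fingerprint(edited) == recorded)  # False
```

If the digest computed at scoring time differs from the recorded one, the dataset has changed and needs a new version number before any score is reported.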
