
hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".



report-template.md

AI Quality Evaluation Report — [Product / System name]

Period: [start – end] · Iteration: [N] · Dataset version: [vX] · Owner: [name]

Executive summary

[3 sentences max: current quality vs target, trend vs last iteration, the one decision needed from the reader.]

What we measure and why

  • Unit of evaluation: [single answer / summary / classification / agent run]
  • Definition of "correct": [criteria]
  • Validated by: [domain authority — names/roles, validation date]
  • Golden dataset: [N cases, version, coverage note]
  • Scoring method: [exact match / rubric / expert review / LLM judge] · Partial-credit policy: [strict / graded]
  • Key trade-off: [e.g. optimised for recall because a missed case costs more than a false alarm]
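The scoring method and partial-credit policy above can change the headline number noticeably, so it may help to pin them down in code. The following is a minimal sketch, assuming each golden case lists a set of required items and that "graded" means the fraction of those items present; the `GoldenCase` schema, `score_case`, and `accuracy` are hypothetical names, not part of this template.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    case_id: str
    expected: set[str]  # hypothetical schema: items a correct answer must contain


def score_case(case: GoldenCase, got: set[str], policy: str = "strict") -> float:
    """Score one case in [0, 1] under the chosen partial-credit policy."""
    if not case.expected:
        raise ValueError(f"case {case.case_id} has no expected items")
    matched = len(case.expected & got)
    if policy == "strict":
        # All required items must be present, otherwise the case fails.
        return 1.0 if matched == len(case.expected) else 0.0
    if policy == "graded":
        # Assumption: partial credit is the fraction of required items found.
        return matched / len(case.expected)
    raise ValueError(f"unknown policy: {policy}")


def accuracy(cases: list[GoldenCase], outputs: dict[str, set[str]], policy: str = "strict") -> float:
    """Mean score over the dataset; a missing output counts as an empty answer."""
    scores = [score_case(c, outputs.get(c.case_id, set()), policy) for c in cases]
    return sum(scores) / len(scores)


cases = [GoldenCase("c1", {"a", "b"}), GoldenCase("c2", {"x"})]
outputs = {"c1": {"a"}, "c2": {"x"}}

strict_acc = accuracy(cases, outputs, "strict")  # c1 fails, c2 passes
graded_acc = accuracy(cases, outputs, "graded")  # c1 earns half credit
```

The same outputs score 0.5 under the strict policy and 0.75 under the graded one, which is why the report states the policy next to the scoring method.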

Results

| Metric | Baseline | Previous | Current | Target | Trend |
|---|---|---|---|---|---|
| [Answer accuracy] | [70%] | [85%] | [91%] | [90%] | |
| [Faithfulness] | | | | | |
| [Consistency] | | | | | |

Business translation: [metric change → operational effect → money or risk]
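The Trend column and the gap to target can be derived from the other columns rather than filled in by hand. This is a sketch only, using the placeholder accuracy figures from the table; `trend_row` and its field names are hypothetical, and the "up / down / flat" labels are an assumption about how the Trend column is meant to be filled.

```python
def trend_row(metric: str, baseline: int, previous: int, current: int, target: int) -> dict:
    """Derive percentage-point deltas and a trend label for one metric row.

    All inputs are whole percentages, e.g. 91 for 91%.
    """
    delta = current - previous
    if delta > 0:
        trend = "up"
    elif delta < 0:
        trend = "down"
    else:
        trend = "flat"
    return {
        "metric": metric,
        "vs_baseline_pp": current - baseline,  # change since the first measurement
        "vs_previous_pp": delta,               # change since the last iteration
        "vs_target_pp": current - target,      # positive means target exceeded
        "trend": trend,
        "meets_target": current >= target,
    }


row = trend_row("Answer accuracy", baseline=70, previous=85, current=91, target=90)
```

Reporting deltas in percentage points (pp) rather than relative percent keeps the numbers comparable with the "Measured effect" figures later in the report.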

What changed this iteration

| Intervention | Measured effect |
|---|---|
| [Source-data cleanup] | [+12pp accuracy] |
| [Retrieval tuning] | [+6pp accuracy] |
| [Prompt v4] | [+3pp accuracy] |
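One sanity check worth running before publishing this table is whether the attributed effects add up to the observed change. A rough sketch follows, assuming effects are additive (often only approximately true when interventions interact) and using the placeholder figures from this template; `reconcile_effects` and the 1pp tolerance are hypothetical choices.

```python
def reconcile_effects(
    effects_pp: dict[str, float],
    start_pct: float,
    end_pct: float,
    tolerance_pp: float = 1.0,
) -> dict:
    """Compare the sum of attributed effects with the observed metric change."""
    attributed = sum(effects_pp.values())
    observed = end_pct - start_pct
    unexplained = observed - attributed
    return {
        "attributed_pp": attributed,
        "observed_pp": observed,
        "unexplained_pp": unexplained,
        # Assumption: a gap within tolerance_pp is treated as noise.
        "reconciles": abs(unexplained) <= tolerance_pp,
    }


effects = {"Source-data cleanup": 12, "Retrieval tuning": 6, "Prompt v4": 3}

# Placeholder figures: baseline 70%, previous 85%, current 91%.
vs_baseline = reconcile_effects(effects, start_pct=70, end_pct=91)
vs_previous = reconcile_effects(effects, start_pct=85, end_pct=91)
```

With the placeholder values, the three effects sum to 21pp, which matches the baseline-to-current change but not the previous-to-current change, so the report should state which starting point the effects were measured against.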

Regressions

[Anything that got worse, even slightly. Write "None detected on dataset vX" if clean — never omit this section.]
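Aggregate accuracy can rise while individual cases get worse, so this section is easier to fill in from a per-case comparison than from the headline metric. The sketch below assumes each iteration's results are stored as a mapping from case ID to pass/fail; `find_regressions` and `regressions_section` are hypothetical helpers.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed last iteration and fail now.

    Cases missing from the current run are not flagged here; they would
    need separate handling (for example, a dataset version change).
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and current.get(case_id) is False
    )


def regressions_section(previous: dict[str, bool], current: dict[str, bool], dataset_version: str) -> str:
    """Produce the text for the Regressions section, never leaving it empty."""
    regressed = find_regressions(previous, current)
    if not regressed:
        return f"None detected on dataset {dataset_version}"
    return "Regressed cases: " + ", ".join(regressed)


previous_run = {"c1": True, "c2": False, "c3": False}
current_run = {"c1": False, "c2": True, "c3": True}

# Overall pass rate improved from 1/3 to 2/3, yet c1 regressed.
section_text = regressions_section(previous_run, current_run, "v2")
clean_text = regressions_section(previous_run, previous_run, "v2")
```

Regressed case IDs found this way can be carried straight into the appendix of failed cases.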

Gaps and risks

  • [What isn't measured yet]
  • [Dataset blind spots]
  • [Drift risks — source data, model version, usage patterns]

Recommendations

  1. [Action] — Owner: [name] — Expected impact: [estimate]
  2. [Action] — Owner: [name] — Expected impact: [estimate]

Appendix: failed cases this iteration

| Case ID | Input summary | Expected | Got | Failure type |
|---|---|---|---|---|
