
hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".



references/metrics.md

Metric Selection and Business Translation

Pick metrics by task type

| Task type | Primary metrics | Watch out for |
| --- | --- | --- |
| Question answering / RAG | Answer accuracy, faithfulness (grounded in sources), completeness | High accuracy with low faithfulness = confident hallucination |
| Classification | Precision, recall, F1 | Report per-class, not just aggregate: one rare class can hide total failure |
| Summarisation | Faithfulness, coverage of key points, length compliance | Fluency is not quality; a fluent wrong summary is worse than a clumsy right one |
| Extraction | Field-level precision/recall, format validity | Aggregate scores hide which fields consistently fail |
| Agent / multi-step | Task completion rate, step accuracy, recovery rate after errors | End-to-end success can mask fragile intermediate steps |
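The classification warning above can be made concrete with a small sketch. The data is made up for illustration (95 "ok" items, 5 "fraud" items, and a model that always predicts "ok"), and `per_class_prf` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this item
            fn[truth] += 1  # true label was missed
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report


# Illustrative data only: a rare class that the model never predicts.
y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_prf(y_true, y_pred)
# accuracy is 0.95, while report["fraud"] is (0.0, 0.0, 0.0)
```

The aggregate looks healthy at 95%, but every rare-class case is missed, which is why the per-class breakdown belongs in the report.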

For any LLM-based system, also track consistency: whether the same input produces the same output across repeated runs. In regulated or client-facing products, reproducibility is a feature, so deterministic settings (temperature 0, fixed seeds where available) should be recorded alongside scores because they change what the numbers mean.
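One way to measure consistency is to run each input several times and count agreement. The sketch below assumes exact string matching is an acceptable notion of "same output"; free-text answers may need normalisation or a semantic comparison instead. The function names, the sample runs, and the fields in `run_record` are all hypothetical.

```python
from collections import Counter


def consistency_rate(outputs):
    """Fraction of repeated runs that match the most common output for one input."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def dataset_consistency(runs_by_input):
    """Share of inputs whose repeated runs all produced the identical output."""
    fully_consistent = sum(1 for outs in runs_by_input.values() if len(set(outs)) == 1)
    return fully_consistent / len(runs_by_input)


# Illustrative data: three runs per input.
runs = {
    "q1": ["Paris", "Paris", "Paris"],
    "q2": ["42", "42", "41"],
}

# Store the generation settings next to the score so the number can be interpreted later.
run_record = {
    "temperature": 0,   # assumed setting, recorded rather than enforced here
    "seed": 1234,       # only meaningful where the provider supports seeds
    "runs_per_input": 3,
    "dataset_consistency": dataset_consistency(runs),
}
```

Here `q1` is fully consistent and `q2` is not, so half of the inputs count as consistent at the dataset level.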

Precision vs recall — decide which failure is cheaper

Every threshold trades false positives against false negatives. Make the trade-off explicit in the report:

  • Medical screening: missed case (low recall) is catastrophic; false alarm is a review cost → optimise recall.
  • Compliance flagging: same logic — err toward recall.
  • Content recommendation: irrelevant result is cheap → precision matters more.

State the chosen trade-off and its rationale in "What we measure and why". A number without its trade-off invites the wrong optimisation.
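The trade-off can be written down as explicit costs and used to pick a threshold. This is a minimal sketch under stated assumptions: the scores, labels, candidate thresholds, and cost figures are invented for illustration, and real costs would come from the business owner.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at a given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def pick_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))


# Illustrative model scores and ground-truth labels (1 = positive case).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
candidates = [0.2, 0.5, 0.7]

# Screening-style costs: a missed case is far more expensive than a false alarm.
screening = pick_threshold(scores, labels, candidates, cost_fp=1, cost_fn=50)

# Recommendation-style costs: an irrelevant result costs more than a missed one.
recommendation = pick_threshold(scores, labels, candidates, cost_fp=5, cost_fn=1)
```

With these made-up numbers the screening costs select the lowest threshold (0.2, favouring recall) and the recommendation costs select the highest (0.7, favouring precision). The same model yields different operating points once the costs are stated.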

Translate technical metrics into business impact

Every technical number in a report needs a business consequence next to it. Executives fund consequences, not F1 scores.

Translation pattern: [metric change] → [operational effect] → [money or risk]

Examples:

  • "Accuracy 70% → 91% means expert reviewers now correct ~1 in 11 answers instead of roughly 1 in 3, so correction workload drops roughly 70%."
  • "Recall improved 82% → 94%: an estimated 12 more of every 100 relevant cases are caught before they reach the client."
  • "Consistency at 100% (deterministic settings): the client never sees two different answers to the same question — a contractual requirement in this engagement."

Avoid unanchored claims ("quality improved significantly"). If the business effect can't be estimated yet, say so and make estimating it a recommendation.
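The arithmetic behind the first two examples can be checked with a short sketch. It assumes reviewer effort scales with the number of answers that need correcting; if reviewers must read every answer regardless, the time saving is smaller. The function names are hypothetical.

```python
def review_load_change(acc_before, acc_after):
    """Translate an accuracy change into 'one in N needs correction' and relative drop."""
    err_before = 1 - acc_before
    err_after = 1 - acc_after
    return {
        "one_in_before": 1 / err_before,  # about every Nth answer was wrong before
        "one_in_after": 1 / err_after,    # about every Nth answer is wrong now
        "relative_drop": (err_before - err_after) / err_before,
    }


def extra_caught_per_100(recall_before, recall_after):
    """Additional relevant cases caught per 100 relevant cases."""
    return (recall_after - recall_before) * 100


accuracy_effect = review_load_change(0.70, 0.91)
recall_effect = extra_caught_per_100(0.82, 0.94)
# accuracy_effect: errors go from about 1 in 3.3 to about 1 in 11, a relative drop of about 70%
# recall_effect: about 12 more cases caught per 100 relevant cases
```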

Reporting rules

  • Always show baseline → previous → current → target. Trends persuade; snapshots don't.
  • Attribute changes to interventions (data cleanup, retrieval change, prompt version) — otherwise the team learns nothing about what works.
  • Report cost per evaluation run and latency alongside quality when they constrain the product; a 2-point accuracy gain that triples latency is a product decision, not a win.
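A report row that follows these rules could be generated as below. This is only a sketch: `MetricTrend` and its fields are hypothetical, and the previous value, target, and intervention name are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricTrend:
    name: str
    baseline: float
    previous: float
    current: float
    target: float
    intervention: str  # what changed between the previous and current run

    def line(self) -> str:
        """Render baseline -> previous -> current -> target with attribution."""
        delta_pts = (self.current - self.previous) * 100
        gap_pts = (self.target - self.current) * 100
        return (
            f"{self.name}: {self.baseline:.0%} -> {self.previous:.0%} -> "
            f"{self.current:.0%} (target {self.target:.0%}); "
            f"{delta_pts:+.0f} pts since previous, attributed to {self.intervention}; "
            f"{gap_pts:.0f} pts to target"
        )


# Illustrative values only.
trend = MetricTrend("Answer accuracy", 0.70, 0.85, 0.91, 0.95, "prompt v3")
text = trend.line()
```

Keeping the intervention as a required field makes it hard to publish a number without saying what caused the change.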
