hannaklim/ai-eval-report

Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".

Metric Selection and Business Translation

Name: hannaklim/ai-eval-report
Rating: 80 (1 reviews)
Author: hannaklim

Pick metrics by task type

Task type	Primary metrics	Watch out for
Question answering / RAG	Answer accuracy, faithfulness (grounded in sources), completeness	High accuracy with low faithfulness = confident hallucination
Classification	Precision, recall, F1	Report per-class, not just aggregate: an aggregate score can hide total failure on one rare class
Summarisation	Faithfulness, coverage of key points, length compliance	Fluency is not quality; a fluent wrong summary is worse than a clumsy right one
Extraction	Field-level precision/recall, format validity	Aggregate scores hide which fields consistently fail
Agent / multi-step	Task completion rate, step accuracy, recovery rate after errors	End-to-end success can mask fragile intermediate steps
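
The per-class warning in the classification row can be made concrete with a small sketch. The labels ("ok", "fraud"), the 95/5 split, and the helper name `per_class_report` are illustrative assumptions, not part of this skill; the point is only that a 95% aggregate accuracy can coexist with zero recall on the rare class.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[t] += 1  # true label gets a false negative
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report

# Hypothetical data: 95 "ok" items, 5 "fraud" items, and a model that always says "ok".
y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_report(y_true, y_pred)

print(f"aggregate accuracy: {accuracy:.0%}")
for label, (precision, recall, f1) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the aggregate looks healthy while the "fraud" row shows precision, recall, and F1 all at zero, which is the row a stakeholder actually needs to see.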

For any LLM-based system, also track consistency: does the same input produce the same output across repeated runs? In regulated or client-facing products, reproducibility is a feature. Record deterministic settings (temperature 0, fixed seeds where available) alongside scores, because they change what the numbers mean. These settings may not guarantee identical outputs on every platform, so measure consistency rather than assume it.
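
One way to measure consistency is to replay each input several times and count how often the outputs agree. The sketch below assumes exact string matching is the right notion of "same output" (a real product may need normalisation first); the function names, sample answers, and the `run_settings` dictionary are hypothetical.

```python
from collections import Counter

def consistency_rate(outputs_by_input):
    """Fraction of inputs whose repeated runs all produced the identical output."""
    if not outputs_by_input:
        raise ValueError("no inputs to score")
    stable = sum(1 for outputs in outputs_by_input.values() if len(set(outputs)) == 1)
    return stable / len(outputs_by_input)

def modal_agreement(outputs):
    """Share of runs that match the most common output for a single input."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Hypothetical results: each input was sent three times.
runs = {
    "q1": ["Paris", "Paris", "Paris"],
    "q2": ["42", "42", "forty-two"],
}

# Settings stored next to the score so the number can be interpreted later.
run_settings = {"temperature": 0, "seed": 1234, "repeats": 3}

overall = consistency_rate(runs)
q2_agreement = modal_agreement(runs["q2"])
print(f"consistency: {overall:.0%} at settings {run_settings}")
print(f"q2 modal agreement: {q2_agreement:.0%}")
```

Reporting both numbers helps: the first says how many inputs are fully stable, the second says how badly an unstable input varies.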

Precision vs recall — decide which failure is cheaper

Every threshold trades false positives against false negatives. Make the trade-off explicit in the report:

Medical screening: missed case (low recall) is catastrophic; false alarm is a review cost → optimise recall.
Compliance flagging: same logic — err toward recall.
Content recommendation: a missed relevant item is cheap because the user never sees it, while irrelevant results are visible → precision matters more.

State the chosen trade-off and its rationale in "What we measure and why". A number without its trade-off invites the wrong optimisation.
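
The trade-off can be written down as explicit costs per false positive and false negative, and the threshold chosen to minimise total cost. This is a minimal sketch under assumed numbers: the scores, labels, candidate thresholds, and cost weights are all made up for illustration, and `pick_threshold` is a hypothetical helper.

```python
def confusion_at(threshold, scores, labels):
    """Count (tp, fp, fn) when items scoring >= threshold are flagged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

def pick_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return (threshold, total_cost) minimising cost_fp * FP + cost_fn * FN."""
    best = None
    for t in candidates:
        _, fp, fn = confusion_at(t, scores, labels)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Hypothetical model scores and ground-truth labels (1 = relevant / positive).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [0.25, 0.50, 0.75]

# Screening-style case: a miss is assumed 50x worse than a false alarm.
screening = pick_threshold(scores, labels, cost_fp=1, cost_fn=50, candidates=candidates)

# Recommender-style case where precision matters more: false positives weigh more.
recommender = pick_threshold(scores, labels, cost_fp=5, cost_fn=1, candidates=candidates)

print("screening threshold:", screening)
print("recommender threshold:", recommender)
```

With these weights the screening case settles on the lowest threshold (favouring recall) and the recommender case on the highest (favouring precision). The cost weights themselves are the rationale worth stating in the report.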

Translate technical metrics into business impact

Every technical number in a report needs a business consequence next to it. Executives fund consequences, not F1 scores.

Translation pattern: [metric change] → [operational effect] → [money or risk]

Examples:

"Accuracy 70% → 91% means expert reviewers now correct ~1 in 11 answers instead of 1 in 3 — reviewer time drops roughly 70%."
"Recall improved 82% → 94%: of every 100 relevant cases, an estimated 12 more are caught before they reach the client."
"Consistency at 100% (deterministic settings): the client never sees two different answers to the same question — a contractual requirement in this engagement."

Avoid unanchored claims ("quality improved significantly"). If the business effect can't be estimated yet, say so and make estimating it a recommendation.
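
The arithmetic behind the first example (accuracy 70% → 91%) can be reproduced in a few lines. The sketch assumes that every wrong answer needs exactly one correction and that reviewer time scales linearly with the number of corrections; both are simplifications, and the function names are hypothetical.

```python
def review_load(accuracy):
    """Return (error_rate, n) meaning roughly 1 in n answers needs correction."""
    error_rate = 1.0 - accuracy
    one_in_n = round(1.0 / error_rate) if error_rate > 0 else None
    return error_rate, one_in_n

def relative_reduction(before_accuracy, after_accuracy):
    """Relative drop in the error rate between two accuracy levels."""
    before_err = 1.0 - before_accuracy
    after_err = 1.0 - after_accuracy
    return (before_err - after_err) / before_err

before = review_load(0.70)   # error rate 30%, about 1 in 3
after = review_load(0.91)    # error rate 9%, about 1 in 11
saving = relative_reduction(0.70, 0.91)

sentence = (
    f"Accuracy {0.70:.0%} -> {0.91:.0%}: reviewers correct ~1 in {after[1]} answers "
    f"instead of 1 in {before[1]}; correction workload drops roughly {saving:.0%}."
)
print(sentence)
```

The error rate falls from 30% to 9%, a relative reduction of 21/30 = 70%, which is where the "roughly 70%" figure comes from.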

Reporting rules

Always show baseline → previous → current → target. Trends persuade; snapshots don't.
Attribute changes to interventions (data cleanup, retrieval change, prompt version) — otherwise the team learns nothing about what works.
Report cost per evaluation run and latency alongside quality when they constrain the product; a 2-point accuracy gain that triples latency is a product decision, not a win.
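
These rules could be encoded as a small report row so that no metric is printed without its trend and its attributed intervention. Everything below is an illustrative assumption: the `MetricTrend` class, the sample values, and the `max_ratio` cut-off of 1.5 used to flag a latency increase for a product decision.

```python
from dataclasses import dataclass

@dataclass
class MetricTrend:
    name: str
    baseline: float
    previous: float
    current: float
    target: float
    intervention: str  # what changed since the previous run

    def delta_points(self):
        """Change since the previous run, in percentage points."""
        return (self.current - self.previous) * 100

    def gap_to_target(self):
        """Remaining distance to the target, as a fraction."""
        return self.target - self.current

    def line(self):
        return (
            f"{self.name}: {self.baseline:.0%} -> {self.previous:.0%} -> "
            f"{self.current:.0%} (target {self.target:.0%}); "
            f"{self.delta_points():+.0f} pts attributed to {self.intervention}"
        )

def needs_product_decision(latency_before_ms, latency_after_ms, max_ratio=1.5):
    """Flag a change whose latency growth exceeds an agreed ratio (assumed 1.5x)."""
    return latency_after_ms / latency_before_ms > max_ratio

# Hypothetical numbers for one metric across four reference points.
trend = MetricTrend(
    name="Answer accuracy",
    baseline=0.70,
    previous=0.82,
    current=0.91,
    target=0.95,
    intervention="retrieval change",
)
print(trend.line())

# A change that triples latency (800 ms -> 2400 ms) gets flagged regardless of the gain.
flagged = needs_product_decision(800, 2400)
print("escalate as product decision:", flagged)
```

Making `intervention` a required field is a deliberate choice here: a row cannot be built without saying what changed.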