Generates AI quality evaluation reports for LLM and ML-powered products — designs golden datasets, defines accuracy metrics, tracks quality across iterations, and produces stakeholder-ready summaries that explain probabilistic behaviour in business language. Use when evaluating AI or LLM output quality, building an eval framework or golden dataset, benchmarking accuracy between releases or prompt versions, reporting AI quality to clients or executives, or when the user asks "how good is our AI", "accuracy report", "eval results", "benchmark the model", or "why does the AI give different answers to the same question".
Period: [start – end] · Iteration: [N] · Dataset version: [vX] · Owner: [name]
[3 sentences max: current quality vs target, trend vs last iteration, the one decision needed from the reader.]
| Metric | Baseline | Previous | Current | Target | Trend |
|---|---|---|---|---|---|
| [Answer accuracy] | [70%] | [85%] | [91%] | [90%] | ↑ |
| [Faithfulness] | | | | | |
| [Consistency] | | | | | |
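
The Trend column can be derived from the Previous and Current values rather than filled in by hand. A minimal sketch, assuming percentages are stored as plain numbers; the `MetricRow` type, the `trend` and `render_row` helpers, and the 0.5pp tolerance are illustrative assumptions, not part of the template:

```python
from dataclasses import dataclass


@dataclass
class MetricRow:
    """One row of the metrics table; all values are percentages (hypothetical type)."""
    name: str
    baseline: float
    previous: float
    current: float
    target: float


def trend(row: MetricRow, tolerance_pp: float = 0.5) -> str:
    """Compare current vs previous; changes within the tolerance count as flat.

    The 0.5pp default is an assumed noise threshold, not a recommended value.
    """
    delta = row.current - row.previous
    if delta > tolerance_pp:
        return "↑"
    if delta < -tolerance_pp:
        return "↓"
    return "→"


def render_row(row: MetricRow) -> str:
    """Format the row in the same column order as the metrics table."""
    return (
        f"| {row.name} | {row.baseline:.0f}% | {row.previous:.0f}% "
        f"| {row.current:.0f}% | {row.target:.0f}% | {trend(row)} |"
    )


# Values taken from the placeholder row in the table above.
accuracy = MetricRow("Answer accuracy", 70, 85, 91, 90)
print(render_row(accuracy))
```

The tolerance should reflect the run-to-run noise of the eval itself: a small dataset or a non-deterministic model may need a wider band before a change counts as a real trend.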
Business translation: [metric change → operational effect → money or risk]
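
One way to work through the chain metric change → operational effect → money is a small calculation. This is only a sketch: the function name, the monthly query volume, and the cost per wrong answer are hypothetical placeholders to be replaced with figures from the product being reported on.

```python
def business_translation(delta_pp: float, monthly_queries: int, cost_per_error: float):
    """Translate an accuracy change (in percentage points) into fewer errors and money.

    Assumes every query is equally likely to be affected and that each wrong
    answer carries the same average cost; both are simplifications.
    """
    fewer_errors = monthly_queries * delta_pp / 100
    savings = fewer_errors * cost_per_error
    return fewer_errors, savings


# Hypothetical inputs: +6pp accuracy, 10,000 queries per month, 4.00 per wrong answer.
fewer, saved = business_translation(delta_pp=6, monthly_queries=10_000, cost_per_error=4.0)
print(f"+6pp accuracy -> about {fewer:.0f} fewer wrong answers per month -> about {saved:,.0f} saved")
```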
| Intervention | Measured effect |
|---|---|
| [Source-data cleanup] | [+12pp accuracy] |
| [Retrieval tuning] | [+6pp accuracy] |
| [Prompt v4] | [+3pp accuracy] |
[Anything that got worse, even slightly. Write "None detected on dataset vX" if clean — never omit this section.]
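
Case-level regressions can be found by comparing per-case pass/fail results from the previous and current iterations on the same dataset version. A minimal sketch, assuming results are stored as a mapping from case ID to a pass flag; `find_regressions`, `regression_note`, and the sample case IDs are illustrative names, and metric-level regressions would still need a separate check:

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed in the previous run and fail in the current one.

    Cases missing from the current run are ignored here; a stricter check
    could treat them as regressions too.
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and current.get(case_id) is False
    )


def regression_note(regressions: list[str], dataset_version: str) -> str:
    """Produce the text for the regressions section, which is never left empty."""
    if not regressions:
        return f"None detected on dataset {dataset_version}"
    return (
        f"{len(regressions)} case(s) regressed on dataset {dataset_version}: "
        + ", ".join(regressions)
    )


# Hypothetical per-case results for two iterations.
previous_run = {"case-001": True, "case-002": True, "case-003": False}
current_run = {"case-001": True, "case-002": False, "case-003": False}

regressed = find_regressions(previous_run, current_run)
print(regression_note(regressed, "v3"))
```

Comparing case by case matters because an aggregate score can rise while individual cases that used to pass start failing.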
| Case ID | Input summary | Expected | Got | Failure type |
|---|---|---|---|---|