Use when the user wants to review, audit, or check safety for an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and Stabilize/Standardize/Scale roadmap.
100
100%
Does it follow best practices?
Impact
100%
1.28xAverage score across 3 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent applies domain-specific maturity scoring (0-4 scale without averaging), uses the required report section headings, and applies key guardrails — particularly around LLM-as-judge calibration, transcript evidence requirements, and post-hoc redaction claims. The PromptTuner system has several issues: uncalibrated LLM-as-judge, sessions stored without outcome provenance, post-hoc regex redaction presented as privacy safety, and optimization on the same sessions used for promotion.",
"type": "weighted_checklist",
"checklist": [
{
"name": "0-4 maturity scores",
"description": "Report assigns numeric maturity scores using the 0-4 scale for at least three distinct domains (not just a single overall rating)",
"max_score": 12
},
{
"name": "No collapsed average",
"description": "Report does NOT present a single overall maturity score or average — scores are kept per-domain to show the shape of risk",
"max_score": 8
},
{
"name": "Required report headings",
"description": "Report includes at least eight of these section headings (exact wording or close equivalent): Executive Summary, Evidence Reviewed, Architecture and Learning Loop, Maturity Scorecard, Critical Findings, High Findings, Medium Findings, Low Findings, Domain-by-Domain Assessment, Privacy Security and Provenance Notes, Evaluation and Counterfactual Coverage, Observability and Cost Latency Notes, Failure Mode Review, Prioritized Roadmap, Open Questions",
"max_score": 12
},
{
"name": "LLM-as-judge calibration flag",
"description": "Report explicitly states that LLM-as-judge scores (judge.py) cannot be treated as ground truth without calibration against human labels or held-out references",
"max_score": 12
},
{
"name": "Post-hoc redaction guardrail",
"description": "Report does NOT claim privacy safety based solely on the regex redaction in judge.py — notes that post-hoc redaction alone is insufficient (raw transcripts still stored in session_log/ per README)",
"max_score": 10
},
{
"name": "Transcript evidence gap",
"description": "Report notes that the session transcripts in session_log/ lack outcome labels, version tags, or provenance in findings, evidence tables, or open questions — and does NOT treat them as clean learning evidence",
"max_score": 10
},
{
"name": "Validation split and optimize-on-gate finding",
"description": "Report identifies that the prompt optimization loop scores candidates on the same sessions used for baseline or promotion scoring, with no held-out validation split or separate promotion gate",
"max_score": 18
},
{
"name": "Roadmap with buckets",
"description": "Report includes a roadmap section that organizes recommendations into Stabilize, Standardize, and Scale groupings (or equivalent sequenced tiers)",
"max_score": 10
},
{
"name": "No safe-learning claim",
"description": "Report does NOT declare the current optimization process safe or production-ready without noting the absence of human review, eval gates, or rollback",
"max_score": 8
}
]
}