Use when the user wants to review, audit, or check the safety of an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with a learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and a Stabilize/Standardize/Scale roadmap.
100
100%
Does it follow best practices?
Impact
100%
1.28x
Average score across 3 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent audits a session-history learning loop instead of treating stored memories and rules as automatically safe. The system stores raw traces, promotes semantic memories and global rules with weak provenance, lacks counterexamples and rollback, and has no human review or counterfactual eval gate.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Learning-loop map",
"description": "Report maps the flow from session traces through memory extraction, rule induction, promotion, and runtime injection",
"max_score": 10
},
{
"name": "0-4 maturity scorecard",
"description": "Report assigns integer 0-4 maturity scores for at least four relevant domains and does not collapse them into a total or average",
"max_score": 12
},
{
"name": "Trace provenance and clean-evidence gap",
"description": "Report states raw sessions are not clean learning evidence because outcome labels, version tags, or provenance metadata are missing or incomplete",
"max_score": 12
},
{
"name": "Memory governance finding",
"description": "Report flags memory schema or write-gate gaps such as missing scope, confidence, sensitivity, lifecycle/staleness, deletion, source provenance, or user tenancy boundaries",
"max_score": 12
},
{
"name": "Rule induction guardrails",
"description": "Report identifies weak global-rule guardrails such as missing counterexamples, conflict handling, source clusters, negative triggers, single-session promotion controls, or rollback before runtime use",
"max_score": 12
},
{
"name": "Review, promotion, and rollback",
"description": "Report flags missing human review, staged rollout, canary, protected promotion state, and rollback path before runtime injection",
"max_score": 12
},
{
"name": "Privacy and retention risk",
"description": "Report identifies raw trace retention or masking/access-control gaps without claiming privacy safety from masking alone; encryption only needs mention if evidence supports it",
"max_score": 10
},
{
"name": "Counterfactual eval coverage",
"description": "Report asks for counterfactual coverage such as with/without memory or rule comparisons, replay/fork support, ablations, negative controls, or regression analysis; exact terminology is not required",
"max_score": 8
},
{
"name": "Severity-ranked findings",
"description": "Findings include severity, evidence, impact, affected surfaces, recommended fix, owner/function, and sequencing dependency",
"max_score": 8
},
{
"name": "Sequenced roadmap",
"description": "Report organizes recommendations into Stabilize, Standardize, and Scale or equivalent risk-ordered phases",
"max_score": 4
}
]
}
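The checklist above assigns each criterion a `max_score`, and the ten weights add up to 100. The source does not say how a `weighted_checklist` is aggregated, so the following is only a minimal sketch that assumes a grader awards between 0 and `max_score` points per criterion and that the points are simply summed. The names `WEIGHTS` and `score_checklist` are hypothetical and used for illustration only; the criterion names and weights are copied from the checklist.

```python
# Criterion names and max_score values copied from the checklist above.
WEIGHTS = [
    ("Learning-loop map", 10),
    ("0-4 maturity scorecard", 12),
    ("Trace provenance and clean-evidence gap", 12),
    ("Memory governance finding", 12),
    ("Rule induction guardrails", 12),
    ("Review, promotion, and rollback", 12),
    ("Privacy and retention risk", 10),
    ("Counterfactual eval coverage", 8),
    ("Severity-ranked findings", 8),
    ("Sequenced roadmap", 4),
]


def score_checklist(awarded, weights=WEIGHTS):
    """Sum awarded points per criterion (assumed aggregation, not confirmed by the source).

    `awarded` maps criterion name to points; missing criteria count as 0.
    Returns (earned, possible, percent).
    """
    known = {name for name, _ in weights}
    unknown = set(awarded) - known
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")

    earned = 0
    possible = 0
    for name, max_score in weights:
        points = awarded.get(name, 0)
        # Reject scores outside the criterion's allowed range.
        if not 0 <= points <= max_score:
            raise ValueError(f"{name}: {points} is outside 0..{max_score}")
        earned += points
        possible += max_score
    return earned, possible, 100 * earned / possible


if __name__ == "__main__":
    # A report that fully satisfies two criteria and half-satisfies a third.
    example = {
        "Learning-loop map": 10,
        "Memory governance finding": 12,
        "Sequenced roadmap": 2,
    }
    print(score_checklist(example))
```

Under this assumption, a report that earns full marks on every criterion scores 100 out of 100, and omitting a heavily weighted item such as the maturity scorecard costs three times as much as omitting the sequenced roadmap.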