Use when the user wants to review, audit, or check the safety of an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with a learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and a Stabilize/Standardize/Scale roadmap.
100
100%
Does it follow best practices?
Impact
100%
1.28x Average score across 3 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent audits generated executable skills as a safety-critical learning artifact. The system creates skills from successful sessions, allows broad tools, executes generated scripts during validation, publishes to a shared registry without human review, and has weak provenance/eval gates.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Lifecycle and maturity map",
"description": "Report maps the generated skill lifecycle from source sessions through packaging, validation, registry publish, install/runtime use, or explicitly marks missing lifecycle evidence as a gap, with per-domain 0-4 maturity scores",
"max_score": 12
},
{
"name": "Executable sandbox risk",
"description": "Report flags generated scripts executing without a clear sandbox or with broad filesystem, network, package install, shell, or tool access; quantifying the host risk surface is not required",
"max_score": 15
},
{
"name": "Provenance metadata",
"description": "Report identifies weak provenance metadata such as source trace lineage, artifact hashes/signatures, model or prompt versioning, generated script provenance, or supply-chain tamper risk; dependency pinning is optional unless package evidence is present",
"max_score": 12
},
{
"name": "Registry is not verification",
"description": "Report states clearly that registry presence or manifest metadata is not proof the generated skill is verified or safe",
"max_score": 10
},
{
"name": "Eval gate coverage",
"description": "Report requires at least two concrete pre-promotion eval gates such as held-out tasks, replay tests, paired with/without evals, negative-trigger tests, syntax or static checks, CI integration, or separate promotion gates",
"max_score": 13
},
{
"name": "Human review and rollback",
"description": "Report flags missing reviewer queue, approval matrix, staged rollout, canary, deprecation state, and rollback path",
"max_score": 12
},
{
"name": "Trigger and activation safety",
"description": "Report assesses activation safety such as missing trigger definitions or quality, positive/negative trigger tests, conflict handling, over-activation, activation policy, or deprecation metadata; quantitative trigger metrics are not required",
"max_score": 8
},
{
"name": "Deployment controls",
"description": "Report covers execution/deployment boundaries or controls such as local validator or agent-host risk, tool-permission hardening, latest-install risk, version pinning, CI gates, canary, staged rollout, observability, or rollback; the same control may be covered in another finding",
"max_score": 8
},
{
"name": "Severity-ranked findings",
"description": "Findings include severity, evidence, impact, affected surfaces, recommended fix, owner/function, and sequencing dependency",
"max_score": 6
},
{
"name": "Stabilize before scale roadmap",
"description": "Roadmap prioritizes sandboxing, provenance, review, eval gates, and rollback before registry expansion or optimization",
"max_score": 4
}
]
}
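
The rubric above does not say how a "weighted_checklist" is turned into a final score. As a minimal sketch, assuming each criterion earns between 0 and its "max_score" points and the result is the share of available points earned, the total could be computed as below. The function name `score_report`, the `awarded` mapping, and the three-criterion subset are hypothetical and only illustrate the arithmetic; the names and "max_score" values are copied from the rubric. In the full rubric the ten "max_score" values sum to 100, so under this assumption the raw point total would already read as a percentage.

```python
import json

# Hypothetical subset of the rubric; names and max_score values are copied from it.
RUBRIC_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Executable sandbox risk", "max_score": 15},
    {"name": "Registry is not verification", "max_score": 10},
    {"name": "Stabilize before scale roadmap", "max_score": 4}
  ]
}
"""


def score_report(rubric: dict, awarded: dict) -> float:
    """Return the percentage of available points earned.

    Assumption (not stated in the rubric): each criterion earns between 0 and
    its max_score, and criteria missing from `awarded` earn 0.
    """
    total = 0
    available = 0
    for item in rubric["checklist"]:
        cap = item["max_score"]
        points = awarded.get(item["name"], 0)
        if not 0 <= points <= cap:
            raise ValueError(f"{item['name']}: {points} is outside 0..{cap}")
        total += points
        available += cap
    if available == 0:
        raise ValueError("rubric has no scorable criteria")
    return 100.0 * total / available


rubric = json.loads(RUBRIC_JSON)
# Hypothetical grader output: full marks on one criterion, partial on another.
awarded = {"Executable sandbox risk": 15, "Registry is not verification": 5}
print(f"{score_report(rubric, awarded):.1f}%")  # 20 of 29 points -> 69.0%
```

Rejecting out-of-range points instead of clamping them is a design choice made here so that a grader bug surfaces as an error and does not silently inflate or deflate the score.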