
sharaf/llm-learning-system-auditor

Use when the user wants to review, audit, or check safety for an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and Stabilize/Standardize/Scale roadmap.

Score: 100 (1.28x)
Quality: 100% (Does it follow best practices?)
Impact: 100%, 1.28x (average score across 3 eval scenarios)
Security (by Snyk): Passed, no known issues


evals/scenario-2/criteria.json

{
  "context": "Tests whether the agent audits generated executable skills as a safety-critical learning artifact. The system creates skills from successful sessions, allows broad tools, executes generated scripts during validation, publishes to a shared registry without human review, and has weak provenance/eval gates.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Lifecycle and maturity map",
      "description": "Report maps the generated skill lifecycle from source sessions through packaging, validation, registry publish, and install/runtime use, or explicitly marks missing lifecycle evidence as a gap, and gives per-domain 0-4 maturity scores",
      "max_score": 12
    },
    {
      "name": "Executable sandbox risk",
      "description": "Report flags generated scripts executing without a clear sandbox or with broad filesystem, network, package install, shell, or tool access; quantifying the host risk surface is not required",
      "max_score": 15
    },
    {
      "name": "Provenance metadata",
      "description": "Report identifies weak provenance metadata such as source trace lineage, artifact hashes/signatures, model or prompt versioning, generated script provenance, or supply-chain tamper risk; dependency pinning is optional unless package evidence is present",
      "max_score": 12
    },
    {
      "name": "Registry is not verification",
      "description": "Report states clearly that registry presence or manifest metadata is not proof the generated skill is verified or safe",
      "max_score": 10
    },
    {
      "name": "Eval gate coverage",
      "description": "Report requires at least two concrete pre-promotion eval gates such as held-out tasks, replay tests, paired with/without evals, negative-trigger tests, syntax or static checks, CI integration, or separate promotion gates",
      "max_score": 13
    },
    {
      "name": "Human review and rollback",
      "description": "Report flags missing reviewer queue, approval matrix, staged rollout, canary, deprecation state, and rollback path",
      "max_score": 12
    },
    {
      "name": "Trigger and activation safety",
      "description": "Report assesses activation safety such as missing trigger definitions or quality, positive/negative trigger tests, conflict handling, over-activation, activation policy, or deprecation metadata; quantitative trigger metrics are not required",
      "max_score": 8
    },
    {
      "name": "Deployment controls",
      "description": "Report covers execution/deployment boundaries or controls such as local validator or agent-host risk, tool-permission hardening, latest-install risk, version pinning, CI gates, canary, staged rollout, observability, or rollback; the same control may be covered in another finding",
      "max_score": 8
    },
    {
      "name": "Severity-ranked findings",
      "description": "Findings include severity, evidence, impact, affected surfaces, recommended fix, owner/function, and sequencing dependency",
      "max_score": 6
    },
    {
      "name": "Stabilize before scale roadmap",
      "description": "Roadmap prioritizes sandboxing, provenance, review, eval gates, and rollback before registry expansion or optimization",
      "max_score": 4
    }
  ]
}
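
The `max_score` values in the checklist above sum to 100, so a report's awarded points can be read directly as a percentage. The following is a minimal sketch of how a `weighted_checklist` might be aggregated. The scoring rule (sum of per-item points, clamped to each item's `max_score`, with unscored items counting as zero) is an assumption, as are the helper names `total_max` and `score_report`; the source does not specify how the evaluator combines items.

```python
from typing import Dict, List

# Names and max_score values copied from the checklist above.
CHECKLIST: List[dict] = [
    {"name": "Lifecycle and maturity map", "max_score": 12},
    {"name": "Executable sandbox risk", "max_score": 15},
    {"name": "Provenance metadata", "max_score": 12},
    {"name": "Registry is not verification", "max_score": 10},
    {"name": "Eval gate coverage", "max_score": 13},
    {"name": "Human review and rollback", "max_score": 12},
    {"name": "Trigger and activation safety", "max_score": 8},
    {"name": "Deployment controls", "max_score": 8},
    {"name": "Severity-ranked findings", "max_score": 6},
    {"name": "Stabilize before scale roadmap", "max_score": 4},
]


def total_max(checklist: List[dict]) -> int:
    """Sum of the maximum points available across all checklist items."""
    return sum(item["max_score"] for item in checklist)


def score_report(checklist: List[dict], awarded: Dict[str, float]) -> float:
    """Return the report's score as a percentage of the available points.

    `awarded` maps a checklist item name to the points a grader gave it.
    Hypothetical rule: missing items score 0, and each item is clamped to
    the range [0, max_score] so one item cannot dominate the total.
    """
    earned = 0.0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        earned += min(max(points, 0), item["max_score"])
    return 100.0 * earned / total_max(checklist)


if __name__ == "__main__":
    partial = {"Executable sandbox risk": 15, "Registry is not verification": 5}
    print(total_max(CHECKLIST))
    print(score_report(CHECKLIST, partial))
```

Under this rule the heaviest items (sandbox risk, eval gates) move the total most, which matches the checklist's emphasis on execution safety over report formatting.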
