
sharaf/llm-learning-system-auditor

Use when the user wants to review, audit, or check the safety of an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with a learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and a Stabilize/Standardize/Scale roadmap.

Quality: 100% (Does it follow best practices?)

Impact: 100%, 1.28x (average score across 3 eval scenarios)

Security (by Snyk): Passed, no known issues


evals/scenario-1/criteria.json

{
  "context": "Tests whether the agent audits a session-history learning loop instead of treating stored memories and rules as automatically safe. The system stores raw traces, promotes semantic memories and global rules with weak provenance, lacks counterexamples and rollback, and has no human review or counterfactual eval gate.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Learning-loop map",
      "description": "Report maps the flow from session traces through memory extraction, rule induction, promotion, and runtime injection",
      "max_score": 10
    },
    {
      "name": "0-4 maturity scorecard",
      "description": "Report assigns integer 0-4 maturity scores for at least four relevant domains and does not collapse them into a total or average",
      "max_score": 12
    },
    {
      "name": "Trace provenance and clean-evidence gap",
      "description": "Report states raw sessions are not clean learning evidence because outcome labels, version tags, or provenance metadata are missing or incomplete",
      "max_score": 12
    },
    {
      "name": "Memory governance finding",
      "description": "Report flags memory schema or write-gate gaps such as missing scope, confidence, sensitivity, lifecycle/staleness, deletion, source provenance, or user tenancy boundaries",
      "max_score": 12
    },
    {
      "name": "Rule induction guardrails",
      "description": "Report identifies weak global-rule guardrails such as missing counterexamples, conflict handling, source clusters, negative triggers, single-session promotion controls, or rollback before runtime use",
      "max_score": 12
    },
    {
      "name": "Review, promotion, and rollback",
      "description": "Report flags missing human review, staged rollout, canary, protected promotion state, and rollback path before runtime injection",
      "max_score": 12
    },
    {
      "name": "Privacy and retention risk",
      "description": "Report identifies raw trace retention or masking/access-control gaps without claiming privacy safety from masking alone; encryption only needs mention if evidence supports it",
      "max_score": 10
    },
    {
      "name": "Counterfactual eval coverage",
      "description": "Report asks for counterfactual coverage such as with/without memory or rule comparisons, replay/fork support, ablations, negative controls, or regression analysis; exact terminology is not required",
      "max_score": 8
    },
    {
      "name": "Severity-ranked findings",
      "description": "Findings include severity, evidence, impact, affected surfaces, recommended fix, owner/function, and sequencing dependency",
      "max_score": 8
    },
    {
      "name": "Sequenced roadmap",
      "description": "Report organizes recommendations into Stabilize, Standardize, and Scale or equivalent risk-ordered phases",
      "max_score": 4
    }
  ]
}
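The checklist above assigns each criterion a `max_score`, and the ten values sum to 100. How the registry's grader actually combines them is not shown here, so the following is only a minimal sketch under the assumption that a `weighted_checklist` result is the sum of awarded points divided by the sum of `max_score` values. The function name `score_checklist` and the `awarded` mapping are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Criterion names and weights copied from the checklist above.
CHECKLIST: List[dict] = [
    {"name": "Learning-loop map", "max_score": 10},
    {"name": "0-4 maturity scorecard", "max_score": 12},
    {"name": "Trace provenance and clean-evidence gap", "max_score": 12},
    {"name": "Memory governance finding", "max_score": 12},
    {"name": "Rule induction guardrails", "max_score": 12},
    {"name": "Review, promotion, and rollback", "max_score": 12},
    {"name": "Privacy and retention risk", "max_score": 10},
    {"name": "Counterfactual eval coverage", "max_score": 8},
    {"name": "Severity-ranked findings", "max_score": 8},
    {"name": "Sequenced roadmap", "max_score": 4},
]


def score_checklist(checklist: List[dict], awarded: Dict[str, float]) -> float:
    """Return the percentage of available points earned.

    Assumption: the aggregate is sum(awarded) / sum(max_score).
    Criteria missing from `awarded` count as zero.
    """
    total_available = sum(item["max_score"] for item in checklist)
    if total_available <= 0:
        raise ValueError("checklist has no available points")

    earned = 0.0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        # Reject scores outside the criterion's allowed range.
        if not 0 <= points <= item["max_score"]:
            raise ValueError(f"score out of range for {item['name']!r}")
        earned += points
    return 100.0 * earned / total_available


# Hypothetical grading of one report: full marks on the map,
# half marks on the scorecard, nothing else satisfied.
example_awarded = {"Learning-loop map": 10, "0-4 maturity scorecard": 6}
example_score = score_checklist(CHECKLIST, example_awarded)
```

Because the weights already sum to 100, the percentage equals the raw point total in this case; the division is kept so the sketch still works if a criterion is added or reweighted.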
