Name: sharaf/llm-learning-system-auditor
Rating: 100 (1 review)
Author: sharaf

sharaf/llm-learning-system-auditor

Use when the user wants to review, audit, or check safety for an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and Stabilize/Standardize/Scale roadmap.

Score: 100 (1.28x)
Quality: 100% (Does it follow best practices?)
Impact: 100%, 1.28x (average score across 3 eval scenarios)
Security: Passed (no known issues)

{
  "context": "Tests whether the agent applies domain-specific maturity scoring (0-4 scale without averaging), uses the required report section headings, and applies key guardrails — particularly around LLM-as-judge calibration, transcript evidence requirements, and post-hoc redaction claims. The PromptTuner system has several issues: uncalibrated LLM-as-judge, sessions stored without outcome provenance, post-hoc regex redaction presented as privacy safety, and optimization on the same sessions used for promotion.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "0-4 maturity scores",
      "description": "Report assigns numeric maturity scores using the 0-4 scale for at least three distinct domains (not just a single overall rating)",
      "max_score": 12
    },
    {
      "name": "No collapsed average",
      "description": "Report does NOT present a single overall maturity score or average — scores are kept per-domain to show the shape of risk",
      "max_score": 8
    },
    {
      "name": "Required report headings",
      "description": "Report includes at least eight of these section headings (exact wording or close equivalent): Executive Summary, Evidence Reviewed, Architecture and Learning Loop, Maturity Scorecard, Critical Findings, High Findings, Medium Findings, Low Findings, Domain-by-Domain Assessment, Privacy Security and Provenance Notes, Evaluation and Counterfactual Coverage, Observability and Cost Latency Notes, Failure Mode Review, Prioritized Roadmap, Open Questions",
      "max_score": 12
    },
    {
      "name": "LLM-as-judge calibration flag",
      "description": "Report explicitly states that LLM-as-judge scores (judge.py) cannot be treated as ground truth without calibration against human labels or held-out references",
      "max_score": 12
    },
    {
      "name": "Post-hoc redaction guardrail",
      "description": "Report does NOT claim privacy safety based solely on the regex redaction in judge.py — notes that post-hoc redaction alone is insufficient (raw transcripts still stored in session_log/ per README)",
      "max_score": 10
    },
    {
      "name": "Transcript evidence gap",
      "description": "Report notes that the session transcripts in session_log/ lack outcome labels, version tags, or provenance in findings, evidence tables, or open questions — and does NOT treat them as clean learning evidence",
      "max_score": 10
    },
    {
      "name": "Validation split and optimize-on-gate finding",
      "description": "Report identifies that the prompt optimization loop scores candidates on the same sessions used for baseline or promotion scoring, with no held-out validation split or separate promotion gate",
      "max_score": 18
    },
    {
      "name": "Roadmap with buckets",
      "description": "Report includes a roadmap section that organizes recommendations into Stabilize, Standardize, and Scale groupings (or equivalent sequenced tiers)",
      "max_score": 10
    },
    {
      "name": "No safe-learning claim",
      "description": "Report does NOT declare the current optimization process safe or production-ready without noting the absence of human review, eval gates, or rollback",
      "max_score": 8
    }
  ]
}
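The source does not say how a `weighted_checklist` is aggregated into a final score. As a minimal sketch, assuming the grader awards each item between 0 and its `max_score` and the result is the simple sum over the total, the weights above could be combined as follows. The `score_report` function and the `awarded` mapping are hypothetical names used only for illustration; the item names and weights are copied from the checklist, with descriptions omitted for brevity.

```python
import json

# Abbreviated copy of the checklist above: names and max_score values come from
# the criteria file; descriptions are left out to keep the sketch short.
CRITERIA_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "0-4 maturity scores", "max_score": 12},
    {"name": "No collapsed average", "max_score": 8},
    {"name": "Required report headings", "max_score": 12},
    {"name": "LLM-as-judge calibration flag", "max_score": 12},
    {"name": "Post-hoc redaction guardrail", "max_score": 10},
    {"name": "Transcript evidence gap", "max_score": 10},
    {"name": "Validation split and optimize-on-gate finding", "max_score": 18},
    {"name": "Roadmap with buckets", "max_score": 10},
    {"name": "No safe-learning claim", "max_score": 8}
  ]
}
"""


def score_report(criteria: dict, awarded: dict) -> float:
    """Return the percentage of available points earned.

    Assumption: plain weighted sum. The real eval harness may aggregate
    differently.
    """
    total = sum(item["max_score"] for item in criteria["checklist"])
    earned = 0.0
    for item in criteria["checklist"]:
        points = awarded.get(item["name"], 0)
        # Clamp so no item goes negative or exceeds its weight.
        earned += min(max(points, 0), item["max_score"])
    return 100.0 * earned / total


criteria = json.loads(CRITERIA_JSON)
total_points = sum(item["max_score"] for item in criteria["checklist"])

# Hypothetical grader output: full marks everywhere except the
# validation-split finding, which the report missed entirely.
awarded = {item["name"]: item["max_score"] for item in criteria["checklist"]}
awarded["Validation split and optimize-on-gate finding"] = 0

print(total_points)                     # 100
print(score_report(criteria, awarded))  # 82.0
```

Because the weights sum to 100, each `max_score` reads directly as a percentage of the scenario score under this assumption. Missing the optimize-on-gate finding alone would cost 18 points, the largest single penalty in the checklist.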

evals/scenario-3/criteria.json