
sharaf/llm-learning-system-auditor

Use when the user wants to review, audit, or check safety for an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and Stabilize/Standardize/Scale roadmap.



SKILL.md (skills/llm-learning-system-auditor/)

name: llm-learning-system-auditor
description: Use when the user wants to review, audit, or check safety for an AI memory system, agent learning pipeline, prompt-tuning workflow, skill builder, trace-mining tool, or eval/feedback loop. Produces an evidence-led audit report with learning-loop map, evidence inventory, maturity scorecard, severity-ranked findings, privacy/provenance gaps, counterfactual/eval coverage, and Stabilize/Standardize/Scale roadmap.
metadata: {"version":"0.2.22","source_domain":"llm-learning-systems","source_sub_domains":"session-tool-trace-capture-architecture, trace-parsing-normalization-event-taxonomies, outcome-signals-rewards-evals-feedback-labels, memory-extraction-semantic-episodic-procedural, rule-instruction-induction, skill-discovery-packaging-registries, programmatic-skill-induction-verification, prompt-few-shot-policy-optimization, long-horizon-context-state-blocks, observability-cost-latency-trace-analytics, counterfactual-auditing-skill-impact, human-review-curation-promotion-rollback, privacy-retention-provenance-security, benchmarks-metrics-learning-agents, deployment-architectures-local-cloud-mcp-git, failure-modes-stale-rules-overfitting","research_date":"2026-05-17"}

LLM Learning System Auditor

Purpose

Audit LLM learning loops from raw traces to promoted artifacts: privacy, provenance, eval gates, rollback, stale behavior, overfitting, context poisoning, and regression risk. Load only the needed reference file.

| Reference | Contains |
| --- | --- |
| evidence-inventory.md | Evidence table and status labels |
| audit-domains.md | Domain checks |
| generated-skill-checks.md | Generated executable skill checks |
| findings-and-roadmap.md | Severity and roadmap rules |
| report-template-and-guardrails.md | Report skeleton and guardrails |

First Actions

Start from evidence, not architecture claims.

rg --files | rg '(memory|rule|skill|prompt|eval|trace|span|session|feedback|artifact|registry|policy|retention|provenance|redact|pii|secret)'
rg -n "memory|rule|skill|prompt|eval|trace|span|session|feedback|artifact|registry|rollback|canary|retention|redact|PII|secret|provenance|counterfactual" .
rg --files --hidden --glob '!.git' | rg '(^|/)(README|docs|design|architecture|evals|traces|skills|prompts|policies|\.github/workflows)'

Before findings, produce a factual brief: runtime/approval boundary; evidence and outcome signals; promoted artifacts; promotion/canary/deprecation/rollback; storage, privacy, tool-permission, and tenancy risks.

If artifacts are unavailable, label the gap. Do not imply traces, dashboards, stores, datasets, policies, or code were reviewed when they were not provided.
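
One way to keep unreviewed artifacts visible is to record every expected artifact category with an explicit status. The sketch below is illustrative only. The category names come from the sentence above, but `EvidenceItem`, `label_gaps`, the status strings `reviewed` and `not provided`, and the example paths are all hypothetical and are not taken from evidence-inventory.md.

```python
from dataclasses import dataclass

# Artifact categories named in the guidance above.
EXPECTED_CATEGORIES = ["traces", "dashboards", "stores", "datasets", "policies", "code"]


@dataclass
class EvidenceItem:
    category: str  # one of EXPECTED_CATEGORIES
    path: str      # where the artifact was actually read from


def label_gaps(provided):
    """Map every expected category to 'reviewed' or 'not provided'."""
    seen = {item.category for item in provided}
    return {
        category: "reviewed" if category in seen else "not provided"
        for category in EXPECTED_CATEGORIES
    }


# Hypothetical example: only code and a policy document were supplied.
provided = [
    EvidenceItem("code", "src/memory/extract.py"),
    EvidenceItem("policies", "docs/retention.md"),
]
for category, status in label_gaps(provided).items():
    print(f"{category}: {status}")
```

Because every category is listed even when nothing was supplied for it, the report cannot silently imply that traces or dashboards were inspected.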

Workflow

| Phase | Action | Detail |
| --- | --- | --- |
| 1 | Scope the learning loop | Build the brief from direct evidence |
| 2 | Inventory evidence | Use evidence-inventory.md |
| 3 | Score maturity | Score each relevant area 0-4 |
| 4 | Audit domains | Apply audit-domains.md |
| 5 | Prioritize findings | Use findings-and-roadmap.md |
| 6 | Produce report | Use report-template-and-guardrails.md |

Maturity Scores

Scale: 0 absent; 1 ad hoc/local; 2 incomplete; 3 owned, versioned, gated; 4 measured, privacy-aware, regression-tested, reviewable. Never average scores.
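
The scale can be enforced mechanically before a scorecard is written. The following is a minimal sketch, assuming scores arrive as a plain mapping from domain name to value. `validate_scores` and the example domain names are hypothetical; the labels are copied from the scale above. The function deliberately returns only per-domain results and computes no total or mean.

```python
# Labels copied from the 0-4 scale above.
SCALE = {
    0: "absent",
    1: "ad hoc/local",
    2: "incomplete",
    3: "owned, versioned, gated",
    4: "measured, privacy-aware, regression-tested, reviewable",
}


def validate_scores(scores):
    """Return per-domain labels; reject anything that is not an integer 0-4."""
    labeled = {}
    for domain, score in scores.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or score not in SCALE:
            raise ValueError(f"{domain}: score must be an integer 0-4, got {score!r}")
        labeled[domain] = f"{score} ({SCALE[score]})"
    return labeled


# Hypothetical domains, for illustration only.
print(validate_scores({"Privacy and retention": 1, "Eval gates": 3}))
```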

Critical Audit Anchors

  • Evidence: missing outcome labels, version tags, or provenance metadata => not clean learning evidence.
  • Scoring: per-domain 0-4 integers only; no overall, total, average, or aggregate maturity.
  • Optimization: held-out validation plus separate promotion gate; same-session baseline/candidate/promotion scoring is optimize-on-gate risk.
  • LLM judges: not ground truth without human labels, references, or repeated held-out calibration.
  • Redaction: post-hoc masking is not privacy safety if raw logs are retained.
  • Rule induction: require counterexamples, negative triggers, conflicts, independent source clusters, and rollback before runtime promotion.
  • Generated executable skills: load generated-skill-checks.md. Missing sandbox, provenance, eval gates, trigger policy, local/cloud/MCP boundary, review, or rollback is a finding.
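
The optimization anchor above can be partly checked by comparing example identifiers across data splits. This is a rough sketch under the assumption that each tuning, held-out validation, and promotion-gate example carries a stable ID. `find_split_overlaps` and the sample IDs are hypothetical. An empty result does not prove the gate is independent; it only rules out direct reuse of the same examples.

```python
def find_split_overlaps(tuning_ids, validation_ids, gate_ids):
    """Return example IDs shared between any two splits, keyed by split pair."""
    splits = {
        "tuning": set(tuning_ids),
        "validation": set(validation_ids),
        "gate": set(gate_ids),
    }
    names = list(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = splits[first] & splits[second]
            if shared:
                overlaps[(first, second)] = sorted(shared)
    return overlaps


# Hypothetical IDs: example "t2" was used for tuning and for validation.
overlaps = find_split_overlaps(["t1", "t2"], ["v1", "t2"], ["g1"])
for pair, ids in overlaps.items():
    print(f"optimize-on-gate risk: {pair[0]} and {pair[1]} share {ids}")
```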

Report Minimums

Use these headings exactly when producing the final audit:

## Executive Summary
## Evidence Reviewed
## Architecture and Learning Loop
## Maturity Scorecard
## Critical Findings
## High Findings
## Medium Findings
## Low Findings
## Prioritized Roadmap
## Open Questions

Scorecard row format: | Domain | Score (0-4) | Evidence | Rationale |
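
A small helper can keep scorecard rows in this exact four-column shape. The sketch below is an assumption about how one might render a row, not part of the skill itself. `scorecard_row` and the sample values are hypothetical. It escapes pipe characters so a cell cannot break the table, and it refuses non-integer or out-of-range scores.

```python
def scorecard_row(domain, score, evidence, rationale):
    """Render one row as: | Domain | Score (0-4) | Evidence | Rationale |"""
    is_int = isinstance(score, int) and not isinstance(score, bool)
    if not is_int or not 0 <= score <= 4:
        raise ValueError(f"score must be an integer 0-4, got {score!r}")
    cells = [domain, str(score), evidence, rationale]
    # Escape literal pipes so cell text cannot split into extra columns.
    escaped = [cell.replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(escaped) + " |"


print("| Domain | Score (0-4) | Evidence | Rationale |")
print("| --- | --- | --- | --- |")
# Hypothetical row, for illustration only.
print(scorecard_row("Privacy", 2, "docs/retention.md:14", "Retention documented; raw logs kept"))
```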

Finding Contract

Lead with findings ordered by severity. Every finding must include this block:

- Severity:
- Evidence checked: include `path:line` for local file evidence when available
- Impact:
- Affected learning artifacts or runtime surfaces:
- Recommended fix:
- Owner/function:
- Sequencing dependency:

Use findings-and-roadmap.md for severity classification and roadmap sequencing.
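
The finding contract lends itself to a simple completeness check. The following sketch assumes findings are held as dictionaries keyed by the field names in the block above, and that severities use the four levels named in the report headings (Critical, High, Medium, Low). `missing_fields`, `order_by_severity`, and the sample findings are hypothetical. Severity classification itself still comes from findings-and-roadmap.md.

```python
# Severity levels taken from the report headings; field names from the contract.
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]
REQUIRED_FIELDS = [
    "Severity",
    "Evidence checked",
    "Impact",
    "Affected learning artifacts or runtime surfaces",
    "Recommended fix",
    "Owner/function",
    "Sequencing dependency",
]


def missing_fields(finding):
    """List contract fields that are absent or blank in a finding."""
    return [name for name in REQUIRED_FIELDS if not str(finding.get(name, "")).strip()]


def order_by_severity(findings):
    """Sort findings Critical first; sorted() is stable, so ties keep input order."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["Severity"]))


# Hypothetical findings, for illustration only.
findings = [
    {"Severity": "Low", "Impact": "Stale README"},
    {"Severity": "Critical", "Impact": "Raw logs retained after masking"},
]
for finding in order_by_severity(findings):
    print(finding["Severity"], "missing:", missing_fields(finding))
```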
