pantheon-ai/skill-quality-auditor

Audit and improve skill collections with a 9-dimension scoring framework (Knowledge Delta, Mindset, Anti-Patterns, Specification Compliance, Progressive Disclosure, Freedom Calibration, Pattern Recognition, Practical Usability, Eval Validation), duplication detection, remediation planning, baseline comparison, and CI quality gates; use when evaluating skill quality, generating remediation plans, detecting duplicates, validating artifact conventions, or enforcing publication thresholds.


---
category: framework
priority: CRITICAL
source: quality framework + session experience
---

# Skill-Judge Evaluation Framework: 9 Dimensions

Complete evaluation methodology for assessing skill quality using the 9-dimension quality framework. This is the foundation for all quality auditing.

Canonical source reference: framework-dimensions.md

## Overview

The quality framework evaluates skills across 9 dimensions totaling 140 points. Dimension 1 (Knowledge Delta) and Dimension 9 (Eval Validation) carry the highest weight at 20 points each: skills must contain expert-only knowledge AND be validated at runtime via tessl eval scenarios.

Target Score: ≥126 points (90%) = A-grade

## Dimension 1: Knowledge Delta (20 points) ⭐ MOST IMPORTANT

Purpose: Ensure skill contains expert-only knowledge, not redundant information.

Scoring:

  • 18-20 points: Pure expert knowledge, <5% redundancy
  • 15-17 points: Mostly expert, 5-15% redundancy
  • 12-14 points: 15-30% redundancy (acceptable)
  • 9-11 points: 30-50% redundancy (needs improvement)
  • 0-8 points: >50% redundancy (failing)
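
These bands translate mechanically into points once a redundancy ratio has been measured. A minimal sketch (band edges come from the rubric above; returning each band's ceiling and the exact boundary handling are assumptions):

```typescript
// Map a measured redundancy ratio (0.0 to 1.0) to a D1 score.
// Returns the top of each band; a real scorer may place results within bands.
function knowledgeDeltaPoints(redundancyRatio: number): number {
  if (redundancyRatio < 0.05) return 20; // pure expert knowledge
  if (redundancyRatio <= 0.15) return 17; // mostly expert
  if (redundancyRatio <= 0.3) return 14; // acceptable
  if (redundancyRatio <= 0.5) return 11; // needs improvement
  return 8; // failing
}

console.log(knowledgeDeltaPoints(0.12)); // 17, the "mostly expert" band
```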

Core Principle: Skill = Expert Knowledge - What AI Assistants Already Know

### Three Knowledge Types

  1. Expert (KEEP):

    • Domain-specific patterns AI assistants don't know
    • Project-specific conventions
    • Lessons from production experience
    • Tool gotchas and non-obvious behavior
    • Decision frameworks (when to use X vs Y)
    • Anti-patterns with WHY they fail
  2. Activation (BRIEF REMINDERS OK):

    • When to use this skill
    • Trigger keywords for pattern matching
    • Brief context setting (2-3 sentences)
  3. Redundant (DELETE):

    • Basic syntax AI assistants know
    • Installation instructions from official docs
    • API documentation copied verbatim
    • Generic best practices
    • Obvious examples

### Red Flags for Low Knowledge Delta

❌ Teaching basic syntax (AI assistants know if/else, function, class)
❌ Copying official documentation (schema definitions, rule lists)
❌ Explaining fundamentals (what is REST, what is a database)
❌ Generic advice (write tests, use version control)
❌ Installation tutorials (npm install, pip install)
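
A crude way to surface these red flags is a phrase scan. A sketch, where the pattern list is illustrative rather than the auditor's actual ruleset:

```typescript
// Flag likely-redundant content by scanning for telltale phrases.
const RED_FLAG_PATTERNS: Array<[RegExp, string]> = [
  [/\b(npm|pip) install\b/i, "installation tutorial"],
  [/\bwhat is (REST|a database)\b/i, "explaining fundamentals"],
  [/\b(write tests|use version control)\b/i, "generic advice"],
];

function findRedFlags(skillText: string): string[] {
  return RED_FLAG_PATTERNS.filter(([re]) => re.test(skillText)).map(
    ([, label]) => label,
  );
}

findRedFlags("Run npm install, then write tests.");
// ["installation tutorial", "generic advice"]
```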

### Examples

❌ **Low Knowledge Delta (12/20):**

```markdown
# TypeScript Basics

## Variables
Use `let` for mutable, `const` for immutable:
let count = 0
const name = "Alice"

## Functions
Functions can be declared or arrow:
function add(a: number, b: number) { return a + b }
const add = (a: number, b: number) => a + b
```

*Problem: AI assistants already know basic TypeScript syntax.*

✅ **High Knowledge Delta (19/20):**

```markdown
# TypeScript: Making Illegal States Unrepresentable

## The Pattern
Use discriminated unions to eliminate impossible states:

❌ BAD: Multiple optional fields create 16 possible states
type Request = {
  loading?: boolean
  error?: string
  data?: User
}

✅ GOOD: Tagged union with 3 valid states only
type Request =
  | { status: 'loading' }
  | { status: 'error', error: string }
  | { status: 'success', data: User }

## Why This Matters
Bad design allows bugs: `{ loading: true, data: user }` is impossible but TypeScript allows it.
Good design: TypeScript prevents impossible states at compile time.
```

*Expert pattern AI assistants don't know by default.*

## Dimension 2: Mindset + Procedures (15 points)

Purpose: Provide philosophical framing and step-by-step workflows.

Scoring:

  • 13-15 points: Clear mindset + detailed procedures + when/when-not
  • 10-12 points: Has most elements, minor gaps
  • 7-9 points: Missing key element
  • 0-6 points: Generic or absent

### Components

  1. Clear Mindset/Philosophy (5 points)

    • Core principle or philosophy
    • Why this approach over alternatives
    • Example: "Trust but verify" (proof-of-work), "Composition over inheritance" (structural-design)
  2. Step-by-Step Procedures (5 points)

    • Numbered workflow
    • Clear entry/exit points
    • Validation steps
    • Example: TDD cycle (Red → Green → Refactor)
  3. When/When-Not Guidance (5 points)

    • Clear activation criteria
    • Explicit non-applicable scenarios
    • Example: "Use for backend APIs, NOT for UI styling"

### Example

✅ **Strong Mindset + Procedures (15/15):**

```markdown
# Test-Driven Development

## Mindset
Write tests BEFORE implementation. The test defines the contract; implementation fulfills it.

## Workflow
1. Red: Write failing test (verify it fails)
2. Green: Minimum code to pass
3. Refactor: Improve without breaking tests

## When to Apply
✅ New functions, features, bug fixes (reproduce first)
❌ UI styling, configuration, documentation

## When NOT to Apply
- Throwaway prototypes
- Generated code
- Trivial getters/setters
```

## Dimension 3: Anti-Pattern Quality (15 points)

Purpose: Teach what NOT to do with clear explanations of WHY.

Scoring:

  • 13-15 points: NEVER lists + concrete examples + consequences
  • 10-12 points: Has most elements
  • 7-9 points: Generic warnings
  • 0-6 points: Missing or weak

### Components

  1. NEVER Lists with WHY (5 points)

    • Explicit "NEVER do X because Y" statements
    • Not just "avoid" - use strong language
    • Example: "NEVER trust agent completion reports without verification"
  2. Concrete Examples (5 points)

    • Show bad code, not just descriptions
    • Side-by-side ❌ BAD / ✅ GOOD comparisons
    • Real-world scenarios
  3. Consequences Explained (5 points)

    • What breaks when anti-pattern used
    • Impact: security, performance, maintainability
    • Example: "Leads to SQL injection attacks"

### Example

✅ **Strong Anti-Patterns (14/15):**

```markdown
## Anti-Patterns

❌ **NEVER use string interpolation for SQL**
WHY: Opens SQL injection vulnerabilities

// BAD - Vulnerable to injection
db.query(`SELECT * FROM users WHERE id = ${userId}`)

// GOOD - Safe with prepared statements
db.query('SELECT * FROM users WHERE id = ?', [userId])

**Consequence:** Attacker can inject `1 OR 1=1` to dump entire table.

❌ **NEVER skip test failure verification**
WHY: False positives waste hours debugging phantom issues

**Consequence:** Test passes even with bugs, leading to production failures.
```

## Dimension 4: Specification Compliance (15 points)

Purpose: Ensure proper frontmatter, single-task focus, activation keywords, and cross-harness portability.

Scoring:

  • 13-15 points: Perfect spec compliance
  • 10-12 points: Minor issues
  • 7-9 points: Missing key elements
  • 0-6 points: Non-compliant

### Components

  1. Task Focus Declaration (4 points) ⭐ CRITICAL

    • Skill indicates ONE type of task it helps complete
    • Description clearly scopes to single purpose
    • No ambiguity about what the skill does
    • Example: "Write BDD tests" (good) vs "Testing and development" (bad - two tasks)
  2. Description Field Quality (6 points)

    • Primary agents: Exactly 3 words
    • Other agents: Comprehensive with trigger examples
    • Must include activation keywords
    • Determines if skill activates
  3. Cross-Harness Portability (3 points) ⭐ CRITICAL

    • No harness-specific paths (1 point): Avoid .opencode/, .claude/, .cursor/, .aider/, .continue/
    • No agent-specific references (1 point): Don't mention "Claude Code", "Cursor Agent", "GitHub Copilot", etc. in instructions
    • Relative path usage (1 point): Reference files relative to skill directory (scripts/, references/, templates/)
    • WHY: Skills must work across 40+ agentic harnesses without modification
    • IMPACT: Harness-specific paths break skill discovery when synced to other agents
  4. Self-Containment (penalties: up to -12 points) ⭐ CRITICAL

    SKILL.md penalties (checked outside fenced code blocks):

    • No parent-escaping paths (-2 points): SKILL.md must not use ../ references outside fenced code blocks. Skills are installed to arbitrary locations; parent paths break when the skill is not in its original repo.
    • No absolute repo paths (-1 point): SKILL.md must not reference skills/X/Y/Z or other hardcoded repository paths outside fenced code blocks. Cross-skill dependencies should use skill names, not file paths.
    • No repo-root directory references (-1 point): SKILL.md must not reference .context/, .agents/, or other repo-root directories outside fenced code blocks.

    scripts/ penalties (-1 per file with violation, cap -2 per category; a capping sketch follows this component list):

    • No absolute repo paths in scripts (-2 max): Each script file referencing skills/X/Y/Z paths loses 1 point (capped at -2 total).
    • No repo-root directory references in scripts (-2 max): Each script file referencing .context/ or .agents/ paths loses 1 point (capped at -2 total).

    references/ penalties (-1 per file with violation, cap -2 per category):

    • No absolute repo paths in references (-2 max): Each reference file referencing skills/X/Y/Z paths loses 1 point (capped at -2 total).

    • No repo-root directory references in references (-2 max): Each reference file referencing .context/ or .agents/ paths loses 1 point (capped at -2 total).

    • WHY: Skills must be fully self-contained. When installed via tessl install or npx skills add, they land in arbitrary directories. Any reference to files outside the skill's own directory tree will break — whether in SKILL.md, scripts, or reference files.

    • IMPACT: Non-self-contained skills fail silently when installed outside their authoring repo.

  5. Script Language Portability (bonus: +1 point)

    • Skills with scripts/ containing Python (.py), TypeScript (.ts), or JavaScript (.js) files earn a portability bonus.
    • These languages provide better cross-platform string manipulation, JSON handling, and error handling compared to shell for complex logic.
    • Shell scripts (.sh) remain the accepted default and receive no penalty.
    • Accepted shebangs: #!/usr/bin/env python3 (Python), #!/usr/bin/env bun (TypeScript), #!/usr/bin/env node (JavaScript)
    • WHY: Complex scripts that parse JSON, manipulate strings, or make HTTP calls are more portable and robust in Python/TS/JS than in POSIX shell (which depends on external tools like jq, and has GNU-vs-BSD divergence for grep/sed/awk).
  6. Proper Frontmatter (1 point)

    • name, description present
    • Consolidation notes if applicable
    • Correct YAML syntax
  7. Activation Keywords (1 point)

    • Domain terms that trigger skill
    • Example: "BDD, Gherkin, Given-When-Then, Cucumber"
  8. References Section Format (bonus: +1 point)

    • See the References Section Standard below.
    • +1 point: heading is exactly ## References, it is the last H2 in SKILL.md, content is a Markdown table with Topic | Reference | When to Use columns, every Reference cell is a markdown link
    • 0 points: section missing when references exist, wrong heading name (e.g. ## Resources, ## Quick Reference), bullet list instead of table, bare URLs, plain-text paths, or missing required columns
    • Omission without penalty: skills with no references/ directory and no external resources may omit the section entirely
    • WHY: A 3-column table forces authors to articulate what a reference covers and when an agent should load it — making references actionable rather than decorative
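
The portability checks, capped penalties, and the script-language bonus above are all straightforward to mechanize. A minimal sketch, assuming file contents are already read into memory and fenced code blocks already stripped (the patterns follow the rules above; function names are illustrative):

```typescript
// Self-containment and portability checks for one skill.
const HARNESS_DIRS = /\.(opencode|claude|cursor|aider|continue)\//; // cross-harness rule
const REPO_PATHS = /\bskills\/[\w-]+\//; // absolute repo paths like skills/X/Y/Z
const ROOT_DIRS = /\.(context|agents)\//; // repo-root directory references

// -1 per offending file, capped at -2 per category.
function cappedPenalty(fileContents: string[], pattern: RegExp): number {
  const offenders = fileContents.filter((text) => pattern.test(text));
  return -Math.min(offenders.length, 2);
}

// +1 bonus when any script is Python/TypeScript/JavaScript.
function scriptLanguageBonus(scriptNames: string[]): number {
  return scriptNames.some((name) => /\.(py|ts|js)$/.test(name)) ? 1 : 0;
}

// Example: one script hardcodes a repo path, one references .context/
const scripts = ["skills/foo/bar/run.sh", "see .context/notes.md", "clean"];
console.log(cappedPenalty(scripts, REPO_PATHS)); // -1
console.log(cappedPenalty(scripts, ROOT_DIRS)); // -1
```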

### Examples

✅ **Excellent Specification Compliance (15/15):**

````markdown
---
name: bdd-testing
description: Behavior-Driven Development with Given-When-Then scenarios, Cucumber.js, Three Amigos collaboration, Example Mapping, living documentation, and acceptance criteria. Use when writing BDD tests, feature files, or planning discovery workshops.
---

# BDD Testing

Execute test runner with portable path:
```bash
bun run scripts/run-tests.sh
```

Reference files use relative paths: references/file.md
````

*Perfect: comprehensive description, portable paths (scripts/, references/), no agent mentions*

❌ **Poor Specification Compliance (7/15):**

````markdown
---
name: bdd-testing
description: BDD testing patterns
---

# BDD Testing

For Claude Code users, run:
```bash
.opencode/scripts/run-tests.sh
```

For Cursor users, see .claude/docs/file.md
````

*Problems: weak description, harness-specific paths (.opencode/, .claude/), agent-specific references*

### References Section Standard

Every SKILL.md that has references or external resources MUST end with a `## References` section using a **3-column Markdown table** with columns `Topic`, `Reference`, and `When to Use`:

```markdown
## References

| Topic | Reference | When to Use |
| --- | --- | --- |
| Security patterns, caching, and trigger configuration | [Best Practices](references/best-practices.md) | Every time you generate a workflow |
| Pinned action versions and input/output specs | [Common Actions](references/common-actions.md) | When using any public action |
| Official workflow syntax and expression reference | [GitHub Actions Docs](https://docs.github.com/en/actions) | For syntax lookup |
```

Sub-sections (H3 headings) are allowed to group rows by theme when a skill has many references:

```markdown
## References

### Generators

| Topic | Reference | When to Use |
| --- | --- | --- |
| Tree API patterns for file operations | [Tree API Reference](references/tree-api-reference.md) | Any generator that reads or writes files |

### Executors

| Topic | Reference | When to Use |
| --- | --- | --- |
| ExecutorContext fields and lifecycle | [Executor Context API](references/executor-context-api.md) | Building a custom executor |
```

Rules:

| Rule | Requirement |
| --- | --- |
| Heading | Exactly `## References` — no variants (`## Resources`, `## See Also`, etc.) |
| Position | Last H2 section in the file |
| Format | Markdown table with Topic \| Reference \| When to Use columns — no bullet lists, no bare URLs |
| Reference column | Every cell in the Reference column MUST be a markdown link `[text](url)` |
| Topic column | One-line description of what the referenced file or resource covers |
| When to Use column | Concrete scenario that tells the agent when to load or consult the reference |
| Sub-sections | Optional H3 headings are allowed to group rows by theme |
| Omission | Allowed only when the skill has nothing to reference (no penalty) |

❌ **Non-compliant (0 bonus points):**

```markdown
## Resources

- references/file.md
- https://example.com/docs
```

Problems: wrong heading (`## Resources`), bullet list instead of table, bare path and bare URL.

```markdown
## References

- [Error Patterns](references/error-patterns.md) — common failure modes
```

Problems: bullet list format — table with Topic/Reference/When to Use is required.

✅ **Compliant (+1 bonus point):**

```markdown
## References

| Topic | Reference | When to Use |
| --- | --- | --- |
| Common failure modes and remediation steps | [Error Patterns](references/error-patterns.md) | When diagnosing unexpected output or synth failures |
| Authoritative command and flag reference | [Official CLI Docs](https://example.com/cli) | For exact flag syntax lookup |
```
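
These rules are all statically checkable. A sketch of a validator for the single-table case (grouped H3 sub-tables would need a small extension, and the heading regexes assume standard markdown):

```typescript
// Validate the ## References section of a SKILL.md string.
function checkReferencesSection(skillMd: string): string[] {
  const problems: string[] = [];
  const h2s = [...skillMd.matchAll(/^## (.+)$/gm)].map((m) => m[1].trim());
  if (h2s.length === 0 || h2s[h2s.length - 1] !== "References") {
    problems.push("## References must be the last H2 in the file");
  }
  const section = skillMd.split(/^## References\s*$/m)[1] ?? "";
  if (!/\|\s*Topic\s*\|\s*Reference\s*\|\s*When to Use\s*\|/.test(section)) {
    problems.push("missing Topic | Reference | When to Use table header");
  }
  const rows = section
    .split("\n")
    .filter((l) => l.trim().startsWith("|") && !/^\|[\s|-]+\|$/.test(l.trim()));
  for (const row of rows.slice(1)) { // skip the header row
    const referenceCell = row.split("|")[2] ?? "";
    if (!/\[.+\]\(.+\)/.test(referenceCell)) {
      problems.push(`Reference cell is not a markdown link: ${referenceCell.trim()}`);
    }
  }
  return problems;
}
```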

## Dimension 5: Progressive Disclosure (15 points)

Purpose: Structure content for on-demand loading, not frontloading everything.

Scoring:

  • 13-15 points: Navigation hub + references/ + categories + lazy-load guidance
  • 10-12 points: Some organization, could improve
  • 7-9 points: Everything frontloaded, >300 lines
  • 0-6 points: No structure, >500 lines

### Components

  1. Navigation Hub Approach (5 points)

    • SKILL.md is <100 lines
    • Overview + when-to-use + reference guide
    • NOT full content
    • Example: supabase-postgres-best-practices (65 lines)
  2. References Directory (4 points)

    • Detailed content in references/*.md
    • Each reference 100-500 lines
    • Focused on ONE topic
  3. Category Organization (3 points)

    • Files organized by prefix (principles-, patterns-, etc.)
    • Priority labels (CRITICAL, HIGH, MEDIUM, LOW)
  4. Lazy Loading Guidance (3 points) ⭐ REQUIRED

    • References table includes a concrete "When to Use" column that tells agents exactly which task triggers loading each reference
    • AGENTS.md (or equivalent navigation file) explicitly instructs agents to load only the minimum references needed for the current task — not all references upfront
    • Each reference entry states a specific, actionable condition (e.g. "When diagnosing a D3 failure", "Only when preparing a CI gate") rather than a generic description
    • WHY: Without explicit lazy-load guidance, agents default to loading all references eagerly, wasting context on irrelevant content and degrading performance
    • IMPACT: Skills without lazy-load guidance consume 3-10x more context than necessary, reducing the agent's effective working memory

### Lazy Loading Anti-Patterns

**NEVER list references without "When to Use" conditions**

```markdown
## References
- references/dimensions.md
- references/scoring.md
- references/anti-patterns.md
```

This forces agents to load everything or guess what to load.

**NEVER use vague "When to Use" entries**

```markdown
| Topic | Reference | When to Use |
| --- | --- | --- |
| Scoring | [Scoring Rubric](references/scoring.md) | For scoring |
| Anti-patterns | [Anti-Patterns](references/anti-patterns.md) | For anti-patterns |
```

"For scoring" is not actionable — it does not tell the agent when NOT to load the file.

**Explicit lazy-load conditions:**

```markdown
| Topic | Reference | When to Use |
| --- | --- | --- |
| Per-dimension criteria and bonus rules | [Dimensions](references/dimensions.md) | Evaluating any individual dimension or understanding the rubric |
| Score thresholds and grade bands | [Scoring Rubric](references/scoring.md) | Calculating a total score or assigning a grade — skip if only auditing structure |
| NEVER/WHY/BAD/GOOD failure modes | [Anti-Patterns](references/anti-patterns.md) | Explaining why a dimension scored low or writing remediation guidance |
```

**AGENTS.md with explicit lazy-load instruction:**

```markdown
## Usage Instructions

1. Read SKILL.md — navigation hub only
2. Identify your task from the task categories below
3. Load ONLY the references listed for that task
4. Do NOT pre-load all references
```
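
One cheap lint for the vague-entry anti-pattern above: flag "When to Use" cells that are too short to state a real condition. A sketch (the four-word threshold is an arbitrary assumption):

```typescript
// Flag table rows whose "When to Use" cell is too short to be actionable.
function vagueWhenToUseCells(tableRows: string[]): string[] {
  return tableRows
    .map((row) => (row.split("|")[3] ?? "").trim())
    .filter((cell) => cell.length > 0 && cell.split(/\s+/).length < 4);
}

vagueWhenToUseCells([
  "| Scoring | [Scoring Rubric](references/scoring.md) | For scoring |",
]);
// ["For scoring"]: two words, states no condition for loading or skipping
```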

### Example

✅ **Excellent Progressive Disclosure (15/15):**

```
bdd-testing/
├── SKILL.md (64 lines - navigation hub with actionable "When to Use" per reference)
├── AGENTS.md (explicit: "load only references needed for current task")
└── references/
    ├── principles-three-amigos.md (CRITICAL, 250 lines)
    ├── gherkin-syntax.md (HIGH, 180 lines)
    └── practices-tags.md (MEDIUM, 120 lines)
```

❌ **Poor Progressive Disclosure (6/15):**

```
bdd-testing/
└── SKILL.md (1,800 lines - everything frontloaded)
```

❌ **Missing Lazy Loading (10/15 — loses 3 points):**

```
bdd-testing/
├── SKILL.md (80 lines - good hub, but references table has no "When to Use" column)
├── AGENTS.md (says "load all references before starting")
└── references/
    ├── principles-three-amigos.md
    ├── gherkin-syntax.md
    └── practices-tags.md
```

## Dimension 6: Freedom Calibration (15 points)

Purpose: Balance prescription (rigid rules) vs flexibility (guidelines).

Scoring:

  • 13-15 points: Appropriate for skill type
  • 10-12 points: Slightly too rigid or loose
  • 7-9 points: Mismatched calibration
  • 0-6 points: Completely wrong

### Calibration Levels

  1. Rigid (Mindset skills): Strong rules, must follow

    • Example: proof-of-work "NEVER trust agent reports without verification"
    • Use: Critical foundations, security, correctness
  2. Balanced (Process skills): Clear steps with flexibility

    • Example: TDD "Red → Green → Refactor (adapt to context)"
    • Use: Workflows, methodologies
  3. Flexible (Tool skills): Options and trade-offs

    • Example: typescript-type-system "Choose based on use case"
    • Use: Technical tools, patterns

### Example

✅ **Well-Calibrated (14/15):**

```markdown
# Proof of Work (Mindset skill)

## Zero-Tolerance Rules
NEVER trust agent completion reports without verification.
ALWAYS show command output as proof.
ZERO exceptions to verification protocol.
```

*Appropriately rigid for critical verification.*

❌ **Miscalibrated (7/15):**

```markdown
# TypeScript Basics (Tool skill)

## Rules
ALWAYS use const for all variables.
NEVER use let or var under any circumstances.
```

*Too rigid - let has valid use cases.*

## Dimension 7: Pattern Recognition (10 points)

Purpose: Ensure skill activates when needed via description keywords.

Scoring:

  • 9-10 points: Rich keywords, comprehensive triggers
  • 7-8 points: Good keywords, could expand
  • 5-6 points: Basic keywords
  • 0-4 points: Missing or poor

### Requirements

  • Description must include domain keywords
  • Trigger scenarios in description or "When to Apply"
  • Example: "Use when writing BDD tests, feature files, Gherkin scenarios..."

Remember: Best description = exhaustive trigger list + examples
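
Keyword richness can be approximated by checking a description against the trigger terms the skill should respond to. A sketch (both inputs and the coverage metric are illustrative, not the auditor's actual method):

```typescript
// Fraction of expected trigger keywords present in the description.
function triggerCoverage(description: string, triggers: string[]): number {
  const text = description.toLowerCase();
  const hits = triggers.filter((t) => text.includes(t.toLowerCase()));
  return hits.length / triggers.length;
}

triggerCoverage(
  "Use when writing BDD tests, feature files, Gherkin scenarios",
  ["BDD", "Gherkin", "Given-When-Then", "Cucumber"],
); // 0.5: two of four expected triggers appear
```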

## Dimension 8: Practical Usability (15 points)

Purpose: Ensure skill is immediately useful with clear examples.

Scoring:

  • 13-15 points: Concrete + runnable + clear
  • 10-12 points: Most examples good
  • 7-9 points: Some weak examples
  • 0-6 points: Abstract or missing

### Components

  1. Concrete Examples (5 points)

    • Real code, not pseudocode
    • Realistic scenarios
    • Actual file paths, commands
  2. Runnable Code (5 points)

    • Can copy/paste and execute
    • Complete, not fragments
    • Correct syntax
  3. Clear Structure (5 points)

    • Logical organization
    • Scannable headings
    • Code blocks properly formatted

## Dimension 9: Eval Validation (20 points) ⭐ HIGHEST PRIORITY

Purpose: Verify the skill has been validated at runtime through tessl eval scenarios, proving agents actually follow its instructions.

Scoring:

  • 17-20 points: Complete evals with >=80% instruction coverage, >=3 valid scenarios
  • 13-16 points: Evals present with partial coverage or incomplete scenarios
  • 7-12 points: Evals directory exists but missing key files
  • 1-6 points: Minimal eval structure, no coverage data
  • 0 points: No evals directory

Core Principle: Static quality (D1-D8) is necessary but not sufficient. Runtime validation proves the skill actually changes agent behavior.

### Components

  1. Eval Directory Structure (4 points)

    • evals/ directory exists with proper layout
    • Follows tessl eval harness conventions
  2. Instruction Inventory (3 points)

    • instructions.json present and non-empty
    • Every instruction extracted from SKILL.md
    • Classified by why_given: reminder, new knowledge, preference
  3. Coverage Statistics (6 points)

    • summary.json with instructions_coverage data (3 points)
    • Coverage percentage >= 80% (3 points)
  4. Valid Scenarios (4 points)

    • ≥3 scenarios with complete structure (task.md + criteria.json + capability.txt); see the structural check sketched after this list
    • Each criteria.json sums to exactly 100
  5. Criteria Quality (3 points)

    • 10+ checklist items per scenario
    • Binary yes/no criteria traceable to specific instructions
    • No instruction leakage in task.md
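
The structural parts of this rubric can be verified with a short script. A sketch, assuming criteria.json is an array of items with numeric `weight` fields (the exact schema is an assumption):

```typescript
import { existsSync, readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Check the structural D9 requirements for one skill's evals/ directory.
function checkEvalStructure(evalsDir: string): string[] {
  if (!existsSync(evalsDir)) return ["no evals/ directory (scores 0/20)"];
  const problems: string[] = [];
  if (!existsSync(join(evalsDir, "instructions.json"))) {
    problems.push("missing instructions.json");
  }
  const scenarios = readdirSync(evalsDir).filter((d) => d.startsWith("scenario-"));
  if (scenarios.length < 3) problems.push(`only ${scenarios.length} scenarios (need >=3)`);
  for (const scenario of scenarios) {
    for (const required of ["task.md", "criteria.json", "capability.txt"]) {
      if (!existsSync(join(evalsDir, scenario, required))) {
        problems.push(`${scenario}: missing ${required}`);
      }
    }
    const criteriaPath = join(evalsDir, scenario, "criteria.json");
    if (existsSync(criteriaPath)) {
      const criteria: Array<{ weight: number }> = JSON.parse(readFileSync(criteriaPath, "utf8"));
      const total = criteria.reduce((sum, item) => sum + item.weight, 0);
      if (total !== 100) problems.push(`${scenario}: criteria sum to ${total}, not 100`);
    }
  }
  return problems;
}
```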

### Relationship to D1 and D3

When instructions.json exists, its data enriches other dimensions:

  • D1 (Knowledge Delta): The why_given distribution (new knowledge + preference vs reminders) provides a more accurate expert content ratio than shell heuristics alone.
  • D3 (Anti-Pattern Quality): Instructions containing NEVER/ALWAYS/anti-pattern keywords are cross-referenced with scenario coverage for a stronger signal.
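
A sketch of the D1 enrichment, assuming instructions.json is an array of objects with a `why_given` field as described above:

```typescript
type WhyGiven = "reminder" | "new knowledge" | "preference";

// Expert-content ratio: the share of instructions that are NOT mere reminders.
function expertRatio(instructions: Array<{ why_given: WhyGiven }>): number {
  if (instructions.length === 0) return 0;
  const expert = instructions.filter((i) => i.why_given !== "reminder").length;
  return expert / instructions.length;
}
```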

### Creating Evals

Use the creating-eval-scenarios skill to generate evaluation scenarios:

```bash
# Ensure skill is packaged as a tessl tile first
tessl eval run <tile-path>
tessl eval view-status <status_id> --json
```

### Examples

**High Eval Validation (19/20):**

```
skill-name/evals/
  instructions.json      # 28 instructions extracted
  summary.json           # 100% coverage, 5 scenarios
  summary_infeasible.json
  scenario-0/            # task.md + criteria.json (sum=100) + capability.txt
  scenario-1/
  scenario-2/
  scenario-3/
  scenario-4/
```

**Low Eval Validation (4/20):**

```
skill-name/evals/
  instructions.json      # Present but only 5 instructions
  # No summary.json, no scenarios
```

**Zero Eval Validation (0/20):**

```
skill-name/
  SKILL.md               # No evals/ directory at all
```

## Summary: The 140-Point Scale

| Dimension | Max | Priority | Focus |
| --- | --- | --- | --- |
| D1: Knowledge Delta | 20 | HIGHEST | Expert knowledge only |
| D2: Mindset + Procedures | 15 | HIGH | Philosophy + workflows |
| D3: Anti-Pattern Quality | 15 | HIGH | NEVER + WHY + consequences |
| D4: Specification | 15 | MEDIUM | Description field critical |
| D5: Progressive Disclosure | 15 | MEDIUM | Hub + references + lazy-load guidance |
| D6: Freedom Calibration | 15 | MEDIUM | Appropriate rigidity |
| D7: Pattern Recognition | 10 | LOW | Activation keywords |
| D8: Practical Usability | 15 | HIGH | Concrete examples |
| D9: Eval Validation | 20 | HIGHEST | Runtime validation via tessl evals |
| **TOTAL** | **140** | | **A-grade = 126+** |
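
A sketch of the roll-up, using the maxima from the table (clamping each dimension at its max is an assumption here, since D4 bonuses and penalties can move a raw score outside its band):

```typescript
// Per-dimension maxima from the table above.
const MAX_POINTS: Record<string, number> = {
  D1: 20, D2: 15, D3: 15, D4: 15, D5: 15, D6: 15, D7: 10, D8: 15, D9: 20,
};

function totalScore(scores: Record<string, number>): { total: number; aGrade: boolean } {
  let total = 0;
  for (const [dim, max] of Object.entries(MAX_POINTS)) {
    total += Math.min(scores[dim] ?? 0, max); // clamp: an assumption, see above
  }
  return { total, aGrade: total >= 126 }; // 126/140 = 90%
}
```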

## See Also

  • framework-scoring-rubric.md - Detailed scoring methodology
  • framework-quality-standards.md - A-grade requirements
  • creating-eval-scenarios skill - Tessl eval scenario generation
