Audit and improve skill collections with an 8-dimension scoring framework, duplication detection, remediation planning, and CI quality gates; use when evaluating skill quality, generating remediation plans, validating report format, or enforcing repository-wide skill artifact conventions.
Complete evaluation methodology for assessing skill quality using the skill-judge framework. This is the foundation for all quality auditing.
Canonical source reference: framework-skill-judge-canonical.md
The skill-judge framework evaluates skills across 8 dimensions totaling 120 points. Dimension 1 (Knowledge Delta) is the most important: skills must contain expert-only knowledge, not concepts AI assistants already know.
Target Score: ≥108 points (90%) = A-grade
Dimension 1: Knowledge Delta (20 points)
Purpose: Ensure the skill contains expert-only knowledge, not redundant information.
Scoring:
Core Principle: Skill = Expert Knowledge - What AI Assistants Already Know
Expert (KEEP):
Activation (BRIEF REMINDERS OK):
Redundant (DELETE):
❌ Teaching basic syntax (AI assistants know if/else, function, class)
❌ Copying official documentation (schema definitions, rule lists)
❌ Explaining fundamentals (what is REST, what is a database)
❌ Generic advice (write tests, use version control)
❌ Installation tutorials (npm install, pip install)
❌ Low Knowledge Delta (12/20):
# TypeScript Basics
## Variables
Use `let` for mutable, `const` for immutable:
let count = 0
const name = "Alice"
## Functions
Functions can be declared or arrow:
function add(a: number, b: number) { return a + b }
const add = (a: number, b: number) => a + b
Problem: AI assistants already know basic TypeScript syntax
✅ High Knowledge Delta (19/20):
# TypeScript: Making Illegal States Unrepresentable
## The Pattern
Use discriminated unions to eliminate impossible states:
❌ BAD: Three optional fields allow 8 presence combinations, most of them invalid
type Request = {
loading?: boolean
error?: string
data?: User
}
✅ GOOD: Tagged union with 3 valid states only
type Request =
| { status: 'loading' }
| { status: 'error', error: string }
| { status: 'success', data: User }
## Why This Matters
Bad design allows bugs: `{ loading: true, data: user }` is contradictory, yet TypeScript accepts it.
Good design: TypeScript rejects impossible states at compile time.
Expert pattern AI assistants don't apply by default
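To illustrate why the tagged union pays off, here is a minimal sketch of exhaustive handling. It assumes a placeholder `User` type (the example above does not define one) and renames the union to `RequestState` so it does not collide with the built-in `Request` type; `describeState` is a hypothetical helper, not part of the framework.

```typescript
// Placeholder type for illustration only; the example above leaves User undefined.
type User = { name: string };

// Same three variants as the tagged union in the example above.
type RequestState =
  | { status: 'loading' }
  | { status: 'error'; error: string }
  | { status: 'success'; data: User };

// Hypothetical helper: each branch can only touch the fields of its own variant.
function describeState(state: RequestState): string {
  switch (state.status) {
    case 'loading':
      return 'Loading...';
    case 'error':
      return `Failed: ${state.error}`;
    case 'success':
      return `Loaded ${state.data.name}`;
    default: {
      // If a new variant is added to RequestState and not handled above,
      // this assignment to `never` stops compiling.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```

With the optional-field version, nothing forces a caller to check `error` before reading `data`; here the compiler narrows the type inside each `case`.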
Dimension 2: Mindset + Procedures (15 points)
Purpose: Provide philosophical framing and step-by-step workflows.
Scoring:
Clear Mindset/Philosophy (5 points)
Step-by-Step Procedures (5 points)
When/When-Not Guidance (5 points)
✅ Strong Mindset + Procedures (15/15):
# Test-Driven Development
## Mindset
Write tests BEFORE implementation. The test defines the contract; implementation fulfills it.
## Workflow
1. Red: Write failing test (verify it fails)
2. Green: Minimum code to pass
3. Refactor: Improve without breaking tests
## When to Apply
✅ New functions, features, bug fixes (reproduce first)
❌ UI styling, configuration, documentation
## When NOT to Apply
- Throwaway prototypes
- Generated code
- Trivial getters/setters
Dimension 3: Anti-Pattern Quality (15 points)
Purpose: Teach what NOT to do, with clear explanations of WHY.
Scoring:
NEVER Lists with WHY (5 points)
Concrete Examples (5 points)
Consequences Explained (5 points)
✅ Strong Anti-Patterns (14/15):
## Anti-Patterns
❌ **NEVER use string interpolation for SQL**
WHY: Opens SQL injection vulnerabilities
// BAD - Vulnerable to injection
db.query(`SELECT * FROM users WHERE id = ${userId}`)
// GOOD - Safe with prepared statements
db.query('SELECT * FROM users WHERE id = ?', [userId])
**Consequence:** Attacker can inject `1 OR 1=1` to dump entire table.
❌ **NEVER skip test failure verification**
WHY: False positives waste hours debugging phantom issues
**Consequence:** Test passes even with bugs, leading to production failures.
Dimension 4: Specification (15 points)
Purpose: Ensure proper frontmatter, single-task focus, and activation keywords.
Scoring:
Task Focus Declaration (5 points) ⭐ CRITICAL
Description Field Quality (7 points)
Proper Frontmatter (2 points)
Activation Keywords (1 point)
✅ Excellent Description (10/10):
---
name: bdd-testing
description: Behavior-Driven Development with Given-When-Then scenarios, Cucumber.js, Three Amigos collaboration, Example Mapping, living documentation, and acceptance criteria. Use when writing BDD tests, feature files, or planning discovery workshops.
---
Comprehensive, includes triggers, explains when to use
❌ Weak Description (4/10):
---
name: bdd-testing
description: BDD testing patterns
---
Too generic, no activation keywords, no usage guidance
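The gap between the two descriptions above can be roughly approximated in code. The sketch below is only an illustration: `checkDescription`, the `MIN_WORDS` floor of 10, and the "Use when" heuristic are assumptions made for this example, not rules defined by the skill-judge framework.

```typescript
interface DescriptionCheck {
  hasUsageTrigger: boolean; // true when the text contains a "Use when ..." clause
  wordCount: number;
  issues: string[];
}

// Assumption: 10 words is an arbitrary illustrative floor, not a framework rule.
const MIN_WORDS = 10;

// Hypothetical heuristic check for a frontmatter `description` value.
function checkDescription(description: string): DescriptionCheck {
  const words = description.trim().split(/\s+/).filter((w) => w.length > 0);
  const hasUsageTrigger = /\buse when\b/i.test(description);
  const issues: string[] = [];
  if (words.length < MIN_WORDS) {
    issues.push('too short to carry activation keywords');
  }
  if (!hasUsageTrigger) {
    issues.push('no "Use when ..." usage guidance');
  }
  return { hasUsageTrigger, wordCount: words.length, issues };
}
```

Run against the two examples, the weak description ("BDD testing patterns") trips both checks, while the excellent one passes both. A heuristic like this can flag obvious gaps but cannot replace the scored review.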
Dimension 5: Progressive Disclosure (15 points)
Purpose: Structure content for on-demand loading, not frontloading everything.
Scoring:
Navigation Hub Approach (8 points)
References Directory (4 points)
Category Organization (3 points)
✅ Excellent Progressive Disclosure (15/15):
bdd-testing/
├── SKILL.md (64 lines - navigation hub)
├── AGENTS.md (reference guide)
└── references/
    ├── principles-three-amigos.md (CRITICAL, 250 lines)
    ├── gherkin-syntax.md (HIGH, 180 lines)
    └── practices-tags.md (MEDIUM, 120 lines)
❌ Poor Progressive Disclosure (6/15):
bdd-testing/
└── SKILL.md (1,800 lines - everything frontloaded)
Dimension 6: Freedom Calibration (15 points)
Purpose: Balance prescription (rigid rules) vs flexibility (guidelines).
Scoring:
Rigid (Mindset skills): Strong rules, must follow
Balanced (Process skills): Clear steps with flexibility
Flexible (Tool skills): Options and trade-offs
✅ Well-Calibrated (14/15):
# Proof of Work (Mindset skill)
## Zero-Tolerance Rules
NEVER trust agent completion reports without verification.
ALWAYS show command output as proof.
ZERO exceptions to verification protocol.
Appropriately rigid for critical verification
❌ Miscalibrated (7/15):
# TypeScript Basics (Tool skill)
## Rules
ALWAYS use const for all variables.
NEVER use let or var under any circumstances.
Too rigid: `let` has valid use cases
Dimension 7: Pattern Recognition (10 points)
Purpose: Ensure the skill activates when needed via description keywords.
Scoring:
Remember: Best description = exhaustive trigger list + examples
Dimension 8: Practical Usability (15 points)
Purpose: Ensure the skill is immediately useful with clear examples.
Scoring:
Concrete Examples (5 points)
Runnable Code (5 points)
Clear Structure (5 points)
| Dimension | Max | Focus |
|---|---|---|
| D1: Knowledge Delta | 20 | Expert knowledge only |
| D2: Mindset + Procedures | 15 | Philosophy + workflows |
| D3: Anti-Pattern Quality | 15 | NEVER + WHY + consequences |
| D4: Specification | 15 | Description field critical |
| D5: Progressive Disclosure | 15 | Hub + references |
| D6: Freedom Calibration | 15 | Appropriate rigidity |
| D7: Pattern Recognition | 10 | Activation keywords |
| D8: Practical Usability | 15 | Concrete examples |
| TOTAL | 120 | A-grade = 108+ |
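The arithmetic behind the table can be sketched in TypeScript. The per-dimension maxima and the 108-point threshold come from the table above; the names `MAX_POINTS`, `totalScore`, and `isAGrade`, and the sample scores, are hypothetical and not part of the skill-judge framework.

```typescript
// Maximum points per dimension, copied from the summary table.
const MAX_POINTS = {
  D1: 20, // Knowledge Delta
  D2: 15, // Mindset + Procedures
  D3: 15, // Anti-Pattern Quality
  D4: 15, // Specification
  D5: 15, // Progressive Disclosure
  D6: 15, // Freedom Calibration
  D7: 10, // Pattern Recognition
  D8: 15, // Practical Usability
} as const;

type Dimension = keyof typeof MAX_POINTS;
type Scores = Record<Dimension, number>;

// A-grade threshold: 108 of 120 points (90%).
const A_GRADE_THRESHOLD = 108;

// Hypothetical helper: range-checks each dimension, then sums the scores.
function totalScore(scores: Scores): number {
  let total = 0;
  for (const dim of Object.keys(MAX_POINTS) as Dimension[]) {
    const value = scores[dim];
    if (value < 0 || value > MAX_POINTS[dim]) {
      throw new RangeError(`${dim} must be between 0 and ${MAX_POINTS[dim]}, got ${value}`);
    }
    total += value;
  }
  return total;
}

function isAGrade(scores: Scores): boolean {
  return totalScore(scores) >= A_GRADE_THRESHOLD;
}

// Made-up sample scores for illustration: they sum to 112, which clears the threshold.
const sampleScores: Scores = { D1: 19, D2: 15, D3: 14, D4: 13, D5: 15, D6: 14, D7: 9, D8: 13 };
```

Range-checking each dimension before summing keeps a typo such as 25/20 from inflating the total past the A-grade line.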
- framework-scoring-rubric.md - Detailed scoring methodology
- framework-quality-standards.md - A-grade requirements
- .agents/skills/skill-judge/
Install with Tessl CLI:
npx tessl i pantheon-ai/skill-quality-auditor