Audit and improve skill collections with a 9-dimension scoring framework (Knowledge Delta, Mindset, Anti-Patterns, Specification Compliance, Progressive Disclosure, Freedom Calibration, Pattern Recognition, Practical Usability, Eval Validation), duplication detection, remediation planning, baseline comparison, and CI quality gates; use when evaluating skill quality, generating remediation plans, detecting duplicates, validating artifact conventions, or enforcing publication thresholds.
Registry stats: 93 · 89% · "Does it follow best practices?" · Impact: 99% · 1.26x average score across 5 eval scenarios · Passed · No known issues
Complete evaluation methodology for assessing skill quality using the 9-dimension quality framework. This is the foundation for all quality auditing.
Canonical source reference: framework-dimensions.md
The quality framework evaluates skills across 9 dimensions totaling 140 points. Dimension 1 (Knowledge Delta) and Dimension 9 (Eval Validation) carry the highest weight at 20 points each: skills must contain expert-only knowledge AND be validated at runtime via tessl eval scenarios.
Target Score: ≥126 points (90%) = A-grade
### Dimension 1: Knowledge Delta (20 points)
Purpose: Ensure skill contains expert-only knowledge, not redundant information.
Scoring:
Core Principle: Skill = Expert Knowledge - What AI Assistants Already Know
Expert (KEEP):
Activation (BRIEF REMINDERS OK):
Redundant (DELETE):
❌ Teaching basic syntax (AI assistants know if/else, function, class)
❌ Copying official documentation (schema definitions, rule lists)
❌ Explaining fundamentals (what is REST, what is a database)
❌ Generic advice (write tests, use version control)
❌ Installation tutorials (npm install, pip install)
❌ Low Knowledge Delta (12/20):
```markdown
# TypeScript Basics
## Variables
Use `let` for mutable, `const` for immutable:
let count = 0
const name = "Alice"
## Functions
Functions can be declared or arrow:
function add(a: number, b: number) { return a + b }
const add = (a: number, b: number) => a + b
```
*Problem: AI assistants already know basic TypeScript syntax.*
✅ High Knowledge Delta (19/20):
```markdown
# TypeScript: Making Illegal States Unrepresentable
## The Pattern
Use discriminated unions to eliminate impossible states:
❌ BAD: Multiple optional fields create 16 possible states
type Request = {
  loading?: boolean
  error?: string
  data?: User
}
✅ GOOD: Tagged union with 3 valid states only
type Request =
  | { status: 'loading' }
  | { status: 'error', error: string }
  | { status: 'success', data: User }
## Why This Matters
Bad design allows bugs: `{ loading: true, data: user }` is impossible but TypeScript allows it.
Good design: TypeScript prevents impossible states at compile time.
```
*Expert pattern AI assistants don't know by default.*
### Dimension 2: Mindset + Procedures (15 points)
Purpose: Provide philosophical framing and step-by-step workflows.
Scoring:
- Clear Mindset/Philosophy (5 points)
- Step-by-Step Procedures (5 points)
- When/When-Not Guidance (5 points)
✅ Strong Mindset + Procedures (15/15):
```markdown
# Test-Driven Development
## Mindset
Write tests BEFORE implementation. The test defines the contract; implementation fulfills it.
## Workflow
1. Red: Write failing test (verify it fails)
2. Green: Minimum code to pass
3. Refactor: Improve without breaking tests
## When to Apply
✅ New functions, features, bug fixes (reproduce first)
❌ UI styling, configuration, documentation
## When NOT to Apply
- Throwaway prototypes
- Generated code
- Trivial getters/setters
```

### Dimension 3: Anti-Pattern Quality (15 points)
Purpose: Teach what NOT to do with clear explanations of WHY.
Scoring:
- NEVER Lists with WHY (5 points)
- Concrete Examples (5 points)
- Consequences Explained (5 points)
✅ Strong Anti-Patterns (14/15):
```markdown
## Anti-Patterns
❌ **NEVER use string interpolation for SQL**
WHY: Opens SQL injection vulnerabilities
// BAD - Vulnerable to injection
db.query(`SELECT * FROM users WHERE id = ${userId}`)
// GOOD - Safe with prepared statements
db.query('SELECT * FROM users WHERE id = ?', [userId])
**Consequence:** Attacker can inject `1 OR 1=1` to dump entire table.
❌ **NEVER skip test failure verification**
WHY: False positives waste hours debugging phantom issues
**Consequence:** Test passes even with bugs, leading to production failures.
```

### Dimension 4: Specification Compliance (15 points)
Purpose: Ensure proper frontmatter, single-task focus, activation keywords, and cross-harness portability.
Scoring:
- Task Focus Declaration (4 points) ⭐ CRITICAL
- Description Field Quality (6 points)
- Cross-Harness Portability (3 points) ⭐ CRITICAL: no harness-specific directories (`.opencode/`, `.claude/`, `.cursor/`, `.aider/`, `.continue/`); use portable paths (`scripts/`, `references/`, `templates/`)
- Self-Containment (penalties: up to -12 points) ⭐ CRITICAL
SKILL.md penalties (checked outside fenced code blocks):
- No `../` references outside fenced code blocks. Skills are installed to arbitrary locations; parent paths break when the skill is not in its original repo.
- No `skills/X/Y/Z` or other hardcoded repository paths outside fenced code blocks. Cross-skill dependencies should use skill names, not file paths.
- No `.context/`, `.agents/`, or other repo-root directories outside fenced code blocks.

scripts/ penalties (-1 per file with violation, cap -2 per category):
- No absolute repo paths in scripts (-2 max): Each script referencing `skills/X/Y/Z` paths loses 1 point (capped at -2 total).
- No repo-root directory references in scripts (-2 max): Each script referencing `.context/` or `.agents/` paths loses 1 point (capped at -2 total).

references/ penalties (-1 per file with violation, cap -2 per category):
- No absolute repo paths in references (-2 max): Each reference file referencing `skills/X/Y/Z` paths loses 1 point (capped at -2 total).
- No repo-root directory references in references (-2 max): Each reference file referencing `.context/` or `.agents/` paths loses 1 point (capped at -2 total).
WHY: Skills must be fully self-contained. When installed via tessl install or npx skills add, they land in arbitrary directories. Any reference to files outside the skill's own directory tree will break — whether in SKILL.md, scripts, or reference files.
IMPACT: Non-self-contained skills fail silently when installed outside their authoring repo.
Script Language Portability (bonus: +1 point)
- `scripts/` directories containing Python (`.py`), TypeScript (`.ts`), or JavaScript (`.js`) files earn a portability bonus.
- Shell scripts (`.sh`) remain the accepted default and receive no penalty.
- Expected shebangs: `#!/usr/bin/env python3` (Python), `#!/usr/bin/env bun` (TypeScript), `#!/usr/bin/env node` (JavaScript).
- Rationale: shell depends on external tools such as `jq` for structured data, and has GNU-vs-BSD divergence for `grep`/`sed`/`awk`.

Proper Frontmatter (1 point)
Activation Keywords (1 point)
References Section Format (bonus: +1 point)
- Awarded when the heading is exactly `## References`, it is the last H2 in SKILL.md, the content is a Markdown table with Topic, Reference, and When to Use columns, and every Reference cell is a markdown link.
- Not awarded for variant headings (`## Resources`, `## Quick Reference`), a bullet list instead of a table, bare URLs, plain-text paths, or missing required columns.
- Skills with no `references/` directory and no external resources may omit the section entirely.

**✅ Excellent Specification Compliance (15/15):**
````markdown
---
name: bdd-testing
description: Behavior-Driven Development with Given-When-Then scenarios, Cucumber.js, Three Amigos collaboration, Example Mapping, living documentation, and acceptance criteria. Use when writing BDD tests, feature files, or planning discovery workshops.
---
# BDD Testing
Execute test runner with portable path:
```bash
bun run scripts/run-tests.sh
```
Reference files use relative paths: references/file.md
````
*Perfect: comprehensive description, portable paths (scripts/, references/), no agent mentions*
**❌ Poor Specification Compliance (7/15):**
````markdown
---
name: bdd-testing
description: BDD testing patterns
---
# BDD Testing
For Claude Code users, run:
```bash
.opencode/scripts/run-tests.sh
```
For Cursor users, see .claude/docs/file.md
````
*Problems: weak description, harness-specific paths (.opencode/, .claude/), agent-specific references*
### References Section Standard
Every SKILL.md that has references or external resources MUST end with a `## References` section using a **3-column Markdown table** with columns `Topic`, `Reference`, and `When to Use`:
```markdown
## References
| Topic | Reference | When to Use |
| --- | --- | --- |
| Security patterns, caching, and trigger configuration | [Best Practices](references/best-practices.md) | Every time you generate a workflow |
| Pinned action versions and input/output specs | [Common Actions](references/common-actions.md) | When using any public action |
| Official workflow syntax and expression reference | [GitHub Actions Docs](https://docs.github.com/en/actions) | For syntax lookup |
```

Sub-sections (H3 headings) are allowed to group rows by theme when a skill has many references:
```markdown
## References
### Generators
| Topic | Reference | When to Use |
| --- | --- | --- |
| Tree API patterns for file operations | [Tree API Reference](references/tree-api-reference.md) | Any generator that reads or writes files |
### Executors
| Topic | Reference | When to Use |
| --- | --- | --- |
| ExecutorContext fields and lifecycle | [Executor Context API](references/executor-context-api.md) | Building a custom executor |
```

Rules:
| Rule | Requirement |
|---|---|
| Heading | Exactly `## References`; no variants (`## Resources`, `## See Also`, etc.) |
| Position | Last H2 section in the file |
| Format | Markdown table with Topic \| Reference \| When to Use columns; no bullet lists, no bare URLs |
| Reference column | Every cell in the Reference column MUST be a markdown link `[text](url)` |
| Topic column | One-line description of what the referenced file or resource covers |
| When to Use column | Concrete scenario that tells the agent when to load or consult the reference |
| Sub-sections | Optional H3 headings are allowed to group rows by theme |
| Omission | Allowed only when the skill has nothing to reference (no penalty) |
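These rules lend themselves to automated checking. A minimal sketch of such a validator (illustrative; the function name and exact heuristics are assumptions, not the shipped audit script):

```python
import re

REQUIRED_COLUMNS = ["Topic", "Reference", "When to Use"]

def check_references_section(skill_md: str) -> list[str]:
    """Return a list of rule violations; empty list means compliant."""
    problems = []
    h2s = re.findall(r"^## (.+)$", skill_md, flags=re.MULTILINE)
    if not h2s:
        return ["no H2 sections found"]
    if "References" not in h2s:
        return ["missing '## References' heading"]
    if h2s[-1] != "References":
        problems.append("'## References' is not the last H2")
    # Everything after the heading should contain the required table header.
    section = skill_md.split("## References", 1)[1]
    header = re.search(r"^\|(.+)\|$", section, flags=re.MULTILINE)
    if header is None:
        problems.append("no Markdown table in References section")
    else:
        cols = [c.strip() for c in header.group(1).split("|")]
        if cols != REQUIRED_COLUMNS:
            problems.append(f"table columns {cols} != {REQUIRED_COLUMNS}")
    return problems

good = (
    "## Setup\n...\n## References\n"
    "| Topic | Reference | When to Use |\n| --- | --- | --- |\n"
    "| X | [X](references/x.md) | When doing X |"
)
print(check_references_section(good))  # []
```

A fuller version would also verify that every Reference cell is a markdown link; this sketch covers the heading, position, and column rules.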
❌ Non-compliant (0 bonus points):
```markdown
## Resources
- references/file.md
- https://example.com/docs
```
Problems: wrong heading (`## Resources`), bullet list instead of table, bare path and bare URL.
```markdown
## References
- [Error Patterns](references/error-patterns.md) — common failure modes
```
Problems: bullet list format; a table with Topic/Reference/When to Use is required.
✅ Compliant (+1 bonus point):
```markdown
## References
| Topic | Reference | When to Use |
| --- | --- | --- |
| Common failure modes and remediation steps | [Error Patterns](references/error-patterns.md) | When diagnosing unexpected output or synth failures |
| Authoritative command and flag reference | [Official CLI Docs](https://example.com/cli) | For exact flag syntax lookup |
```

### Dimension 5: Progressive Disclosure (15 points)
Purpose: Structure content for on-demand loading, not frontloading everything.
Scoring:
- Navigation Hub Approach (5 points)
- References Directory (4 points)
- Category Organization (3 points)
- Lazy Loading Guidance (3 points) ⭐ REQUIRED
❌ NEVER list references without "When to Use" conditions
```markdown
## References
- references/dimensions.md
- references/scoring.md
- references/anti-patterns.md
```
This forces agents to load everything or guess what to load.
❌ NEVER use vague "When to Use" entries
```markdown
| Topic | Reference | When to Use |
| --- | --- | --- |
| Scoring | [Scoring Rubric](references/scoring.md) | For scoring |
| Anti-patterns | [Anti-Patterns](references/anti-patterns.md) | For anti-patterns |
```
"For scoring" is not actionable; it does not tell the agent when NOT to load the file.
✅ Explicit lazy-load conditions:
```markdown
| Topic | Reference | When to Use |
| --- | --- | --- |
| Per-dimension criteria and bonus rules | [Dimensions](references/dimensions.md) | Evaluating any individual dimension or understanding the rubric |
| Score thresholds and grade bands | [Scoring Rubric](references/scoring.md) | Calculating a total score or assigning a grade — skip if only auditing structure |
| NEVER/WHY/BAD/GOOD failure modes | [Anti-Patterns](references/anti-patterns.md) | Explaining why a dimension scored low or writing remediation guidance |
```

✅ AGENTS.md with explicit lazy-load instruction:
```markdown
## Usage Instructions
1. Read SKILL.md — navigation hub only
2. Identify your task from the task categories below
3. Load ONLY the references listed for that task
4. Do NOT pre-load all references
```

✅ Excellent Progressive Disclosure (15/15):
```text
bdd-testing/
├── SKILL.md (64 lines - navigation hub with actionable "When to Use" per reference)
├── AGENTS.md (explicit: "load only references needed for current task")
└── references/
    ├── principles-three-amigos.md (CRITICAL, 250 lines)
    ├── gherkin-syntax.md (HIGH, 180 lines)
    └── practices-tags.md (MEDIUM, 120 lines)
```

❌ Poor Progressive Disclosure (6/15):
```text
bdd-testing/
└── SKILL.md (1,800 lines - everything frontloaded)
```

❌ Missing Lazy Loading (10/15 — loses 3 points):
```text
bdd-testing/
├── SKILL.md (80 lines - good hub, but references table has no "When to Use" column)
├── AGENTS.md (says "load all references before starting")
└── references/
    ├── principles-three-amigos.md
    ├── gherkin-syntax.md
    └── practices-tags.md
```

### Dimension 6: Freedom Calibration (15 points)
Purpose: Balance prescription (rigid rules) vs flexibility (guidelines).
Scoring:
- Rigid (Mindset skills): Strong rules, must follow
- Balanced (Process skills): Clear steps with flexibility
- Flexible (Tool skills): Options and trade-offs
✅ Well-Calibrated (14/15):
```markdown
# Proof of Work (Mindset skill)
## Zero-Tolerance Rules
NEVER trust agent completion reports without verification.
ALWAYS show command output as proof.
ZERO exceptions to verification protocol.
```
*Appropriately rigid for critical verification.*
❌ Miscalibrated (7/15):
```markdown
# TypeScript Basics (Tool skill)
## Rules
ALWAYS use const for all variables.
NEVER use let or var under any circumstances.
```
*Too rigid - `let` has valid use cases.*
### Dimension 7: Pattern Recognition (10 points)
Purpose: Ensure skill activates when needed via description keywords.
Scoring:
Remember: Best description = exhaustive trigger list + examples
### Dimension 8: Practical Usability (15 points)
Purpose: Ensure skill is immediately useful with clear examples.
Scoring:
- Concrete Examples (5 points)
- Runnable Code (5 points)
- Clear Structure (5 points)
### Dimension 9: Eval Validation (20 points)
Purpose: Verify the skill has been validated at runtime through tessl eval scenarios, proving agents actually follow its instructions.
Scoring:
Core Principle: Static quality (D1-D8) is necessary but not sufficient. Runtime validation proves the skill actually changes agent behavior.
- Eval Directory Structure (4 points): `evals/` directory exists with proper layout
- Instruction Inventory (3 points): `instructions.json` present and non-empty; each instruction tagged with `why_given`: reminder, new knowledge, preference
- Coverage Statistics (6 points)
  - `summary.json` with `instructions_coverage` data (3 points)
- Valid Scenarios (4 points): ≥ 3 scenarios with complete structure (task.md + criteria.json + capability.txt)
- Criteria Quality (3 points)
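The structural criteria above can be checked with a small script. This sketch takes a mapping of paths (relative to `evals/`) to file contents; the function name and the assumption that `criteria.json` is a list of `{"weight": …}` entries are illustrative, not the documented schema:

```python
import json

REQUIRED_FILES = ("task.md", "criteria.json", "capability.txt")

def check_evals_layout(files: dict[str, str]) -> list[str]:
    """files maps paths like 'scenario-0/task.md' to contents; returns problems."""
    problems = []
    if not files.get("instructions.json"):
        problems.append("instructions.json missing or empty")
    scenarios = sorted({path.split("/")[0] for path in files
                        if path.startswith("scenario-")})
    if len(scenarios) < 3:
        problems.append(f"only {len(scenarios)} scenarios (need >= 3)")
    for scenario in scenarios:
        for name in REQUIRED_FILES:
            if f"{scenario}/{name}" not in files:
                problems.append(f"{scenario}: {name} missing")
        raw = files.get(f"{scenario}/criteria.json")
        if raw:
            # Assumed shape: a JSON list of {"weight": int} entries summing to 100.
            total = sum(c["weight"] for c in json.loads(raw))
            if total != 100:
                problems.append(f"{scenario}: criteria weights sum to {total}, not 100")
    return problems
```

An empty return value corresponds to full marks on the structural sub-criteria; each problem string maps to a specific point deduction.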
When `instructions.json` exists, its data enriches other dimensions:
- `why_given` distribution (new knowledge + preference vs reminders) provides a more accurate expert content ratio than shell heuristics alone.

Use the creating-eval-scenarios skill to generate evaluation scenarios:
```bash
# Ensure skill is packaged as a tessl tile first
tessl eval run <tile-path>
tessl eval view-status <status_id> --json
```

✅ High Eval Validation (19/20):
```text
skill-name/evals/
    instructions.json        # 28 instructions extracted
    summary.json             # 100% coverage, 5 scenarios
    summary_infeasible.json
    scenario-0/              # task.md + criteria.json (sum=100) + capability.txt
    scenario-1/
    scenario-2/
    scenario-3/
    scenario-4/
```

❌ Low Eval Validation (4/20):
```text
skill-name/evals/
    instructions.json        # Present but only 5 instructions
    # No summary.json, no scenarios
```

❌ Zero Eval Validation (0/20):
```text
skill-name/
    SKILL.md                 # No evals/ directory at all
```

| Dimension | Max | Priority | Focus |
|---|---|---|---|
| D1: Knowledge Delta | 20 | HIGHEST | Expert knowledge only |
| D2: Mindset + Procedures | 15 | HIGH | Philosophy + workflows |
| D3: Anti-Pattern Quality | 15 | HIGH | NEVER + WHY + consequences |
| D4: Specification | 15 | MEDIUM | Description field critical |
| D5: Progressive Disclosure | 15 | MEDIUM | Hub + references + lazy-load guidance |
| D6: Freedom Calibration | 15 | MEDIUM | Appropriate rigidity |
| D7: Pattern Recognition | 10 | LOW | Activation keywords |
| D8: Practical Usability | 15 | HIGH | Concrete examples |
| D9: Eval Validation | 20 | HIGHEST | Runtime validation via tessl evals |
| **TOTAL** | **140** | | **A-grade = 126+** |
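The thresholds in this table reduce to a small calculation; a sketch (the dimension maxima and the 126-point A-grade cutoff come from the table above, the function names are illustrative):

```python
# Dimension maxima from the summary table; they total 140.
DIMENSION_MAX = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15, "D5": 15,
    "D6": 15, "D7": 10, "D8": 15, "D9": 20,
}
MAX_POINTS = sum(DIMENSION_MAX.values())  # 140

def percentage(score: int) -> float:
    """Score as a percentage of the 140-point maximum, one decimal place."""
    return round(100 * score / MAX_POINTS, 1)

def is_a_grade(score: int) -> bool:
    """A-grade requires >= 126 points (90% of 140)."""
    return score >= 126

print(MAX_POINTS, percentage(126), is_a_grade(126))  # 140 90.0 True
```

Note the asymmetry this makes explicit: a skill at 125 points (89.3%) misses A-grade by a single point, which is exactly what penalties such as the -12 self-containment cap can cost.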
- framework-scoring-rubric.md - Detailed scoring methodology
- framework-quality-standards.md - A-grade requirements
- creating-eval-scenarios skill - Tessl eval scenario generation