Audit and improve skill collections with an 8-dimension scoring framework, duplication detection, remediation planning, and CI quality gates; use when evaluating skill quality, generating remediation plans, validating report format, or enforcing repository-wide skill artifact conventions.
Detailed scoring methodology for the skill-judge framework. Use this to understand how scores are calculated and ensure consistent evaluation.
Total Possible Score: 120 points
Passing Grade: 90 points (75%)
A-Grade Target: 108 points (90%)
Perfect Score: 120 points (100%)
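The percentages in the overview follow directly from the 120-point total. A minimal sketch of that arithmetic; the names `THRESHOLDS` and `as_percent` are illustrative and not part of the framework:

```python
TOTAL_POSSIBLE = 120

# Point thresholds as listed in the overview above.
THRESHOLDS = {"passing": 90, "a_grade": 108, "perfect": 120}


def as_percent(points: int, total: int = TOTAL_POSSIBLE) -> float:
    """Return points as a percentage of the total possible score."""
    return 100 * points / total


for name, points in THRESHOLDS.items():
    print(f"{name}: {points}/{TOTAL_POSSIBLE} = {as_percent(points):.0f}%")
```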
| Score | Criteria | Redundancy Level |
|---|---|---|
| 18-20 | Pure expert knowledge | <5% |
| 15-17 | Mostly expert | 5-15% |
| 12-14 | Acceptable balance | 15-30% |
| 9-11 | Needs improvement | 30-50% |
| 0-8 | Failing | >50% |
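The table above can be read as a lookup from a measured redundancy level to a score band. The sketch below is one possible encoding, not the framework's own implementation. The function name `redundancy_score_band` is hypothetical, and because the table lists 15%, 30%, and 50% in two adjacent bands, the code assumes a boundary value falls into the higher-scoring band.

```python
def redundancy_score_band(redundancy_pct: float) -> tuple[int, int]:
    """Map a redundancy percentage to the (low, high) score band in the table.

    Assumption: shared boundary values (15, 30, 50) resolve to the
    higher-scoring band; the table itself does not say which band wins.
    """
    if not 0 <= redundancy_pct <= 100:
        raise ValueError("redundancy_pct must be between 0 and 100")
    if redundancy_pct < 5:
        return (18, 20)  # Pure expert knowledge
    if redundancy_pct <= 15:
        return (15, 17)  # Mostly expert
    if redundancy_pct <= 30:
        return (12, 14)  # Acceptable balance
    if redundancy_pct <= 50:
        return (9, 11)   # Needs improvement
    return (0, 8)        # Failing


print(redundancy_score_band(12.5))
```

The function returns a band rather than a single number because the table only fixes a range; choosing the exact score within the band remains a judgment call for the evaluator.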
Evaluation Method:
| Score | Criteria |
|---|---|
| 13-15 | Clear mindset + detailed procedures + when/when-not |
| 10-12 | Has most elements, minor gaps |
| 7-9 | Missing key element |
| 0-6 | Generic or absent |
Component Breakdown:
| Score | Criteria |
|---|---|
| 13-15 | NEVER lists + concrete examples + consequences |
| 10-12 | Has most elements |
| 7-9 | Generic warnings |
| 0-6 | Missing or weak |
Component Breakdown:
| Score | Criteria |
|---|---|
| 13-15 | Perfect spec compliance |
| 10-12 | Minor issues |
| 7-9 | Missing key elements |
| 0-6 | Non-compliant |
Component Breakdown:
| Score | Criteria |
|---|---|
| 13-15 | Navigation hub + references/ + categories |
| 10-12 | Some organization, could improve |
| 7-9 | Everything frontloaded, >300 lines |
| 0-6 | No structure, >500 lines |
Component Breakdown:
| Score | Criteria |
|---|---|
| 13-15 | Appropriate for skill type |
| 10-12 | Slightly too rigid or loose |
| 7-9 | Mismatched calibration |
| 0-6 | Completely wrong |
Calibration Types:
| Score | Criteria |
|---|---|
| 9-10 | Rich keywords, comprehensive triggers |
| 7-8 | Good keywords, could expand |
| 5-6 | Basic keywords |
| 0-4 | Missing or poor |
Evaluation Method:
| Score | Criteria |
|---|---|
| 13-15 | Concrete + runnable + clear |
| 10-12 | Most examples good |
| 7-9 | Some weak examples |
| 0-6 | Abstract or missing |
Component Breakdown:
| Grade | Score Range | Interpretation |
|---|---|---|
| A+ | 114-120 | Exceptional quality |
| A | 108-113 | Meets all standards |
| B+ | 102-107 | Strong, minor improvements |
| B | 96-101 | Good, some gaps |
| C+ | 90-95 | Acceptable, needs work |
| C | 84-89 | Below standard |
| D | 78-83 | Significant issues |
| F | 0-77 | Failing |
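The grade table above translates into a simple threshold lookup. This is a sketch under the assumption that totals are whole numbers between 0 and 120; `GRADE_FLOORS` and `assign_grade` are illustrative names, not part of the published tooling.

```python
# (minimum total, grade) pairs taken from the grade table, highest first.
GRADE_FLOORS = [
    (114, "A+"),
    (108, "A"),
    (102, "B+"),
    (96, "B"),
    (90, "C+"),
    (84, "C"),
    (78, "D"),
    (0, "F"),
]


def assign_grade(total: int) -> str:
    """Return the letter grade for a total score out of 120."""
    if not 0 <= total <= 120:
        raise ValueError("total must be between 0 and 120")
    # The first floor the total reaches is its grade, since floors descend.
    for floor, grade in GRADE_FLOORS:
        if total >= floor:
            return grade
    raise AssertionError("unreachable: the 0 floor matches every valid total")


print(assign_grade(98))
```

Under this mapping the 90-point passing grade corresponds to the bottom of the C+ band, and the 108-point A-grade target to the bottom of the A band.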
1. Read the entire skill, including all references if present.
2. Apply the rubric to each of the 8 dimensions independently.
3. Sum all 8 dimension scores for a total out of 120.
4. Map the total score to a grade using the grade assignment table.
5. For scores below the A-grade target (108 points), identify the specific improvements needed.
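The summing and follow-up steps above can be sketched as follows. The per-dimension maxima are read from the top band of each rubric table, in the order the tables appear. Dimensions are identified by position only, and ranking them by points lost is an assumed heuristic for finding improvements, not something the framework prescribes. `DIMENSION_MAX` and `summarize` are hypothetical names.

```python
# Maximum points per dimension, in the order the rubric tables appear.
DIMENSION_MAX = [20, 15, 15, 15, 15, 15, 10, 15]
PASSING = 90
A_GRADE = 108


def summarize(scores: list[int]) -> dict:
    """Validate 8 dimension scores, total them, and flag where points were lost."""
    if len(scores) != len(DIMENSION_MAX):
        raise ValueError(f"expected {len(DIMENSION_MAX)} scores, got {len(scores)}")
    for position, (score, maximum) in enumerate(zip(scores, DIMENSION_MAX), start=1):
        if not 0 <= score <= maximum:
            raise ValueError(f"dimension {position}: {score} is outside 0-{maximum}")

    total = sum(scores)
    # Assumed heuristic: dimensions with the largest shortfall come first.
    gaps = [
        (maximum - score, position)
        for position, (score, maximum) in enumerate(zip(scores, DIMENSION_MAX), start=1)
        if score < maximum
    ]
    gaps.sort(key=lambda gap: -gap[0])  # stable sort keeps position order on ties

    return {
        "total": total,
        "passed": total >= PASSING,
        "a_grade": total >= A_GRADE,
        # Only list improvements when the total is below the A-grade target.
        "improve": [position for _, position in gaps] if total < A_GRADE else [],
    }


report = summarize([18, 10, 12, 15, 13, 12, 8, 10])
print(report)
```

In this example the total is 98, which passes but falls short of the A-grade target, so the report lists the dimensions to revisit, starting with those that lost the most points.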
High Knowledge Delta, Low Usability (18, 10): Expert content but lacks examples
Low Knowledge Delta, High Usability (10, 14): Tutorial-heavy, needs expert focus
Perfect Spec, Poor Content (15, 8): Great frontmatter, weak body
Balanced Scores (12-13 each): Consistent but not exceptional
framework-skill-judge-dimensions.md - Dimension definitions
framework-quality-standards.md - A-grade requirements
Install with Tessl CLI
npx tessl i pantheon-ai/skill-quality-auditor@0.1.4