
pantheon-ai/skill-quality-auditor

Audit and improve skill collections with a 9-dimension scoring framework (Knowledge Delta, Mindset, Anti-Patterns, Specification Compliance, Progressive Disclosure, Freedom Calibration, Pattern Recognition, Practical Usability, Eval Validation), duplication detection, remediation planning, baseline comparison, and CI quality gates; use when evaluating skill quality, generating remediation plans, detecting duplicates, validating artifact conventions, or enforcing publication thresholds.



references/framework-scoring-rubric.md

category: framework
priority: CRITICAL
source: quality evaluation methodology

Skill Scoring Rubric

Detailed scoring methodology for the 9-dimension quality framework. Use this to understand how scores are calculated and ensure consistent evaluation.

Scoring Overview

Total Possible Score: 140 points
Passing Grade: 105 points (75%)
A-Grade Target: 126 points (90%)
Perfect Score: 140 points (100%)

Dimension-by-Dimension Scoring

D1: Knowledge Delta (20 points)

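| Score | Criteria | Redundancy Level |
|-------|----------|------------------|
| 18-20 | Pure expert knowledge | <5% |
| 15-17 | Mostly expert | 5-15% |
| 12-14 | Acceptable balance | 15-30% |
| 9-11 | Needs improvement | 30-50% |
| 0-8 | Failing | >50% |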

Evaluation Method:

  1. Read entire skill content
  2. Identify content AI assistants already know
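  3. Calculate the expert ratio: Expert Content / Total Content (the redundancy level is the remaining share, 1 - ratio)
  4. Apply the scoring thresholds from the table above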
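The evaluation method above can be sketched as a small function. This is a minimal illustration, not the auditor's actual implementation: the function name `knowledge_delta_band`, the choice of "units" (lines, tokens, or paragraphs), and the treatment of boundary values such as exactly 15% are assumptions, since the rubric's ranges overlap at their edges.

```python
def knowledge_delta_band(expert_units: int, total_units: int) -> tuple[float, str]:
    """Return (redundancy fraction, D1 score band) for a skill.

    Assumptions (not stated in the rubric): content is counted in arbitrary
    units, redundancy is the share of non-expert content, and a value that
    sits exactly on a boundary is placed in the higher-scoring band.
    """
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= expert_units <= total_units:
        raise ValueError("expert_units must be between 0 and total_units")

    # Redundancy is whatever is left after removing expert content.
    redundancy = (total_units - expert_units) / total_units

    if redundancy < 0.05:
        band = "18-20"
    elif redundancy <= 0.15:
        band = "15-17"
    elif redundancy <= 0.30:
        band = "12-14"
    elif redundancy <= 0.50:
        band = "9-11"
    else:
        band = "0-8"
    return redundancy, band


if __name__ == "__main__":
    redundancy, band = knowledge_delta_band(expert_units=90, total_units=100)
    print(f"redundancy={redundancy:.0%}, band={band}")
```

The function returns a band rather than a single number because the rubric leaves the exact score within each band to the evaluator's judgment.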

D2: Mindset + Procedures (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | Clear mindset + detailed procedures + when/when-not |
| 10-12 | Has most elements, minor gaps |
| 7-9 | Missing key element |
| 0-6 | Generic or absent |

Component Breakdown:

  • Clear Mindset/Philosophy: 5 points
  • Step-by-Step Procedures: 5 points
  • When/When-Not Guidance: 5 points

D3: Anti-Pattern Quality (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | NEVER lists + concrete examples + consequences |
| 10-12 | Has most elements |
| 7-9 | Generic warnings |
| 0-6 | Missing or weak |

Component Breakdown:

  • NEVER Lists with WHY: 5 points
  • Concrete Examples: 5 points
  • Consequences Explained: 5 points

D4: Specification Compliance (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | Perfect spec compliance |
| 10-12 | Minor issues |
| 7-9 | Missing key elements |
| 0-6 | Non-compliant |

Component Breakdown:

  • Task Focus Declaration: 4 points
  • Description Field Quality: 6 points
  • Cross-Harness Portability: 3 points (CRITICAL for multi-agent compatibility)
  • Proper Frontmatter: 1 point
  • Activation Keywords: 1 point

Portability Requirements:

  • No harness-specific paths (.opencode/, .claude/, .cursor/): 1 point
  • No agent-specific references (Claude Code, Cursor Agent, etc.): 1 point
  • Relative paths from skill directory (scripts/, references/): 1 point
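The three portability checks above lend themselves to a simple text scan. The sketch below is a heuristic under stated assumptions: the function name `portability_points` is hypothetical, the agent-name list contains only the two examples the rubric gives, and "relative path" is approximated as "no `scripts/` or `references/` mention that is prefixed by another directory".

```python
import re

# Harness paths and agent names taken from the rubric's examples; the rubric
# says "etc.", so a real check would likely need a longer list.
HARNESS_PATHS = (".opencode/", ".claude/", ".cursor/")
AGENT_NAMES = ("claude code", "cursor agent")

# Heuristic: a "/scripts/" or "/references/" segment whose preceding slash is
# not part of "./" or "../" suggests an absolute or harness-prefixed path.
NON_RELATIVE = re.compile(r"(?<!\.)/(?:scripts|references)/")


def portability_points(skill_text: str) -> int:
    """Award up to 3 points, one per portability requirement."""
    lowered = skill_text.lower()
    points = 0
    if not any(path in skill_text for path in HARNESS_PATHS):
        points += 1  # no harness-specific paths
    if not any(name in lowered for name in AGENT_NAMES):
        points += 1  # no agent-specific references
    if not NON_RELATIVE.search(skill_text):
        points += 1  # scripts/ and references/ are addressed relatively
    return points


if __name__ == "__main__":
    print(portability_points("Run scripts/audit.py; see references/rubric.md"))
```

A plain substring scan will also flag harness paths that appear inside anti-pattern examples, so results are best treated as prompts for manual review rather than final scores.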

Bonus Points (each independent, up to +2 total):

  • Script Language Portability: +1 (Python/TS/JS scripts present in scripts/)
  • References Section Format: +1 (heading exactly ## References, last H2, Markdown table with Topic | Reference | When to Use columns, Reference column cells are markdown links)
    • 0 if: wrong heading name, bullet list instead of table, bare URLs, plain-text paths, missing required columns, or section missing when references exist
    • Omission allowed without penalty when skill has nothing to reference
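The References Section Format bonus is specific enough to check mechanically. The following is a rough sketch, assuming the skill body is available as a Markdown string; the function name `references_bonus` is hypothetical, and the check ignores edge cases such as `## ` lines inside fenced code blocks and the "omission allowed" case, which needs knowledge of whether the skill has anything to reference.

```python
import re

# A cell counts as a markdown link only if the whole cell is [text](target).
MARKDOWN_LINK = re.compile(r"^\[[^\]]+\]\([^)]+\)$")
REQUIRED_COLUMNS = ["Topic", "Reference", "When to Use"]


def split_row(row: str) -> list[str]:
    """Split a pipe-table row into trimmed cell strings."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def references_bonus(markdown: str) -> int:
    """Return 1 if the last H2 is a well-formed References table, else 0."""
    lines = markdown.splitlines()
    h2_positions = [i for i, line in enumerate(lines) if line.startswith("## ")]
    if not h2_positions:
        return 0
    last_h2 = h2_positions[-1]
    if lines[last_h2].strip() != "## References":
        return 0  # wrong heading name, or References is not the last H2

    rows = [ln for ln in lines[last_h2 + 1:] if ln.strip().startswith("|")]
    if len(rows) < 3:
        return 0  # need a header row, a separator row, and at least one entry
    if split_row(rows[0]) != REQUIRED_COLUMNS:
        return 0

    for row in rows[2:]:
        cells = split_row(row)
        if len(cells) != 3 or not MARKDOWN_LINK.match(cells[1]):
            return 0  # bare URL, plain-text path, or malformed row
    return 1
```

For example, a skill whose final section is a bullet list of file paths scores 0 under this check, which matches the rubric's "bullet list instead of table" rule.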

D5: Progressive Disclosure (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | Navigation hub + references/ + categories |
| 10-12 | Some organization, could improve |
| 7-9 | Everything frontloaded, >300 lines |
| 0-6 | No structure, >500 lines |

Component Breakdown:

  • Navigation Hub Approach: 8 points
  • References Directory: 4 points
  • Category Organization: 3 points

D6: Freedom Calibration (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | Appropriate for skill type |
| 10-12 | Slightly too rigid or loose |
| 7-9 | Mismatched calibration |
| 0-6 | Completely wrong |

Calibration Types:

  • Rigid (Mindset skills): Strong rules, must follow
  • Balanced (Process skills): Clear steps with flexibility
  • Flexible (Tool skills): Options and trade-offs

D7: Pattern Recognition (10 points)

| Score | Criteria |
|-------|----------|
| 9-10 | Rich keywords, comprehensive triggers |
| 7-8 | Good keywords, could expand |
| 5-6 | Basic keywords |
| 0-4 | Missing or poor |

Evaluation Method:

  • Count domain keywords in description
  • Check trigger scenarios present
  • Verify activation clarity

D8: Practical Usability (15 points)

| Score | Criteria |
|-------|----------|
| 13-15 | Concrete + runnable + clear |
| 10-12 | Most examples good |
| 7-9 | Some weak examples |
| 0-6 | Abstract or missing |

Component Breakdown:

  • Concrete Examples: 5 points
  • Runnable Code: 5 points
  • Clear Structure: 5 points

D9: Eval Validation (20 points)

| Score | Criteria |
|-------|----------|
| 17-20 | Complete evals, >=80% coverage, >=3 valid scenarios |
| 13-16 | Evals present, partial coverage |
| 7-12 | Evals directory exists, missing key files |
| 0-6 | Minimal or no eval structure |

Component Breakdown:

  • Eval Directory Structure: 4 points
  • Instruction Inventory (instructions.json): 3 points
  • Coverage Statistics (summary.json): 3 points
  • Coverage >= 80%: 3 points
  • Valid Scenarios (>=3 complete): 4 points
  • Criteria Quality (sum to 100): 3 points
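The D9 breakdown above is additive, so it can be illustrated as a checklist sum. This sketch assumes each component is all-or-nothing, which the rubric does not state; the `EvalFacts` container and its field names are hypothetical, and gathering the facts (reading `instructions.json`, `summary.json`, and the scenarios) is left out.

```python
from dataclasses import dataclass


@dataclass
class EvalFacts:
    """Observed facts about a skill's evals (hypothetical container)."""
    has_eval_directory: bool
    has_instructions_json: bool
    has_summary_json: bool
    coverage: float            # fraction between 0.0 and 1.0
    complete_scenarios: int
    criteria_sum_to_100: bool


def eval_validation_score(facts: EvalFacts) -> int:
    """Sum the D9 component points, treating each as all-or-nothing."""
    score = 0
    score += 4 if facts.has_eval_directory else 0
    score += 3 if facts.has_instructions_json else 0
    score += 3 if facts.has_summary_json else 0
    score += 3 if facts.coverage >= 0.80 else 0
    score += 4 if facts.complete_scenarios >= 3 else 0
    score += 3 if facts.criteria_sum_to_100 else 0
    return score


if __name__ == "__main__":
    facts = EvalFacts(True, True, True, 0.85, 2, True)
    print(eval_validation_score(facts))  # only two complete scenarios
```

An evaluator who awards partial credit within a component would need a finer-grained model than these booleans.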

Enrichment: When instructions.json exists, D1 and D3 scores are enriched with instruction classification data (why_given distribution for D1, anti-pattern instruction count for D3).

Grade Assignment

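| Grade | Score Range | Interpretation |
|-------|-------------|----------------|
| A+ | 133-140 | Exceptional quality |
| A | 126-132 | Meets all standards |
| B+ | 119-125 | Strong, minor improvements |
| B | 112-118 | Good, some gaps |
| C+ | 105-111 | Acceptable, needs work |
| C | 98-104 | Below standard |
| D | 91-97 | Significant issues |
| F | 0-90 | Failing |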
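The grade table is a lookup on the lower bound of each range. A minimal sketch, assuming integer totals between 0 and 140 (the function name `grade_for` is illustrative, and how D4 bonus points interact with the 140-point ceiling is not specified in the rubric):

```python
# (lower bound, grade) pairs from the grade table, highest first.
GRADE_FLOORS = [
    (133, "A+"),
    (126, "A"),
    (119, "B+"),
    (112, "B"),
    (105, "C+"),
    (98, "C"),
    (91, "D"),
    (0, "F"),
]


def grade_for(total: int) -> str:
    """Map a total score (0-140) to its letter grade."""
    if not 0 <= total <= 140:
        # Assumption: totals are capped at 140 even when bonus points apply.
        raise ValueError("total must be between 0 and 140")
    for floor, grade in GRADE_FLOORS:
        if total >= floor:
            return grade
    raise AssertionError("unreachable: the final floor is 0")


if __name__ == "__main__":
    for total in (140, 126, 118, 105, 104, 90):
        print(total, grade_for(total))
```

Note that the passing grade of 105 points coincides with the bottom of the C+ range, so C, D, and F are all failing grades under this rubric.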

Scoring Process

Step 1: Read and Understand

Read the entire skill, including all references if present.

Step 2: Score Each Dimension

Apply the rubric to each of the 9 dimensions independently.

Step 3: Calculate Total

Sum all 9 dimension scores for a total out of 140.
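As a worked illustration of this step, the sketch below validates each dimension score against its maximum, then reports the total, the percentage, and whether the passing (105) and A-grade (126) thresholds are met. The dimension keys `D1` to `D9` follow the rubric's numbering; the function name `summarize` and the example scores are made up for illustration, and D4 bonus points are not modeled.

```python
# Maximum points per dimension, as listed in the rubric headings.
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15, "D5": 15,
    "D6": 15, "D7": 10, "D8": 15, "D9": 20,
}
PASSING_TOTAL = 105
A_GRADE_TOTAL = 126


def summarize(scores: dict[str, int]) -> dict[str, object]:
    """Validate dimension scores and compute the overall result."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly D1 through D9")
    for dimension, value in scores.items():
        if not 0 <= value <= MAX_POINTS[dimension]:
            raise ValueError(f"{dimension} must be 0-{MAX_POINTS[dimension]}")

    total = sum(scores.values())
    max_total = sum(MAX_POINTS.values())  # 140
    return {
        "total": total,
        "percent": round(100 * total / max_total, 1),
        "passes": total >= PASSING_TOTAL,
        "a_grade": total >= A_GRADE_TOTAL,
    }


if __name__ == "__main__":
    example = {"D1": 18, "D2": 13, "D3": 12, "D4": 14, "D5": 13,
               "D6": 13, "D7": 9, "D8": 10, "D9": 16}
    print(summarize(example))
```

With the example scores the total is 118 of 140, which passes but falls short of the A-grade target.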

Step 4: Assign Grade

Map the total score to a grade using the Grade Assignment table.

Step 5: Identify Improvements

For totals below the A-grade target (126 points), identify the specific improvements needed.

Common Score Patterns

High Knowledge Delta, Low Usability (D1 = 18, D8 = 10): Expert content, but lacks examples
Low Knowledge Delta, High Usability (D1 = 10, D8 = 14): Tutorial-heavy, needs expert focus
Perfect Spec, Poor Content (15, 8): Great frontmatter, weak body
Balanced Scores (12-13 each): Consistent but not exceptional

See Also

  • framework-dimensions.md - Dimension definitions
  • framework-quality-standards.md - A-grade requirements
