Comprehensive quality audit for Claude Code agents, skills, and commands with comparative analysis
Install with Tessl CLI:

```shell
npx tessl i github:FlorianBruniaux/claude-code-ultimate-guide --skill audit-agents-skills
```
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
Problem: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, making "agent bugs" the top reported challenge (cited by 18% of teams).
Solution: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
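The weighted scoring reduces to a points sum over pass/fail criteria. A minimal sketch, assuming a per-file `results` map and the criterion shape used in `scoring/criteria.yaml` (the argument names are illustrative):

```python
def file_score(results, criteria):
    """Percentage score for one file.

    results:  {criterion_id: bool} pass/fail outcome per criterion (assumed shape)
    criteria: list of {"id": ..., "points": ...} entries from criteria.yaml
    """
    earned = sum(c["points"] for c in criteria if results.get(c["id"]))
    max_pts = sum(c["points"] for c in criteria)
    return round(earned / max_pts * 100, 1)
```

For example, passing a 3-point criterion but failing a 1-point one yields `file_score(...) == 75.0`.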
Key Features:
| Mode | Usage | Output |
|---|---|---|
| Quick Audit | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| Full Audit | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| Comparative | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |
Default: Full Audit (recommended for first run)
The 16-criteria framework is derived from:
Weight Rationale:
Grade Standards:
Industry Alignment: The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).
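Only the 80% production threshold (Grade B minimum) is fixed by this spec; the other letter-grade cutoffs below are assumptions, chosen to be consistent with the example scores in the sample report (78.1 → C, 85.2 → B, 92.0 → A):

```python
def grade(score_pct):
    # Cutoffs are assumed except the 80% production threshold (Grade B minimum)
    if score_pct >= 90: return "A"
    if score_pct >= 80: return "B"
    if score_pct >= 70: return "C"
    if score_pct >= 60: return "D"
    return "F"

def production_ready(score_pct):
    return grade(score_pct) in ("A", "B")  # meets the 80% threshold
```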
Scan directories:
.claude/agents/
.claude/skills/
.claude/commands/
examples/agents/ (if exists)
examples/skills/ (if exists)
examples/commands/ (if exists)

Classify files by type (agent/skill/command)
Load reference templates (for Comparative mode):
guide/examples/agents/ (benchmark files)
guide/examples/skills/ (benchmark files)
guide/examples/commands/ (benchmark files)

Load scoring criteria from scoring/criteria.yaml:
```yaml
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)
```

For each file:
`(points / max_points) × 100`

For each project file:
`template_score - project_score`

Example:
```
Project file: .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)
Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)
Total gap: 16 points (explains C vs A difference)
```

Markdown Report (audit-report.md):
JSON Output (audit-report.json):
```json
{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}
```

For each failing criterion, generate actionable fix:
### File: .claude/agents/debugging-specialist.md
**Issue**: Missing anti-hallucination measures (2 points lost)
**Fix**:
Add this section after "Methodology":
## Source Verification
- Always cite sources for technical claims
- Use phrases: "According to [documentation]...", "Based on [tool output]..."
- If uncertain, state: "I don't have verified information on..."
- Never invent: statistics, version numbers, API signatures, stack traces
**Detection**: Grep for keywords: "verify", "cite", "source", "evidence"

See scoring/criteria.yaml for complete definitions. Summary:
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
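Each Max Points cell is weight × criteria count, and the rows should sum to the type's `max_points`. A quick consistency check over the agents table:

```python
# (weight, criteria count) per category, taken from the agents table
agent_categories = {
    "Identity": (3, 4),
    "Prompt Quality": (2, 4),
    "Validation": (1, 4),
    "Design": (2, 4),
}
max_points = {name: w * n for name, (w, n) in agent_categories.items()}
total = sum(max_points.values())  # matches agents.max_points: 32
```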
Key Criteria:
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key Criteria:
- No hardcoded paths (e.g. /Users/, /home/)

| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |
Key Criteria:
- Uses the $ARGUMENTS placeholder

```python
import yaml
import re

def parse_frontmatter(content):
    """Extract and parse the YAML frontmatter block of a markdown file."""
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None
```

```python
def has_keywords(text, keywords):
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)

# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```

```python
def jaccard_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")
```

```python
def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 0.75 words, i.e. ~1.3 tokens per word
    word_count = len(text.split())
    return int(word_count * 1.3)

# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")
```

Source: LangChain Agent Report 2026 (public report, pages 14-22)
Key Findings:
Implications:
This skill addresses:
# Quick Audit: Agents/Skills/Commands
**Files**: 15 (5 agents, 8 skills, 2 commands)
**Critical Issues**: 3 files fail top-5 criteria
## Top-5 Criteria (Pass/Fail)
| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|------|------------|--------------|----------------|--------------------|----------|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |
## Action Required
1. **Add error handling**: 5 files
2. **Remove hardcoded paths**: 3 files
3. **Add usage examples**: 4 files

See Phase 4: Report Generation above for full structure.
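A row of the pass/fail table above could be rendered with a small helper; the function and argument names here are illustrative, not part of the skill's API:

```python
def checklist_row(path, checks):
    """Render one markdown table row from {criterion: passed} results."""
    marks = " | ".join("✅" if ok else "❌" for ok in checks.values())
    return f"| {path} | {marks} |"

row = checklist_row("agent1.md", {
    "Valid Name": True, "Has Triggers": True, "Error Handling": False,
    "No Hardcoded Paths": True, "Examples": False,
})
# "| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |"
```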
# Comparative Audit
## Project vs Templates
| File | Project Score | Template Score | Gap | Top Missing |
|------|---------------|----------------|-----|-------------|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |
## Recommendations
Focus on these gaps to reach template quality:
1. **Anti-hallucination measures** (8 files): Add source verification sections
2. **Edge case documentation** (5 files): Add failure scenario examples
3. **Integration documentation** (4 files): List compatible agents/skills

```
# In Claude Code
Use skill: audit-agents-skills

# Specify path
Use skill: audit-agents-skills for ~/projects/my-app
```

```
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```

```
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```

```bash
#!/bin/bash
# .git/hooks/pre-commit
# Run quick audit on changed agent/skill/command files
changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")
if [ -n "$changed_files" ]; then
  echo "Running quick audit on changed files..."
  # Run audit (requires Claude Code CLI wrapper)
  # Exit with 1 if any file scores <80%
fi
```

```yaml
name: Audit Agents/Skills
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quality audit
        run: |
          # Run audit skill
          # Parse JSON output
          # Fail if overall_score < 80
```

| Aspect | Command (/audit-agents-skills) | Skill (this file) |
|---|---|---|
| Scope | Current project only | Multi-project, comparative |
| Output | Markdown report | Markdown + JSON |
| Speed | Fast (5-10 min) | Slower (10-20 min with comparative) |
| Depth | Standard 16 criteria | Same + benchmark analysis |
| Fix suggestions | Via --fix flag | Built-in with recommendations |
| Programmatic | Terminal output | JSON for CI/CD integration |
| Best for | Quick checks, dev workflow | Deep audits, quality tracking |
Recommendation: Use command for daily checks, skill for release gates and quality tracking.
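The "parse JSON output, fail if overall_score < 80" step in the pre-commit hook and CI workflow is left as a comment above. One possible gate script, assuming the `audit-report.json` schema shown in Phase 4:

```python
import json
import sys

def passes_gate(report_path, threshold=80.0):
    """True if the audit report meets the production threshold."""
    with open(report_path) as f:
        report = json.load(f)
    return report["summary"]["overall_score"] >= threshold

# In a hook or CI step, exit nonzero when the gate fails:
# sys.exit(0 if passes_gate("audit-report.json") else 1)
```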
Edit scoring/criteria.yaml:
```yaml
agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"
```

Version bump: Increment version in frontmatter when criteria change.
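When adding criteria, a small sanity check helps catch entries missing required fields. The required field set here is inferred from the examples in this file, not from a published schema:

```python
# Field set inferred from the criteria examples above (assumption)
REQUIRED_FIELDS = {"id", "name", "points", "detection"}

def invalid_criteria(criteria):
    """Return ids of criteria entries missing any required field."""
    return [c.get("id", "?") for c in criteria if REQUIRED_FIELDS - c.keys()]

# Example
entries = [
    {"id": "A1.5", "name": "API versioning specified",
     "points": 3, "detection": "mentions API version or compatibility"},
    {"id": "A1.6", "name": "Incomplete"},  # missing points and detection
]
# invalid_criteria(entries) -> ["A1.6"]
```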
To support new file types (e.g., "workflows"):
scoring/criteria.yaml:
```yaml
workflows:
  max_points: 24
  categories: [...]
```

Related files: .claude/commands/audit-agents-skills.md (command variant); benchmark templates in examples/agents/, examples/skills/, examples/commands/

v1.0.0 (2026-02-07):