Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
26
17%: Does it follow best practices?
Impact: no eval scenarios have been run
Passed: no known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/agent-orchestration-improve-agent/SKILL.md
Quality
Discovery
22%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and lacks concrete actions, explicit trigger conditions, and natural user language. It reads more like a high-level topic heading than a functional skill description. The absence of a 'Use when...' clause and specific actionable capabilities makes it difficult for Claude to reliably select this skill at the right time.
Suggestions
Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user asks to improve, optimize, or debug an AI agent, refine agent prompts, or evaluate agent performance.'
Replace abstract phrases with concrete actions, e.g., 'Analyzes agent outputs for failure patterns, rewrites and refines system prompts, designs evaluation criteria, and runs iterative testing cycles.'
Include natural trigger terms users would say, such as 'optimize agent', 'fix agent behavior', 'improve prompt', 'agent not working', 'evaluate agent responses'.
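The three suggestions above can be illustrated with a rough heuristic check. This is only a sketch: `check_description`, `TRIGGER_TERMS`, and the `revised` wording are hypothetical names and text introduced here, and the check is not how Tessl actually scores discovery.

```python
import re

# Hypothetical trigger phrases, copied from the suggestions above;
# this is not an official Tessl list.
TRIGGER_TERMS = [
    "optimize agent",
    "fix agent behavior",
    "improve prompt",
    "agent not working",
    "evaluate agent responses",
]


def check_description(description: str) -> dict:
    """Rough heuristic: does the description say when to use the skill?"""
    text = description.lower()
    return {
        # True if the description contains an explicit 'Use when' clause.
        "has_use_when": re.search(r"\buse when\b", text) is not None,
        # Trigger phrases that appear verbatim in the description.
        "trigger_terms": [t for t in TRIGGER_TERMS if t in text],
    }


# The description as reviewed.
current = (
    "Systematic improvement of existing agents through performance "
    "analysis, prompt engineering, and continuous iteration."
)

# One possible rewrite (an assumption, not the maintainer's wording).
revised = (
    "Analyzes agent outputs for failure patterns, rewrites system prompts, "
    "and runs iterative testing cycles. Use when the user asks to optimize "
    "agent behavior, improve prompt quality, or reports an agent not working."
)

if __name__ == "__main__":
    print(check_description(current))
    print(check_description(revised))
```

A substring check like this only catches exact phrases, so it is best treated as a quick lint before running the real review command.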
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'systematic improvement', 'performance analysis', 'prompt engineering', and 'continuous iteration' without listing concrete actions. It doesn't specify what specific tasks are performed (e.g., 'rewrite prompts', 'analyze error logs', 'run A/B tests'). | 1 / 3 |
| Completeness | The description partially addresses 'what' (improving agents through analysis and iteration) but is vague, and completely lacks a 'when' clause or any explicit trigger guidance for when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'agents', 'prompt engineering', and 'performance analysis' that users might mention, but misses common variations like 'optimize prompts', 'debug agent', 'improve accuracy', 'agent evaluation', 'prompt tuning', or 'agent testing'. | 2 / 3 |
| Distinctiveness Conflict Risk | The focus on 'agents' and 'prompt engineering' provides some specificity, but 'performance analysis' and 'continuous iteration' are generic enough to overlap with general debugging, optimization, or code improvement skills. | 2 / 3 |
| Total | | 6 / 12 Passed |
Implementation
12%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive but overly verbose guide that reads more like a general knowledge article on agent optimization than an actionable skill for Claude. It explains many concepts Claude already understands (versioning, A/B testing, statistical significance), contains no executable code or concrete tool commands, and packs everything into a single monolithic file. The workflow structure is reasonable but undermined by the lack of specificity and validation checkpoints.
Suggestions
Replace placeholder pseudo-commands (e.g., 'Use: context-manager') with actual executable code, CLI commands, or concrete tool invocations that Claude can run.
Cut at least 60% of the content by removing explanations of concepts Claude already knows (semantic versioning, A/B testing methodology, Cohen's d, inter-rater reliability) and focus only on project-specific instructions.
Split detailed sections (evaluation metrics, prompt engineering techniques, deployment procedures) into separate referenced files to improve progressive disclosure.
Add explicit validation gates between phases with concrete pass/fail criteria, e.g., 'Do not proceed to Phase 3 until baseline metrics document is generated and reviewed.'
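The last suggestion, explicit gates with pass/fail criteria, could look something like the sketch below. The names `PhaseGate` and `may_proceed` and the two criteria (artifact generated, artifact reviewed) are assumptions for illustration; the reviewed skill defines no such helper.

```python
from dataclasses import dataclass


@dataclass
class PhaseGate:
    """Pass/fail criteria that must hold before the next phase starts."""
    name: str
    artifact_exists: bool  # e.g. the baseline metrics document was generated
    reviewed: bool         # e.g. a human or reviewer agent signed off on it


def may_proceed(gate: PhaseGate) -> tuple[bool, str]:
    """Return (ok, reason) so the caller can stop and report why."""
    if not gate.artifact_exists:
        return False, f"{gate.name}: required artifact is missing"
    if not gate.reviewed:
        return False, f"{gate.name}: artifact has not been reviewed"
    return True, f"{gate.name}: gate passed"


if __name__ == "__main__":
    # Baseline metrics exist but nobody has reviewed them yet,
    # so the workflow should not move on to Phase 3.
    gate = PhaseGate("Phase 2 -> Phase 3", artifact_exists=True, reviewed=False)
    ok, reason = may_proceed(gate)
    print(ok, reason)
```

Returning a reason alongside the boolean gives the agent a concrete message to surface when a gate fails, which also makes the loop back to re-optimization an explicit decision point.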
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at 300+ lines. Much of the content describes general best practices Claude already knows (A/B testing methodology, semantic versioning, statistical significance, human evaluation protocols). The skill reads like a textbook chapter rather than actionable instructions, with extensive lists of concepts that don't add novel information. | 1 / 3 |
| Actionability | Despite its length, the skill contains no executable code or concrete commands. References like 'Use: context-manager' and 'Use: prompt-engineer' are pseudocode pointing to undefined tools. The code blocks are templates with placeholders ([X%], [Y]) or abstract configuration specs rather than copy-paste-ready commands. There's nothing Claude can directly execute. | 1 / 3 |
| Workflow Clarity | The four-phase structure provides a clear sequence (analyze → improve → test → deploy), and the rollback procedures include explicit triggers and steps. However, validation checkpoints between phases are implicit rather than explicit, and the feedback loop between testing failure and re-optimization is not clearly articulated as a concrete decision point. | 2 / 3 |
| Progressive Disclosure | The entire skill is a monolithic wall of text with no references to external files despite being long enough to warrant splitting. Detailed content on A/B testing frameworks, evaluation metrics, deployment strategies, and prompt engineering techniques is all inlined when it could live in separate reference documents. No bundle files exist to support this content. | 1 / 3 |
| Total | | 5 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
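A check in the spirit of the `frontmatter_unknown_keys` warning might be sketched as follows. The allow-list `ALLOWED_KEYS`, the helper names, and the `risk` key in the example are all assumptions for illustration; the report does not say which key triggered the warning, and the real validator's rules may differ.

```python
# Assumed allow-list, for illustration only; consult the skill spec for the real set.
ALLOWED_KEYS = {"name", "description", "license", "metadata"}


def frontmatter_keys(skill_md: str) -> list[str]:
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Top-level keys start at column 0 and contain a colon;
        # indented lines belong to a nested mapping and are skipped.
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def unknown_keys(skill_md: str) -> list[str]:
    """Keys a validator following ALLOWED_KEYS would flag."""
    return [k for k in frontmatter_keys(skill_md) if k not in ALLOWED_KEYS]


# Hypothetical SKILL.md; 'risk' stands in for whatever key was flagged.
example = """---
name: agent-orchestration-improve-agent
description: Systematic improvement of existing agents.
risk: unknown
metadata:
  author: example
---
# Body
"""

if __name__ == "__main__":
    print(unknown_keys(example))
```

Following the warning's own advice, a flagged key would either be removed or moved under `metadata`, where this sketch no longer reports it.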
8854d4e