Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
1.24x average score across 3 eval scenarios
Passed
No known issues
Optimize this skill with Tessl:
npx tessl skill review --optimize ./skills/agent-orchestration-improve-agent/SKILL.md

Quality
Discovery
14%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description relies heavily on abstract buzzwords ('systematic improvement', 'continuous iteration') without specifying concrete actions or when the skill should be triggered. It lacks a 'Use when...' clause and would be difficult for Claude to distinguish from other skills related to prompt engineering, agent development, or general optimization tasks.
Suggestions
Add a 'Use when...' clause with specific trigger scenarios, e.g., 'Use when the user wants to improve an existing agent's performance, debug agent behavior, refine prompts, or run evaluations.'
Replace abstract language with concrete actions, e.g., 'Analyzes agent outputs against expected results, rewrites system prompts, designs evaluation criteria, and iterates on tool definitions.'
Include natural trigger terms users would say, such as 'optimize agent', 'improve prompts', 'agent not working', 'eval results', 'agent accuracy', or 'prompt tuning'.
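The three suggestions above could combine into frontmatter along these lines; the `name` value and exact wording here are illustrative, not part of the reviewed skill:

```yaml
---
name: improve-agent
description: >
  Analyzes an existing agent's outputs against expected results, rewrites
  system prompts, designs evaluation criteria, and iterates on tool
  definitions. Use when the user wants to improve an agent's performance,
  debug agent behavior, refine prompts, or run evaluations (e.g. "optimize
  agent", "improve prompts", "agent not working", "agent accuracy").
---
```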
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses vague, abstract language like 'systematic improvement', 'performance analysis', and 'continuous iteration' without listing concrete actions. These are buzzwords rather than specific capabilities. | 1 / 3 |
| Completeness | The 'what' is vaguely stated and the 'when' is entirely missing. There is no 'Use when...' clause or equivalent explicit trigger guidance, which per the rubric should cap completeness at 2, but since the 'what' is also weak, this scores a 1. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'agents', 'prompt engineering', and 'performance analysis' that users might mention, but misses common variations like 'optimize prompts', 'debug agent', 'improve accuracy', 'eval', 'benchmarks', or 'agent tuning'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The description is very generic and could overlap with many skills related to prompt writing, code optimization, debugging, testing, or general agent development. 'Systematic improvement' and 'continuous iteration' are too broad to carve out a clear niche. | 1 / 3 |
| Total | | 5 / 12 (Passed) |
Implementation
12%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is excessively verbose and largely describes general knowledge about agent optimization, A/B testing, deployment strategies, and prompt engineering that Claude already possesses. It lacks any concrete, executable code or real tool integrations—all 'code blocks' are pseudocode with placeholder tools. The content would benefit enormously from being condensed to ~50 lines of genuinely novel, actionable guidance with references to separate detail files.
Suggestions
Replace all pseudocode blocks with real, executable commands or code examples that reference actual tools and produce concrete outputs.
Reduce the content by 70-80% by removing general knowledge (A/B testing methodology, version numbering conventions, evaluation metrics definitions) and keeping only project-specific or novel instructions.
Split detailed phase content into separate referenced files (e.g., PROMPT_ENGINEERING.md, TESTING.md, DEPLOYMENT.md) and keep SKILL.md as a concise overview with clear navigation links.
Add explicit validation checkpoints between phases (e.g., 'Do not proceed to Phase 2 until baseline report shows X') to create concrete feedback loops.
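The first and last suggestions can be illustrated with a minimal, runnable replacement for one of the skill's metric pseudocode blocks. The results file name, its JSON schema, and the 50% threshold are all assumptions for the sake of the sketch, not taken from the skill:

```python
import json

def baseline_pass_rate(path: str) -> float:
    """Fraction of eval scenarios that passed in a results file.

    Assumed schema: a JSON list of {"scenario": str, "passed": bool}
    records, as emitted by whatever eval harness the skill invokes.
    """
    with open(path) as f:
        results = json.load(f)
    return sum(1 for r in results if r["passed"]) / len(results)

def checkpoint(rate: float, threshold: float = 0.5) -> None:
    """Explicit gate between phases: refuse to continue below threshold."""
    if rate < threshold:
        raise SystemExit(
            f"baseline {rate:.0%} is below {threshold:.0%}; "
            "do not proceed to prompt changes yet"
        )
```

Unlike a placeholder template with '[X%]' brackets, this produces a concrete number and a hard stop, which is exactly the kind of validation checkpoint the last suggestion asks for.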
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~300+ lines. Much of the content describes general software engineering practices (A/B testing, staged rollouts, version management) and agent optimization concepts that Claude already knows. Lists like 'correction patterns, clarification requests, task abandonment' are obvious categories that don't need enumeration. The skill explains concepts rather than providing actionable, novel instructions. | 1 / 3 |
| Actionability | Despite its length, the skill contains no executable code or concrete commands. The code blocks are pseudocode or placeholder templates (e.g., 'Use: context-manager', 'Use: prompt-engineer', 'Use: parallel-test-runner') referencing tools that aren't real or defined. Metrics templates use placeholder brackets like '[X%]'. Nothing is copy-paste ready or directly executable. | 1 / 3 |
| Workflow Clarity | The four-phase structure provides a clear sequence (analyze → improve → test → deploy), and the rollback procedures include explicit triggers and steps. However, validation checkpoints between phases are implicit rather than explicit, and the feedback loop between testing failure and re-optimization is not clearly articulated as a concrete decision point. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no references to external files. All content is inline despite being far too long for a single SKILL.md. Phases 2-4 could each be separate reference documents. There are no links to supplementary materials, examples files, or detailed guides. | 1 / 3 |
| Total | | 5 / 12 (Passed) |
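The implicit feedback loop flagged under Workflow Clarity can be made a concrete decision point. This is a sketch only: `run_evals` and `optimize_prompt` are hypothetical stand-ins for the skill's real test and improve phases:

```python
def improve(agent, run_evals, optimize_prompt, target=0.9, max_rounds=5):
    """Iterate optimize -> re-test until evals pass or the budget runs out.

    run_evals(agent) -> float pass rate; optimize_prompt(agent, rate) -> agent.
    Both callables are hypothetical stand-ins for the skill's phases.
    """
    rate = run_evals(agent)  # Phase 1: establish the baseline
    for _ in range(max_rounds):
        if rate >= target:   # explicit exit condition, not an implicit loop
            return agent, rate
        agent = optimize_prompt(agent, rate)  # Phase 2: improve
        rate = run_evals(agent)               # Phase 3: re-test
    raise RuntimeError(f"no convergence: best pass rate {rate:.0%}")
```

Writing the loop out like this forces the skill to state when re-optimization stops, which the review notes is currently missing.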
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |
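The frontmatter_unknown_keys warning is typically resolved by nesting unrecognized top-level keys under `metadata`. The key names below are illustrative; the report does not say which keys triggered the warning:

```yaml
# Before: `author` as a top-level key trips frontmatter_unknown_keys.
---
name: improve-agent
author: example-team
---

# After: unknown keys moved under `metadata`.
---
name: improve-agent
metadata:
  author: example-team
---
```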