Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
72% (Does it follow best practices?)

- Impact: Pending (no eval scenarios have been run)
- Quality: Passed (no known issues)
Quality
Discovery
67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description is strong in specificity and distinctiveness, clearly naming the tools being compared and the metrics tracked. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know exactly when to select this skill. Adding a few more natural trigger terms (e.g., 'benchmark', 'evaluate', 'which coding agent is best') would also improve discoverability.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to benchmark, evaluate, or compare coding agents or AI coding assistants.'
Include additional natural trigger terms users might say, such as 'benchmark', 'evaluate', 'test', 'which coding tool is best', 'agent comparison', or 'AI coding assistant performance' (see the example description sketched below).
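A minimal sketch of a revised frontmatter description that folds in both suggestions; the skill name is a placeholder and the exact frontmatter layout is assumed rather than taken from the skill itself:

```yaml
# Hypothetical SKILL.md frontmatter; "coding-agent-comparison" is a placeholder name.
name: coding-agent-comparison
description: >-
  Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on
  custom tasks with pass rate, cost, time, and consistency metrics. Use when the
  user wants to benchmark, evaluate, or compare coding agents or AI coding
  assistants, or asks which coding tool or agent is best.
```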
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: head-to-head comparison of coding agents, custom tasks, and specific metrics (pass rate, cost, time, consistency). Names specific tools (Claude Code, Aider, Codex). | 3 / 3 |
| Completeness | Clearly answers 'what does this do' (head-to-head comparison of coding agents with specific metrics), but lacks an explicit 'Use when...' clause or equivalent trigger guidance, which caps this at 2 per the rubric. | 2 / 3 |
| Trigger Term Quality | Includes good natural keywords like 'coding agents', 'Claude Code', 'Aider', 'Codex', 'comparison', and metric names. However, it misses common user phrasings like 'benchmark', 'evaluate', 'test agents', 'which is better', or 'compare AI coding tools'. | 2 / 3 |
| Distinctiveness / Conflict Risk | Very distinct niche: benchmarking coding agents against each other with specific metrics. Unlikely to conflict with general coding skills, code review skills, or other tool-specific skills. | 3 / 3 |
| Total | | 10 / 12 (Passed) |
Implementation
64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured skill with strong actionability — the YAML examples, CLI commands, and output tables are concrete and immediately usable. The main weaknesses are the lack of validation/error-handling steps in the workflow and some verbosity in the conceptual explanations that Claude doesn't need. The best practices section adds genuine value with non-obvious guidance about trial counts and deterministic judges.
Suggestions
Add validation checkpoints to the workflow: e.g., how to validate task YAML before running, what to do when an agent run fails or times out, and how to verify worktree creation succeeded (see the sketch after this list).
Trim the introductory paragraph and 'Core Concepts > Git Worktree Isolation' section — Claude doesn't need the motivational framing or explanation of what worktree isolation provides.
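One way to act on the validation-checkpoint suggestion is to make failure handling explicit in the task definition itself. The sketch below is illustrative only: the field names (`timeout_minutes`, `max_retries`, `on_failure`) and the `tests_pass` judge type are assumptions, since the tool's real schema isn't reproduced in this review, and the YAML should still be linted (e.g., with yamllint) before a run.

```yaml
# Hypothetical task definition with explicit failure handling.
# timeout_minutes, max_retries, on_failure, and the tests_pass judge type are
# assumed field names, not confirmed parts of the tool's schema.
name: refactor-auth-module
prompt: "Refactor the auth module to use constructor injection and keep tests green"
judge:
  type: tests_pass               # assumed judge type; verify against the tool's docs
timeout_minutes: 30              # mark runs that exceed this as failed instead of hanging the comparison
max_retries: 1                   # retry once if the agent fails to start
on_failure: record_and_continue  # log the failure and move on to the next agent or trial
```

Checkpoints like these keep a single stuck agent from invalidating the whole comparison run.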
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is mostly efficient but includes some unnecessary framing ('Every comparison runs on vibes — this tool systematizes it') and explanatory text that could be trimmed. The 'Core Concepts' section explaining git worktree isolation and the metrics collected is somewhat verbose for what Claude needs to know to use the tool. | 2 / 3 |
| Actionability | Provides concrete, copy-paste-ready CLI commands, complete YAML task definitions, and specific examples for all judge types. The workflow steps include executable commands with realistic arguments and expected output. | 3 / 3 |
| Workflow Clarity | The 3-step workflow (define tasks → run agents → compare results) is clearly sequenced with concrete commands, but there are no validation checkpoints or error recovery steps. What happens if an agent fails to start, if worktree creation fails, or if judge criteria are malformed? No feedback loops cover these failure modes. | 2 / 3 |
| Progressive Disclosure | The content is reasonably structured with clear sections but somewhat monolithic; the judge types and best practices sections could be split into separate reference files. A single external link to the repository is present, but there are no pointers to supplementary docs for advanced configuration or troubleshooting. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |
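A hedged illustration of the suggested fix for the warning; the report does not name the offending key, so `version` below is a placeholder, as is the skill name:

```yaml
# Hypothetical frontmatter after the fix: the unrecognized top-level key
# (shown here as the placeholder "version") is moved under metadata,
# which clears the frontmatter_unknown_keys warning.
name: coding-agent-comparison
description: Head-to-head comparison of coding agents on custom tasks with pass rate, cost, time, and consistency metrics
metadata:
  version: "1.0"   # previously a top-level key; kept, but namespaced under metadata
```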