Name: experiments/eval-setup
Rating: 77.6 (1 review)
Author: experiments

Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins

Quality

97%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Quality

Content

92%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a high-quality skill with excellent actionability and workflow clarity. The 7-phase structure provides clear guidance through a complex multi-step process, with validation checkpoints and error recovery (retry on failure) where they are needed. The main weakness is length: the document would benefit from moving reference material into separate files for better progressive disclosure.

Suggestions

Consider moving the agent/model compatibility table to a separate AGENTS.md reference file to reduce main skill length

The Phase 5-7 content could be split into a separate RUNNING.md file, leaving SKILL.md to focus on setup (Phases 1-4) and link to the running and analysis material

Dimension	Reasoning	Score
Conciseness	The skill is lean and efficient, providing only necessary commands and context. It assumes Claude's competence with CLI tools and doesn't explain basic concepts like what evals are or how git commits work.	3 / 3
Actionability	Every phase includes specific, copy-paste ready CLI commands with clear flag syntax. The skill provides concrete examples like `--agent=claude:claude-sonnet-4-6` and exact file paths like `evals/*/task.md`.	3 / 3
Workflow Clarity	The 7-phase workflow is clearly sequenced with explicit validation checkpoints (polling for completion, verifying downloads, retry on failure). Each phase has numbered sub-steps and clear decision points with user confirmation.	3 / 3
Progressive Disclosure	The content is well-structured with clear phases and sections, but it's a long monolithic document (~200 lines) that could benefit from splitting detailed reference content (like the agent/model table) into separate files. The companion skill reference is good but inline content is heavy.	2 / 3
	Total	11 / 12 Passed
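
The 92% shown for Content is consistent with the table total: 11 of 12 points is about 91.7%, which rounds to 92%. A minimal sketch of that arithmetic follows. The function name `section_percent` is hypothetical, and plain rounding of awarded over maximum is an assumption, since the page does not state how the percentage is derived.

```python
# Dimension scores copied from the table above: (awarded, maximum).
scores = {
    "Conciseness": (3, 3),
    "Actionability": (3, 3),
    "Workflow Clarity": (3, 3),
    "Progressive Disclosure": (2, 3),
}


def section_percent(scores):
    """Return (awarded, maximum, rounded percent) for a set of dimension scores.

    Assumption: the percentage is awarded / maximum, rounded to the nearest
    whole number. The page does not document its actual rounding rule.
    """
    awarded = sum(a for a, _ in scores.values())
    maximum = sum(m for _, m in scores.values())
    return awarded, maximum, round(100 * awarded / maximum)


awarded, maximum, percent = section_percent(scores)
print(f"{awarded} / {maximum} = {percent}%")
```

Under the same assumption, the Description table's 12 / 12 gives 100%.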

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong, well-crafted skill description that excels across all dimensions. It provides specific concrete actions, includes natural trigger terms users would say, explicitly addresses both what the skill does and when to use it, and has a clear distinctive niche around Tessl tile evaluation that minimizes conflict risk with other skills.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: 'Generates eval scenarios from repo commit history', 'configures single or multi-agent runs', 'executes baseline and with-context evals', 'compares results across agents and scenarios'. These are detailed, actionable capabilities.	3 / 3
Completeness	Clearly answers both what (sets up eval pipelines, generates scenarios, configures runs, executes evals, compares results) and when, with an explicit 'Use when...' clause covering multiple trigger scenarios, including 'setting up evaluation pipelines', 'benchmarking tile performance', and 'generating test scenarios from git history'.	3 / 3
Trigger Term Quality	Includes natural keywords users would say: 'evaluation pipelines', 'benchmarking', 'tile performance', 'agent performance', 'test scenarios', 'git history', 'performance assessments', 'historical commits'. Good coverage of domain-specific terms users would naturally use.	3 / 3
Distinctiveness Conflict Risk	Highly specific niche around 'Tessl tiles' evaluation with distinct triggers like 'eval pipelines', 'tile performance', 'multi-agent runs', and 'with-context evals'. The domain-specific terminology (Tessl tiles, baseline vs with-context evals) makes conflicts with other skills unlikely.	3 / 3
	Total	12 / 12 Passed

Validation

100%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed

4 months ago
