Name: tessl-labs/eval-setup
Rating: 90.1 (1 review)
Author: tessl-labs

Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins

Quality

90%

Does it follow best practices?

Impact

91%

3.37x

Average score across 2 eval scenarios

Security

Advisory

Suggest reviewing before use

Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-crafted skill description that excels across all dimensions. It provides specific concrete actions, uses natural terminology that ML practitioners would use, includes an explicit 'Use when...' clause with multiple trigger scenarios, and occupies a distinct niche that combines evaluation pipelines with multi-agent and git-based workflows.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: 'Generate eval scenarios from repo commits', 'configure multi-agent runs', 'execute baseline + with-context evals', and 'compare results'. These are distinct, actionable capabilities.	3 / 3
Completeness	Clearly answers both what (generate eval scenarios, configure runs, execute evals, compare results) and when, with an explicit 'Use when...' clause covering multiple trigger scenarios (evaluation pipelines, benchmarks, agent performance comparison, test scenario generation).	3 / 3
Trigger Term Quality	Includes natural keywords users would say: 'evaluation pipelines', 'running benchmarks', 'comparing agent performance', 'test scenarios', 'git history', 'models'. Good coverage of terms an ML/AI practitioner would use.	3 / 3
Distinctiveness Conflict Risk	Highly specific niche combining evaluation/benchmarking with multi-agent systems and git-based scenario generation. The combination of 'eval scenarios from repo commits' and 'multi-agent runs' creates a distinct identity unlikely to conflict with general testing or git skills.	3 / 3
	Total	12 / 12 Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with excellent workflow clarity and concrete executable commands throughout. The main weaknesses are moderate verbosity (explaining concepts like what makes good commits in detail) and keeping all content in one large file rather than using progressive disclosure to reference files. The validation checkpoints and error recovery guidance are particularly strong.

Suggestions

Trim the commit selection guidance in Phase 2 - Claude already knows what makes a substantive commit vs a trivial one; a brief list of skip/prioritize criteria would suffice

Move the agent/model table and quality-check anti-patterns to separate reference files (e.g., AGENTS.md, QUALITY_CHECKS.md) and link to them from the main skill

Dimension	Reasoning	Score
Conciseness	The skill is comprehensive but includes some unnecessary verbosity, such as explaining what good/bad commits look like in detail and repeating time expectations multiple times. Some sections could be tightened without losing clarity.	2 / 3
Actionability	Provides fully executable bash commands throughout, specific CLI syntax with all required flags, concrete examples of output formats, and copy-paste ready commands for every step of the workflow.	3 / 3
Workflow Clarity	Excellent multi-phase workflow with clear sequencing (7 phases), explicit validation checkpoints (verify download, quality-check scenarios, poll for completion), and feedback loops (retry on failure, offer to regenerate). Each phase has numbered sub-steps.	3 / 3
Progressive Disclosure	Content is well-structured with clear phases and sections, but everything is in a single monolithic file. The companion skill reference is good, but detailed content like the agent/model table and quality-check anti-patterns could be split into reference files.	2 / 3
	Total	10 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed

about 1 month ago
