tessl-labs/eval-setup

Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins

Quality: 90%. Does it follow best practices?

Impact: 91% (3.37x). Average score across 2 eval scenarios.

Security by Snyk: Advisory. Suggest reviewing before use.


Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-crafted skill description that excels across all dimensions. It provides specific concrete actions, uses natural terminology that ML practitioners would use, includes an explicit 'Use when...' clause with multiple trigger scenarios, and occupies a distinct niche that combines evaluation pipelines with multi-agent and git-based workflows.

Dimension (score): reasoning

Specificity (3 / 3): Lists multiple specific concrete actions: 'Generate eval scenarios from repo commits', 'configure multi-agent runs', 'execute baseline + with-context evals', and 'compare results'. These are distinct, actionable capabilities.

Completeness (3 / 3): Clearly answers both what (generate eval scenarios, configure runs, execute evals, compare results) and when, with an explicit 'Use when...' clause covering multiple trigger scenarios (evaluation pipelines, benchmarks, agent performance comparison, test scenario generation).

Trigger Term Quality (3 / 3): Includes natural keywords users would say: 'evaluation pipelines', 'running benchmarks', 'comparing agent performance', 'test scenarios', 'git history', 'models'. Good coverage of terms an ML/AI practitioner would use.

Distinctiveness / Conflict Risk (3 / 3): Highly specific niche combining evaluation/benchmarking with multi-agent systems and git-based scenario generation. The combination of 'eval scenarios from repo commits' and 'multi-agent runs' creates a distinct identity unlikely to conflict with general testing or git skills.

Total: 12 / 12 (Passed)

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with excellent workflow clarity and concrete executable commands throughout. The main weaknesses are moderate verbosity (for example, explaining in detail what makes a good commit) and keeping all content in one large file rather than using progressive disclosure, with detail moved into separate reference files. The validation checkpoints and error recovery guidance are particularly strong.

Suggestions

Trim the commit selection guidance in Phase 2: Claude already knows what makes a substantive commit versus a trivial one, so a brief list of skip/prioritize criteria would suffice

Move the agent/model table and quality-check anti-patterns to separate reference files (e.g., AGENTS.md, QUALITY_CHECKS.md) and link to them from the main skill

Dimension (score): reasoning

Conciseness (2 / 3): The skill is comprehensive but includes some unnecessary verbosity, such as explaining what good/bad commits look like in detail and repeating time expectations multiple times. Some sections could be tightened without losing clarity.

Actionability (3 / 3): Provides fully executable bash commands throughout, specific CLI syntax with all required flags, concrete examples of output formats, and copy-paste ready commands for every step of the workflow.

Workflow Clarity (3 / 3): Excellent multi-phase workflow with clear sequencing (7 phases), explicit validation checkpoints (verify download, quality-check scenarios, poll for completion), and feedback loops (retry on failure, offer to regenerate). Each phase has numbered sub-steps.

Progressive Disclosure (2 / 3): Content is well-structured with clear phases and sections, but everything is in a single monolithic file. The companion skill reference is good, but detailed content like the agent/model table and quality-check anti-patterns could be split into reference files.

Total: 10 / 12 (Passed)
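
As a rough illustration of how the per-dimension scores above add up to the reported totals, here is a minimal sketch. It assumes each dimension is scored out of 3 and that the total is a plain sum; the function name `rubric_total` is hypothetical and not part of any Tessl tooling. The headline percentages (for example 77% for Implementation) are not simply the total divided by the maximum, and this page does not state how they are derived, so the sketch reproduces only the totals.

```python
from typing import Dict, Tuple

MAX_PER_DIMENSION = 3  # assumption: every dimension is scored out of 3


def rubric_total(scores: Dict[str, int]) -> Tuple[int, int]:
    """Return (points earned, points possible) for one rubric."""
    for name, score in scores.items():
        if not 0 <= score <= MAX_PER_DIMENSION:
            raise ValueError(f"{name}: score {score} is outside 0..{MAX_PER_DIMENSION}")
    return sum(scores.values()), MAX_PER_DIMENSION * len(scores)


# Scores copied from the Discovery and Implementation reviews on this page.
discovery = {
    "Specificity": 3,
    "Completeness": 3,
    "Trigger Term Quality": 3,
    "Distinctiveness / Conflict Risk": 3,
}
implementation = {
    "Conciseness": 2,
    "Actionability": 3,
    "Workflow Clarity": 3,
    "Progressive Disclosure": 2,
}

for label, scores in (("Discovery", discovery), ("Implementation", implementation)):
    earned, possible = rubric_total(scores)
    print(f"{label}: {earned} / {possible}")
```

Running it prints the two totals shown in the reviews, 12 / 12 and 10 / 12.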

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.
