Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal
Overall: 96
Quality: 97% (Does it follow best practices?)
Impact: 96% (1.65x average score across 3 eval scenarios)
Passed, no known issues
Quality
Discovery: 100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that excels across all dimensions. It clearly specifies concrete actions (running evals, comparing results, identifying gaps), includes natural trigger terms users would actually say, explicitly states both what the skill does and when to use it, and occupies a distinct niche that won't conflict with other skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: 'Run task evals', 'compare results side-by-side', 'identify model-specific gaps', 'validate a skill before publishing'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what ('Run task evals across multiple Claude models and compare results side-by-side') and when ('Use when you want to understand how a skill performs across different models, identify model-specific gaps versus universal tile issues, or validate a skill before publishing'). | 3 / 3 |
| Trigger Term Quality | Includes natural keywords users would say: 'task evals', 'Claude models', 'compare', 'model-specific gaps', 'skill', 'registry', 'publishing'. Good coverage of the domain-specific terms a user working with evals would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive niche combining 'task evals', 'multiple Claude models', 'side-by-side comparison', and 'skill registry'. Unlikely to conflict with other skills due to the specific combination of eval comparison and model testing context. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation: 92%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a high-quality skill with excellent actionability and workflow clarity. The 7-phase structure with explicit validation checkpoints, error handling, and clear decision points makes it easy to follow. The skill efficiently assumes Claude's competence while providing concrete commands and output formats throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is lean and efficient throughout. It assumes Claude's competence with bash commands, eval concepts, and table formatting. No unnecessary explanations of what evals are or how Tessl works; every section delivers actionable content. | 3 / 3 |
| Actionability | Fully executable commands throughout, with specific syntax (e.g., `tessl eval run <path/to/tile> --agent=claude:<model>`; see the sketch after this table). Clear, copy-paste-ready examples for finding tiles, generating scenarios, polling, and publishing. Output table formats are concrete and specified. | 3 / 3 |
| Workflow Clarity | Excellent multi-phase workflow with clear sequencing (7 phases). Includes explicit validation checkpoints (verify scenarios exist, verify login, poll for completion), error handling (retry failed runs), and decision points (confirm models, number of runs). Feedback loops are present for failures. | 3 / 3 |
| Progressive Disclosure | Content is well organized with clear phases and subsections, but it is a monolithic document (~200 lines) with no references to external files. The diagnosis patterns and table formats could be split into reference files. However, for a procedural skill of this length, inline content is reasonable. | 2 / 3 |
| Total | | 11 / 12 Passed |
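
For context, the Actionability row quotes the skill's eval command. A minimal sketch of how that pattern supports the side-by-side comparison the skill is built around, assuming only the syntax quoted above; the tile path and model identifiers are illustrative placeholders, not values from this report:

```bash
# Sketch only: the command pattern is the one quoted in the Actionability row.
# <path/to/tile>, <model-a>, and <model-b> are placeholders.
tessl eval run <path/to/tile> --agent=claude:<model-a>   # run the task evals against one model
tessl eval run <path/to/tile> --agent=claude:<model-b>   # repeat for a second model
```

Each invocation targets a single model, so running the same tile against two models yields the pair of results the skill then compares to separate model-specific gaps from universal ones.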
Validation: 100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 Passed
Validation for skill structure: no warnings or errors.