Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal.

**Overall score: 96** · **Passed** (no known issues)

| Metric  | Score | Notes                                       |
| ------- | ----- | ------------------------------------------- |
| Quality | 97%   | Does it follow best practices?              |
| Impact  | 96%   | 1.65x average score across 3 eval scenarios |

Each criterion below is scored twice: once for the baseline agent (without the skill) and once with the skill installed.
### Pre-eval setup verification

| Criterion                    | Baseline | With skill |
| ---------------------------- | -------- | ---------- |
| Excludes .tessl cache        | 0%       | 90%        |
| .tessl/tiles warning         | 0%       | 90%        |
| Scenario existence check     | 30%      | 100%       |
| Scenario generation guidance | 0%       | 100%       |
| Login verification           | 30%      | 100%       |
| No --workspace flag          | 100%     | 100%       |
| Default model names          | 0%       | 100%       |
| Model subset confirmation    | 0%       | 25%        |
| Time estimate provided       | 16%      | 100%       |
| Run count option             | 0%       | 100%       |
### Sequential multi-model eval execution

| Criterion                  | Baseline | With skill |
| -------------------------- | -------- | ---------- |
| Correct base command       | 100%     | 100%       |
| --agent flag format        | 70%      | 100%       |
| All three default models   | 100%     | 100%       |
| Sequential execution       | 33%      | 100%       |
| Run ID capture             | 100%     | 100%       |
| Model-to-ID mapping        | 100%     | 100%       |
| Monitoring URL output      | 25%      | 100%       |
| Polls with tessl eval view | 100%     | 100%       |
| Retry on failure           | 100%     | 100%       |
| Waits for all to complete  | 100%     | 100%       |
| No --workspace flag        | 100%     | 100%       |
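This scenario grades the launch loop itself: one run per model, started sequentially, with run IDs captured, mapped back to their model, polled via `tessl eval view`, and retried on failure. A sketch of that loop follows; only `tessl eval view`, the `--agent` flag, and the rule against `--workspace` come from the report, so the `tessl eval run` subcommand, the model names, the run-ID output format, and the status wording are all assumptions.

```python
import re
import subprocess
import time

# Hypothetical identifiers; the report says "all three default models"
# without naming them.
MODELS = ["claude-opus", "claude-sonnet", "claude-haiku"]

def launch(model: str) -> str:
    """Start one eval run and return its run ID (assumed output format)."""
    # Assumed base command; the report confirms an --agent flag and that
    # --workspace must not be passed, but not the exact subcommand.
    out = subprocess.run(
        ["tessl", "eval", "run", "--agent", model],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)  # surfaces the monitoring URL if the CLI prints one
    match = re.search(r"run[ -]?id:?\s*(\S+)", out, re.IGNORECASE)
    if match is None:
        raise RuntimeError(f"no run ID found in output:\n{out}")
    return match.group(1)

def wait_for(run_id: str, interval: float = 30.0) -> None:
    """Poll `tessl eval view` until the run leaves a running state."""
    while True:
        view = subprocess.run(
            ["tessl", "eval", "view", run_id],
            capture_output=True, text=True,
        ).stdout
        if "running" not in view.lower():  # assumed status wording
            return
        time.sleep(interval)

run_ids: dict[str, str] = {}  # model -> run ID, kept for the comparison step
for model in MODELS:
    for attempt in (1, 2):  # one retry on a failed launch
        try:
            run_ids[model] = launch(model)
            break
        except (subprocess.CalledProcessError, RuntimeError):
            if attempt == 2:
                raise
    wait_for(run_ids[model])  # sequential: finish before the next model starts
```

Keeping the model-to-ID mapping explicit is what lets the comparison step label each column of its report.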
### Multi-model comparison analysis and reporting

| Criterion                         | Baseline | With skill |
| --------------------------------- | -------- | ---------- |
| Overall summary table             | 100%     | 100%       |
| Per-scenario breakdown            | 100%     | 100%       |
| Per-criterion table               | 100%     | 100%       |
| Correct symbol thresholds         | 0%       | 70%        |
| Baseline interpretation           | 100%     | 100%       |
| Pattern A identified              | 90%      | 100%       |
| Pattern B identified              | 80%      | 100%       |
| Pattern D identified              | 100%     | 100%       |
| Fix before publish recommendation | 100%     | 100%       |
| eval-improve mentioned            | 0%       | 100%       |
| Re-run offer                      | 100%     | 100%       |
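The final scenario grades the shape of the written comparison: an overall summary, per-scenario and per-criterion tables with threshold symbols, an interpretation of the baseline, and named gap patterns leading to a fix-before-publish recommendation (with `eval-improve` mentioned and a re-run offered). The sketch below shows one plausible shape for the per-criterion table and the model-specific-versus-universal split; the threshold values, symbols, and classification rule are illustrative assumptions, and the report's pattern labels A, B, and D are not defined here.

```python
# Per-criterion scores by model; illustrative data standing in for what
# `tessl eval view` would report for each captured run ID.
scores = {
    "Correct base command": {"haiku": 60, "opus": 100, "sonnet": 100},
    "Sequential execution": {"haiku": 30, "opus": 40, "sonnet": 35},
    "Run ID capture":       {"haiku": 100, "opus": 100, "sonnet": 100},
}

def symbol(score: int) -> str:
    """Map a score to a table symbol (thresholds are assumptions)."""
    if score >= 90:
        return "✓"
    if score >= 50:
        return "~"
    return "✗"

def classify(per_model: dict[str, int]) -> str:
    """Split universal gaps from model-specific ones (assumed rule)."""
    failing = [m for m, s in per_model.items() if s < 90]
    if not failing:
        return "passing everywhere"
    if len(failing) == len(per_model):
        return "universal gap: fix the skill before publishing"
    return "model-specific gap: " + ", ".join(sorted(failing))

models = sorted(next(iter(scores.values())))
print("| Criterion | " + " | ".join(models) + " | Pattern |")
print("|---" * (len(models) + 2) + "|")
for criterion, per_model in scores.items():
    cells = " | ".join(symbol(per_model[m]) for m in models)
    print(f"| {criterion} | {cells} | {classify(per_model)} |")
```

A criterion that fails on every model points at the skill itself, while one that fails on a single model is a candidate for model-specific guidance; that split is exactly what the report's pattern criteria are checking for.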