tessl-labs/review-model-performance

Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal

Quality: 97% (Does it follow best practices?)

Impact: 96%, 1.65x (average score across 3 eval scenarios)

Security: Passed (no known issues)

Multi-Model Tile Benchmark Automation

Name: tessl-labs/review-model-performance
Rating: 0.968 (1 review)
Author: tessl-labs

Problem Description

A platform team maintains a growing library of Tessl tiles and needs to reliably benchmark each tile's skill quality across the range of Claude models their customers use. The process is currently manual: engineers run commands one at a time and lose track of job IDs when multiple models are running. Benchmarks are sometimes kicked off in parallel by mistake, causing noisy results and billing surprises, and a failed job often goes unnoticed until the engineer checks back hours later.

The team wants a single, self-contained shell script that automates the full execution phase of a multi-model benchmark: starting all model runs in the correct order, tracking each job, providing links for real-time monitoring, polling until every run finishes, and handling job failures automatically. The tile being benchmarked is located at ./analytics-tile and already has eval scenarios in place.
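The control flow described above (sequential starts, job tracking, polling, one retry on failure) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the expected solution: `start_run` and `job_status` are hypothetical placeholders, since the source does not give the tessl CLI commands for starting a run or checking a job, and the status strings `running`, `completed`, and `failed` are likewise assumed. Monitoring links are left out because the source does not specify their format.

```bash
#!/usr/bin/env bash
# Sketch only: start_run and job_status are hypothetical placeholders.
# Replace their bodies with the real tessl CLI invocations for your setup.
set -u

POLL_INTERVAL="${POLL_INTERVAL:-30}"   # seconds between status checks

# Hypothetical: start an eval run for model $1 and print its job ID.
start_run() { echo "start_run is not wired to the tessl CLI yet" >&2; return 1; }

# Hypothetical: print "running", "completed", or "failed" for job ID $1.
job_status() { echo "job_status is not wired to the tessl CLI yet" >&2; return 1; }

# Poll one job until it reaches a terminal state; return 0 only on success.
wait_for_job() {
  local job_id="$1" status
  while true; do
    status="$(job_status "$job_id")" || return 1
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
      *)         sleep "$POLL_INTERVAL" ;;
    esac
  done
}

# Run one model to completion, retrying a failed run exactly once.
run_model() {
  local model="$1" attempt job_id
  for attempt in 1 2; do
    if job_id="$(start_run "$model")"; then
      echo "[$model] attempt $attempt started, job $job_id"
      if wait_for_job "$job_id"; then
        echo "[$model] job $job_id completed"
        return 0
      fi
      echo "[$model] job $job_id failed" >&2
    else
      echo "[$model] attempt $attempt could not be started" >&2
    fi
  done
  return 1
}

# Run models one after another (never in parallel).
# Returns non-zero if any model still fails after its retry.
run_all() {
  local model failed=0
  for model in "$@"; do
    run_model "$model" || failed=1
  done
  return "$failed"
}
```

Once the two placeholders are wired up, a script could end with something like `run_all model-a model-b || exit 1`, where the model names are illustrative. Because each model is waited on before the next one starts, runs cannot overlap by accident.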

Output Specification

Produce one file:

benchmark.sh: an executable shell script that runs the full benchmark. It should print progress updates as each job starts and completes, and exit with a non-zero status if any run cannot be completed successfully even after a retry attempt.

The script should be written for bash and should work on a standard Linux environment with the tessl CLI available on PATH.
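The environment assumptions stated here (the tessl CLI on PATH, the tile at ./analytics-tile) could be checked up front so the script fails early with a clear message. The helper names `require_cmd` and `require_dir` below are illustrative and not part of any tool; this is a sketch of one way to do the check, not a required part of the solution.

```bash
#!/usr/bin/env bash
# Preflight helpers (illustrative names, not part of the tessl CLI).

# Return 0 if command $1 is on PATH; otherwise report the problem and return 1.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: required command '$1' not found on PATH" >&2
    return 1
  fi
}

# Return 0 if directory $1 exists; otherwise report the problem and return 1.
require_dir() {
  if [ ! -d "$1" ]; then
    echo "error: directory '$1' does not exist" >&2
    return 1
  fi
}
```

Near the top of benchmark.sh this might be used as `require_cmd tessl || exit 1` followed by `require_dir ./analytics-tile || exit 1`. The file itself is made executable with `chmod +x benchmark.sh`.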

evals/
  scenario-1/
  scenario-2/
    rubric.json
    task.md
  scenario-3/
skills/
tile.json

evals/scenario-2/task.md
