
tessl-labs/review-model-performance

Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal

Quality: 97% (Does it follow best practices?)

Impact: 96%, 1.65x (Average score across 3 eval scenarios)

Security by Snyk: Passed (No known issues)
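
The Impact figure is described as the average score across 3 eval scenarios. As a minimal sketch of that arithmetic, assuming the headline is a plain unweighted mean of the three per-scenario scores reported under Evaluation results (92%, 100%, 97%), the calculation would look like this. The variable names are illustrative only, and the page does not state how the 1.65x multiplier is derived, so it is not reproduced here.

```python
from statistics import mean

# Per-scenario scores as reported under "Evaluation results".
# Assumption: the headline Impact score is their unweighted mean.
scenario_scores = {
    "Tile Eval Readiness Checker": 92,
    "Multi-Model Tile Benchmark Automation": 100,
    "Model Benchmark Comparison Report": 97,
}

average = mean(scenario_scores.values())
print(f"Average across {len(scenario_scores)} scenarios: {average:.1f}%")
```

Rounded to a whole number, this agrees with the 96% shown for Impact.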


Evaluation results

92%

76%

Tile Eval Readiness Checker

Pre-eval setup verification

Criteria                           Without context  With context
Excludes .tessl cache              0%               90%
.tessl/tiles warning               0%               90%
Scenario existence check           30%              100%
Scenario generation guidance       0%               100%
Login verification                 30%              100%
No --workspace flag                100%             100%
Default model names                0%               100%
Model subset confirmation          0%               25%
Time estimate provided             16%              100%
Run count option                   0%               100%
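
To see which criteria in this scenario benefit most from the added context, the rows can be ranked by the difference between the two columns. The sketch below uses the figures from the Pre-eval setup verification table; the unweighted column means it prints are an assumption for illustration and need not match the scenario score shown on the page, which may weight criteria differently.

```python
from statistics import mean

# (criterion, without-context %, with-context %), copied from the
# "Pre-eval setup verification" table.
rows = [
    ("Excludes .tessl cache", 0, 90),
    (".tessl/tiles warning", 0, 90),
    ("Scenario existence check", 30, 100),
    ("Scenario generation guidance", 0, 100),
    ("Login verification", 30, 100),
    ("No --workspace flag", 100, 100),
    ("Default model names", 0, 100),
    ("Model subset confirmation", 0, 25),
    ("Time estimate provided", 16, 100),
    ("Run count option", 0, 100),
]

# Unweighted column means (an assumption; the page may weight criteria).
without_mean = mean(r[1] for r in rows)
with_mean = mean(r[2] for r in rows)

# Rank criteria by percentage-point gain from adding context, largest first.
by_gain = sorted(rows, key=lambda r: r[2] - r[1], reverse=True)

print(f"Mean without context: {without_mean:.1f}%")
print(f"Mean with context:    {with_mean:.1f}%")
for name, without, with_ctx in by_gain:
    print(f"{name:<30} {with_ctx - without:+d} points")
```

Criteria near the bottom of the ranking are either already handled without context (No --workspace flag) or still weak with it (Model subset confirmation at 25%).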

100%

19%

Multi-Model Tile Benchmark Automation

Sequential multi-model eval execution

Criteria                           Without context  With context
Correct base command               100%             100%
--agent flag format                70%              100%
All three default models           100%             100%
Sequential execution               33%              100%
Run ID capture                     100%             100%
Model-to-ID mapping                100%             100%
Monitoring URL output              25%              100%
Polls with tessl eval view         100%             100%
Retry on failure                   100%             100%
Waits for all to complete          100%             100%
No --workspace flag                100%             100%

97%

18%

Model Benchmark Comparison Report

Multi-model comparison analysis and reporting

Criteria                           Without context  With context
Overall summary table              100%             100%
Per-scenario breakdown             100%             100%
Per-criterion table                100%             100%
Correct symbol thresholds          0%               70%
Baseline interpretation            100%             100%
Pattern A identified               90%              100%
Pattern B identified               80%              100%
Pattern D identified               100%             100%
Fix before publish recommendation  100%             100%
eval-improve mentioned             0%               100%
Re-run offer                       100%             100%

Evaluated

Agent: Claude Code
Model: Claude Sonnet 4.6
