Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal.

**Overall score: 96** · **Passed** (no known issues)

| Metric  | Score | Notes                                       |
| ------- | ----- | ------------------------------------------- |
| Quality | 97%   | Does it follow best practices?              |
| Impact  | 96%   | 1.65x average score across 3 eval scenarios |

Each criterion below is scored twice: once for the baseline agent (without the skill) and once with the skill installed.
### Pre-eval setup verification

| Criterion                    | Baseline | With skill |
| ---------------------------- | -------- | ---------- |
| Excludes .tessl cache        | 0%       | 90%        |
| .tessl/tiles warning         | 0%       | 90%        |
| Scenario existence check     | 30%      | 100%       |
| Scenario generation guidance | 0%       | 100%       |
| Login verification           | 30%      | 100%       |
| No --workspace flag          | 100%     | 100%       |
| Default model names          | 0%       | 100%       |
| Model subset confirmation    | 0%       | 25%        |
| Time estimate provided       | 16%      | 100%       |
| Run count option             | 0%       | 100%       |
### Sequential multi-model eval execution

| Criterion                  | Baseline | With skill |
| -------------------------- | -------- | ---------- |
| Correct base command       | 100%     | 100%       |
| --agent flag format        | 70%      | 100%       |
| All three default models   | 100%     | 100%       |
| Sequential execution       | 33%      | 100%       |
| Run ID capture             | 100%     | 100%       |
| Model-to-ID mapping        | 100%     | 100%       |
| Monitoring URL output      | 25%      | 100%       |
| Polls with tessl eval view | 100%     | 100%       |
| Retry on failure           | 100%     | 100%       |
| Waits for all to complete  | 100%     | 100%       |
| No --workspace flag        | 100%     | 100%       |
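This scenario grades the launch loop itself: one run per model, started sequentially, with run IDs captured, mapped back to their model, polled via `tessl eval view`, and retried on failure. A sketch of that loop follows; only `tessl eval view`, the `--agent` flag, and the rule against `--workspace` come from the report, so the `tessl eval run` subcommand, the model names, the run-ID output format, and the status wording are all assumptions.

```python
import re
import subprocess
import time

# Hypothetical identifiers; the report says "all three default models"
# without naming them.
MODELS = ["claude-opus", "claude-sonnet", "claude-haiku"]

def launch(model: str) -> str:
    """Start one eval run and return its run ID (assumed output format)."""
    # Assumed base command; the report confirms an --agent flag and that
    # --workspace must not be passed, but not the exact subcommand.
    out = subprocess.run(
        ["tessl", "eval", "run", "--agent", model],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)  # surfaces the monitoring URL if the CLI prints one
    match = re.search(r"run[ -]?id:?\s*(\S+)", out, re.IGNORECASE)
    if match is None:
        raise RuntimeError(f"no run ID found in output:\n{out}")
    return match.group(1)

def wait_for(run_id: str, interval: float = 30.0) -> None:
    """Poll `tessl eval view` until the run leaves a running state."""
    while True:
        view = subprocess.run(
            ["tessl", "eval", "view", run_id],
            capture_output=True, text=True,
        ).stdout
        if "running" not in view.lower():  # assumed status wording
            return
        time.sleep(interval)

run_ids: dict[str, str] = {}  # model -> run ID, kept for the comparison step
for model in MODELS:
    for attempt in (1, 2):  # one retry on a failed launch
        try:
            run_ids[model] = launch(model)
            break
        except (subprocess.CalledProcessError, RuntimeError):
            if attempt == 2:
                raise
    wait_for(run_ids[model])  # sequential: finish before the next model starts
```

Keeping the model-to-ID mapping explicit is what lets the comparison step label each column of its report.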
### Multi-model comparison analysis and reporting

| Criterion                         | Baseline | With skill |
| --------------------------------- | -------- | ---------- |
| Overall summary table             | 100%     | 100%       |
| Per-scenario breakdown            | 100%     | 100%       |
| Per-criterion table               | 100%     | 100%       |
| Correct symbol thresholds         | 0%       | 70%        |
| Baseline interpretation           | 100%     | 100%       |
| Pattern A identified              | 90%      | 100%       |
| Pattern B identified              | 80%      | 100%       |
| Pattern D identified              | 100%     | 100%       |
| Fix before publish recommendation | 100%     | 100%       |
| eval-improve mentioned            | 0%       | 100%       |
| Re-run offer                      | 100%     | 100%       |
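The final scenario grades the shape of the written comparison: an overall summary, per-scenario and per-criterion tables with threshold symbols, an interpretation of the baseline, and named gap patterns leading to a fix-before-publish recommendation (with `eval-improve` mentioned and a re-run offered). The sketch below shows one plausible shape for the per-criterion table and the model-specific-versus-universal split; the threshold values, symbols, and classification rule are illustrative assumptions, and the report's pattern labels A, B, and D are not defined here.

```python
# Per-criterion scores by model; illustrative data standing in for what
# `tessl eval view` would report for each captured run ID.
scores = {
    "Correct base command": {"haiku": 60, "opus": 100, "sonnet": 100},
    "Sequential execution": {"haiku": 30, "opus": 40, "sonnet": 35},
    "Run ID capture":       {"haiku": 100, "opus": 100, "sonnet": 100},
}

def symbol(score: int) -> str:
    """Map a score to a table symbol (thresholds are assumptions)."""
    if score >= 90:
        return "✓"
    if score >= 50:
        return "~"
    return "✗"

def classify(per_model: dict[str, int]) -> str:
    """Split universal gaps from model-specific ones (assumed rule)."""
    failing = [m for m, s in per_model.items() if s < 90]
    if not failing:
        return "passing everywhere"
    if len(failing) == len(per_model):
        return "universal gap: fix the skill before publishing"
    return "model-specific gap: " + ", ".join(sorted(failing))

models = sorted(next(iter(scores.values())))
print("| Criterion | " + " | ".join(models) + " | Pattern |")
print("|---" * (len(models) + 2) + "|")
for criterion, per_model in scores.items():
    cells = " | ".join(symbol(per_model[m]) for m in models)
    print(f"| {criterion} | {cells} | {classify(per_model)} |")
```

A criterion that fails on every model points at the skill itself, while one that fails on a single model is a candidate for model-specific guidance; that split is exactly what the report's pattern criteria are checking for.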