CtrlK
BlogDocsLog inGet started
Tessl Logo

tessl-labs/review-model-performance

Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal

96

1.65x

Quality

97%

Does it follow best practices?

Impact

96%

1.65x

Average score across 3 eval scenarios

SecuritybySnyk

Passed

No known issues

Overview
Quality
Evals
Security
Files

rubric.jsonevals/scenario-2/

{
  "context": "Tests whether the agent correctly scripts or orchestrates a multi-model eval execution run, including using the correct command format, running models sequentially, capturing and tracking run IDs, providing monitoring URLs, polling for completion, and handling failures.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Correct base command",
      "description": "Uses tessl eval run as the base command (not tessl run eval, tessl eval start, or other variants)",
      "max_score": 8
    },
    {
      "name": "--agent flag format",
      "description": "Uses --agent=claude:<model> format (e.g., --agent=claude:claude-haiku-4-5), not --model or other flag names",
      "max_score": 10
    },
    {
      "name": "All three default models",
      "description": "Includes all three model identifiers: claude-haiku-4-5, claude-sonnet-4-6, and claude-opus-4-6",
      "max_score": 10
    },
    {
      "name": "Sequential execution",
      "description": "Models are run sequentially — there is no parallel execution (no & operator, no background jobs, no parallel/xargs constructs) between the three model runs",
      "max_score": 15
    },
    {
      "name": "Run ID capture",
      "description": "The script captures or stores the run ID returned from each tessl eval run invocation",
      "max_score": 10
    },
    {
      "name": "Model-to-ID mapping",
      "description": "Run IDs are stored mapped to their corresponding model names (e.g., as variables named by model, or in an associative structure)",
      "max_score": 8
    },
    {
      "name": "Monitoring URL output",
      "description": "The script outputs or constructs a monitoring URL using the pattern https://tessl.io/eval-runs/<id>",
      "max_score": 8
    },
    {
      "name": "Polls with tessl eval view",
      "description": "Uses tessl eval view <id> to check run status (not tessl status, tessl check, or other commands)",
      "max_score": 10
    },
    {
      "name": "Retry on failure",
      "description": "Handles run failure by calling tessl eval retry <id> (not by re-running tessl eval run from scratch)",
      "max_score": 8
    },
    {
      "name": "Waits for all to complete",
      "description": "Script waits until all three model runs reach Completed status before exiting or proceeding to any next step",
      "max_score": 7
    },
    {
      "name": "No --workspace flag",
      "description": "The eval run commands do NOT include a --workspace flag",
      "max_score": 6
    }
  ]
}

evals

tile.json