tessl-labs/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

94

1.30x

Quality: 89% (Does it follow best practices?)

Impact: 98%, 1.30x (average score across 7 eval scenarios)

Security (by Snyk): Passed, no known issues

evals/scenario-1/rubric.json

{
  "context": "Testing whether an agent following the eval-improve skill reads relevant tile files before fixing, proposes changes before applying them, makes minimal targeted edits, commits before re-running, and includes --workspace in the re-run command.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "reads_tile_files_before_fixing",
      "description": "Before proposing any fix, the agent reads the relevant tile files (SKILL.md, rules/*.md, docs/*.md) to understand the current content and identify the root cause.",
      "max_score": 2
    },
    {
      "name": "proposes_before_applying",
      "description": "The agent shows the user which file it will edit, what text it will add or change, and why this addresses the failing criterion — before making any edits.",
      "max_score": 2
    },
    {
      "name": "targeted_fix_not_rewrite",
      "description": "The agent makes minimal targeted edits (adding the missing backoff pattern, clarifying the auth URL instruction) without rewriting or restructuring sections that are working well.",
      "max_score": 2
    },
    {
      "name": "commits_before_rerun",
      "description": "The agent commits changes before running `tessl eval run`, since the eval picks up the committed version of files. Running eval before committing would not test the fixes.",
      "max_score": 3
    },
    {
      "name": "workspace_in_eval_run",
      "description": "The agent includes `--workspace <name>` in the `tessl eval run ./evals/` command.",
      "max_score": 1
    }
  ]
}
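
To illustrate how a `weighted_checklist` rubric like the one above might be turned into a single score, here is a minimal Python sketch. The aggregation rule (points earned divided by the sum of `max_score`) is an assumption, since the rubric does not state how Tessl combines the items. The `weighted_score` helper and the `awarded` grades are hypothetical and exist only for this example.

```python
import json

# Abbreviated copy of the checklist above: names and max_score values only.
RUBRIC_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "reads_tile_files_before_fixing", "max_score": 2},
    {"name": "proposes_before_applying", "max_score": 2},
    {"name": "targeted_fix_not_rewrite", "max_score": 2},
    {"name": "commits_before_rerun", "max_score": 3},
    {"name": "workspace_in_eval_run", "max_score": 1}
  ]
}
"""


def weighted_score(rubric, awarded):
    """Return (earned, possible, fraction) for a weighted checklist.

    Assumed aggregation: sum of awarded points over sum of max_score.
    Items missing from `awarded` count as 0.
    """
    possible = sum(item["max_score"] for item in rubric["checklist"])
    earned = 0
    for item in rubric["checklist"]:
        points = awarded.get(item["name"], 0)
        # Clamp each grade to the range [0, max_score].
        earned += max(0, min(points, item["max_score"]))
    return earned, possible, earned / possible


rubric = json.loads(RUBRIC_JSON)

# Hypothetical grades: the agent did everything except commit before re-running.
awarded = {
    "reads_tile_files_before_fixing": 2,
    "proposes_before_applying": 2,
    "targeted_fix_not_rewrite": 2,
    "commits_before_rerun": 0,
    "workspace_in_eval_run": 1,
}

earned, possible, fraction = weighted_score(rubric, awarded)
print(f"{earned}/{possible} = {fraction:.0%}")  # prints "7/10 = 70%"
```

Under this assumed rule, `commits_before_rerun` carries the most weight (3 of 10 points), so skipping the commit costs more than missing any other single criterion.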
