Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
94
Quality: 89% (Does it follow best practices?)
Impact: 98%, 1.30x (Average score across 7 eval scenarios)
Passed: No known issues
{
  "context": "Testing whether an agent following the eval-improve skill reads relevant tile files before fixing, proposes changes before applying them, makes minimal targeted edits, commits before re-running, and includes --workspace in the re-run command.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "reads_tile_files_before_fixing",
      "description": "Before proposing any fix, the agent reads the relevant tile files (SKILL.md, rules/*.md, docs/*.md) to understand the current content and identify the root cause.",
      "max_score": 2
    },
    {
      "name": "proposes_before_applying",
      "description": "The agent shows the user which file it will edit, what text it will add or change, and why this addresses the failing criterion — before making any edits.",
      "max_score": 2
    },
    {
      "name": "targeted_fix_not_rewrite",
      "description": "The agent makes minimal targeted edits (adding the missing backoff pattern, clarifying the auth URL instruction) without rewriting or restructuring sections that are working well.",
      "max_score": 2
    },
    {
      "name": "commits_before_rerun",
      "description": "The agent commits changes before running `tessl eval run`, since the eval picks up the committed version of files. Running eval before committing would not test the fixes.",
      "max_score": 3
    },
    {
      "name": "workspace_in_eval_run",
      "description": "The agent includes `--workspace <name>` in the `tessl eval run ./evals/` command.",
      "max_score": 1
    }
  ]
}
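As a rough illustration of how a weighted checklist like the one above could be tallied, here is a minimal Python sketch. It assumes the overall score is simply the points earned divided by the sum of the `max_score` values; the source does not state the actual aggregation rule, so this is a guess. The `weighted_score` function and the `awarded` mapping are hypothetical names introduced for this example and are not part of any Tessl API.

```python
# Criterion names and max_score values copied from the checklist above.
CHECKLIST = [
    {"name": "reads_tile_files_before_fixing", "max_score": 2},
    {"name": "proposes_before_applying", "max_score": 2},
    {"name": "targeted_fix_not_rewrite", "max_score": 2},
    {"name": "commits_before_rerun", "max_score": 3},
    {"name": "workspace_in_eval_run", "max_score": 1},
]


def weighted_score(checklist, awarded):
    """Return (earned, possible, fraction) for a weighted checklist.

    `awarded` maps criterion name -> points given by a grader (hypothetical
    input). Missing criteria count as 0, and awarded points are clamped to
    the range [0, max_score]. The aggregation rule itself is an assumption.
    """
    earned = 0
    possible = 0
    for item in checklist:
        cap = item["max_score"]
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, cap))
        possible += cap
    fraction = earned / possible if possible else 0.0
    return earned, possible, fraction


if __name__ == "__main__":
    # Hypothetical run: every criterion met except committing before the re-run.
    awarded = {
        "reads_tile_files_before_fixing": 2,
        "proposes_before_applying": 2,
        "targeted_fix_not_rewrite": 2,
        "commits_before_rerun": 0,
        "workspace_in_eval_run": 1,
    }
    print(weighted_score(CHECKLIST, awarded))
```

Under this assumed rule the five criteria total 10 points, so skipping the commit step alone would forfeit 3 of them, more than any other single criterion.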