tessl-labs/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

94

1.30x

Quality

89%

Does it follow best practices?

Impact

98%

1.30x

Average score across 7 eval scenarios

Security by Snyk

Passed

No known issues


evals/scenario-1/task.md

Analysis of eval results has identified two failing criteria:

  • "uses_retry_backoff" — Bucket B (tile gap): both baseline and with-context score 0/4. The skill never mentions the retry backoff pattern the rubric checks for.
  • "auth_url_capture" — Bucket D (regression): baseline was 3/5, with-context is 1/5. Something in the tile is actively confusing the agent on auth URL handling.
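
The two diagnoses above follow from comparing each criterion's baseline score with its with-context score. A minimal sketch of that comparison is shown here. The `CriterionScore` type, the `diagnose` function, and the exact thresholds are hypothetical illustrations, not part of the tile or of any Tessl API, and only the two buckets named above are covered.

```python
from dataclasses import dataclass


@dataclass
class CriterionScore:
    """Scores for one rubric criterion (hypothetical structure)."""
    name: str
    baseline: int      # score without the tile in context
    with_context: int  # score with the tile in context
    max_score: int


def diagnose(c: CriterionScore) -> str:
    """Label a criterion using only the two cases described above.

    Assumed rules (not taken from the tile itself):
      - with-context below baseline -> Bucket D (regression)
      - both scores are zero        -> Bucket B (tile gap)
    Anything else is left unclassified.
    """
    if c.with_context < c.baseline:
        return "D (regression)"
    if c.baseline == 0 and c.with_context == 0:
        return "B (tile gap)"
    return "unclassified"


# The two failing criteria from this scenario.
results = [
    CriterionScore("uses_retry_backoff", baseline=0, with_context=0, max_score=4),
    CriterionScore("auth_url_capture", baseline=3, with_context=1, max_score=5),
]

for c in results:
    print(f"{c.name}: {c.with_context}/{c.max_score} -> {diagnose(c)}")
```

The regression check runs first so that a criterion whose score drops with the tile loaded is always flagged as tile-induced, whatever its absolute score.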

The user says: "Go ahead and fix these. Show me what you're going to change before you make the edits."

Fix both issues, commit the changes, and re-run the evals.
