Name: experiments/eval-improve
Author: experiments

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.

Review — 71%

Does it follow best practices?

Evaluation — 100%

↑ 1.02x

Agent success when using this tile

Validation — 11 / 11 Passed

Validation for skill structure

{
  "context": "Tests whether the agent correctly classifies eval criteria into four distinct performance buckets using the 80% threshold rules, and whether the analysis includes the right diagnosis, priority signals, and action recommendations for each bucket type.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Bucket A: idempotency key",
      "description": "The 'Stripe idempotency key' criterion (13/15 = 87% with-context, significantly above 2/15 baseline) is classified as working well / no action needed",
      "max_score": 8
    },
    {
      "name": "Bucket B: webhook signature",
      "description": "The 'Webhook signature validation' criterion (5/10 = 50% with-context, low baseline 1/10) is classified as a tile gap that needs a fix",
      "max_score": 8
    },
    {
      "name": "Bucket C: HTTP status codes",
      "description": "The 'HTTP status code handling' criterion (9/10 baseline already high, >= 80% without tile) is classified as redundant / agents already know this",
      "max_score": 8
    },
    {
      "name": "Bucket B: currency precision",
      "description": "The 'Currency precision' criterion (7/15 = 47% with-context, low baseline 3/15) is classified as a tile gap",
      "max_score": 8
    },
    {
      "name": "Bucket D: API version pinning",
      "description": "The 'API version pinning' criterion (with-context 4/10 is LOWER than baseline 6/10) is classified as a regression",
      "max_score": 15
    },
    {
      "name": "Bucket D highest priority",
      "description": "The regression criterion (API version pinning) is explicitly marked as highest priority or most urgent to fix",
      "max_score": 10
    },
    {
      "name": "Bucket B diagnosis present",
      "description": "Both Bucket B criteria include a diagnosis (what the tile is likely missing) and a reference to which tile file should be updated",
      "max_score": 15
    },
    {
      "name": "Bucket C action suggested",
      "description": "The Bucket C criterion (HTTP status codes) is flagged for potential removal from criteria.json or for making the eval task harder",
      "max_score": 10
    },
    {
      "name": "Bucket A no-action",
      "description": "The Bucket A criterion (idempotency key) is noted as leave alone / no action needed / tile's strength",
      "max_score": 8
    },
    {
      "name": "80% threshold applied",
      "description": "Classifications are consistent with the 80% of max threshold (e.g., 87% passes Bucket A; 50%, 47% fall into Bucket B; 90% baseline triggers Bucket C)",
      "max_score": 10
    }
  ]
}
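
The bucket rules this rubric checks for can be sketched in a few lines of Python. This is a minimal illustration, not the tile's actual implementation: the `Criterion` class, the `classify` function, and the order in which the rules are tested (regression first, then redundancy, then the 80% threshold) are assumptions inferred from the checklist descriptions above. The scores are the ones quoted in the rubric.

```python
from dataclasses import dataclass

THRESHOLD = 0.80  # the "80% of max" rule named in the rubric


@dataclass
class Criterion:
    """Hypothetical container for one eval criterion's scores."""
    name: str
    with_context: int  # score with the tile installed
    baseline: int      # score without the tile
    max_score: int


def classify(c: Criterion) -> str:
    """Return the bucket letter (A-D) for one criterion.

    Rule order is an assumption: regressions are checked first
    because the rubric treats them as highest priority.
    """
    with_ratio = c.with_context / c.max_score
    base_ratio = c.baseline / c.max_score
    if c.with_context < c.baseline:
        return "D"  # regression: the tile lowers the score
    if base_ratio >= THRESHOLD:
        return "C"  # redundant: agents already pass without the tile
    if with_ratio >= THRESHOLD:
        return "A"  # working well: no action needed
    return "B"      # tile gap: needs a fix


# Scores as quoted in the rubric descriptions.
criteria = [
    Criterion("Stripe idempotency key", 13, 2, 15),
    Criterion("Webhook signature validation", 5, 1, 10),
    Criterion("Currency precision", 7, 3, 15),
    Criterion("API version pinning", 4, 6, 10),
]
buckets = {c.name: classify(c) for c in criteria}
```

The rubric gives only the baseline (9/10) for "HTTP status code handling", so it is left out of the list. Under these assumed rules, any with-context score at or above that baseline would place it in Bucket C.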

Install with Tessl CLI

npx tessl i experiments/eval-improve@0.4.0

evals

scenario-1

rubric.json

task.md

scenario-2

scenario-3

scenario-4

scenario-5

skills

README.md

tile.json
