Name: experiments/eval-improve
Author: experiments

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.

Review — 71%

Does it follow best practices?

Evaluation — 100%

↑ 1.02x

Agent success when using this tile

Validation — 11 / 11 Passed

Validation for skill structure

{
  "context": "Tests whether the agent correctly identifies a regression as caused by ambiguous or contradictory content in the tile (rather than missing content), and proposes a removal or clarification fix rather than adding more emphasis on the failing behavior.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Contradicting clause identified",
      "description": "The analysis identifies the 'documentation-only or clearly trivial... you may skip the test run' sentence (or equivalent 'at your discretion' exception) as the specific cause of the regression",
      "max_score": 22
    },
    {
      "name": "Contradiction mechanism explained",
      "description": "The analysis explains that this clause contradicts 'Always run the full test suite' and gives agents a justification to skip tests — not just that 'it's confusing'",
      "max_score": 18
    },
    {
      "name": "Remove/clarify approach taken",
      "description": "The proposed fix removes or substantially rewrites the exception clause — NOT just adding another 'always run tests' statement elsewhere to compensate",
      "max_score": 22
    },
    {
      "name": "Specific text targeted",
      "description": "The fix targets the specific problematic sentence(s), not a broad rewrite of the entire Pre-Review Checklist section",
      "max_score": 18
    },
    {
      "name": "No compensating additions",
      "description": "The updated SKILL.md does NOT add a new reinforcing instruction (e.g., 'Note: never skip tests even for trivial changes') alongside or instead of removing the contradictory clause",
      "max_score": 10
    },
    {
      "name": "Other sections preserved",
      "description": "The Submitting for Review and Responding to Feedback sections are not modified in the updated SKILL.md",
      "max_score": 7
    },
    {
      "name": "Pre-review list intact",
      "description": "The numbered list structure of the Pre-Review Checklist is maintained in the updated file — the fix does not replace the entire checklist with prose",
      "max_score": 3
    }
  ]
}
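To make the weighting concrete, here is a minimal sketch of how a `weighted_checklist` rubric like the one above might be totalled. It assumes that a grader awards each item between 0 and its `max_score` points and that the final score is the sum of awarded points over the sum of `max_score` values. The source does not state this scoring rule, and `total_score` and the `awarded` mapping are hypothetical names used only for illustration.

```python
import json

# Abbreviated copy of the rubric above: item names and max_score values only.
RUBRIC = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Contradicting clause identified", "max_score": 22},
    {"name": "Contradiction mechanism explained", "max_score": 18},
    {"name": "Remove/clarify approach taken", "max_score": 22},
    {"name": "Specific text targeted", "max_score": 18},
    {"name": "No compensating additions", "max_score": 10},
    {"name": "Other sections preserved", "max_score": 7},
    {"name": "Pre-review list intact", "max_score": 3}
  ]
}
""")


def total_score(rubric, awarded):
    """Return (earned, possible, percent) for a weighted checklist.

    `awarded` maps item names to points given by a grader (hypothetical
    input format). Missing items count as 0, and each award is clamped
    to the range [0, max_score] for that item.
    """
    earned = 0
    possible = 0
    for item in rubric["checklist"]:
        cap = item["max_score"]
        possible += cap
        earned += max(0, min(awarded.get(item["name"], 0), cap))
    return earned, possible, 100.0 * earned / possible


if __name__ == "__main__":
    # Example: full marks everywhere except a compensating addition was made.
    awarded = {item["name"]: item["max_score"] for item in RUBRIC["checklist"]}
    awarded["No compensating additions"] = 0
    print(total_score(RUBRIC, awarded))
```

Under this assumed rule the seven weights sum to 100, so each item's `max_score` reads directly as a percentage of the scenario score; the two 22-point items (identifying the contradicting clause and removing or rewriting it) together carry 44% of it.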

Install with Tessl CLI

npx tessl i experiments/eval-improve@0.4.0

evals
  scenario-1
  scenario-2
  scenario-3
  scenario-4
    rubric.json
    task.md
  scenario-5
skills
README.md
tile.json

evals/scenario-4/rubric.json