Eval-driven process for improving best-practice skills — analyse eval results, research what agents get wrong, rewrite for maximum uplift, and measure improvement with scenarios.
An eval-driven process for improving tessl skills. Follow these phases in order. Each phase has a clear goal and exit criteria before moving to the next.
## Phase 1: Understand the current skill

Goal: Understand what the skill currently teaches and how it's structured.

Install the skill locally:

```shell
tessl install <workspace/tile>
```

Read three things:

- The SKILL.md — note which topics get the most space. This is where the author thought the value was — the evals will tell you if they were right.
- The verifier checklist (verifiers/*.json) — how many checklist items? What do they cover vs what the skill teaches?
- The eval scenarios, if any are present — what tasks and criteria already exist.
## Phase 2: Analyze the eval results

Goal: Identify where the skill helps, where it doesn't, and where it fails.

```shell
tessl eval list --tile <workspace/tile>
tessl eval view <most-recent-eval-id> --json
```

Build an uplift table from the results:
| Scenario | Baseline | With Skill | Uplift |
|---|---|---|---|
| ... | ... | ... | ... |
Categorize each scenario:
Drill into item-level scores within each scenario. These reveal exactly which practices agents miss without the skill and which they still miss even with it.
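The categorization step above can be sketched as a small helper. The category labels and the +10 / 85% thresholds here are illustrative assumptions, not values prescribed by this process:

```javascript
// Hypothetical helper for interpreting one row of the uplift table.
// Labels and thresholds are illustrative choices only.
function categorize({ baseline, withSkill }) {
  const uplift = withSkill - baseline;
  if (uplift >= 10) return 'skill adds clear value';
  if (baseline >= 85) return 'agents already know this (or the task leaks the answer)';
  return 'skill is not teaching this effectively';
}
```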
Exit criteria: You can explain, for every scenario, where the skill helps, where it doesn't, and why.
## Phase 3: Research missing practices

Goal: Find practices that agents commonly get wrong that the skill doesn't yet cover.
Think about what catches people in production with this technology:
For each candidate, answer three questions:
Keep only candidates that pass all three.
For each practice, find:
References validate that a practice is real (not invented) and give agents a path to dig deeper when needed.
Exit criteria: You have a prioritized list of 3-8 new practices, each with: a one-line description, why it matters, and at least one reference link.
## Phase 4: Rewrite the SKILL.md

Goal: Produce a SKILL.md that maximizes uplift per token.
Allocate space by uplift, not by topic breadth. If pool configuration is where all the uplift comes from, it gets the most space. If transactions already score 99% baseline, they get a short section.
Lead with code. Agents learn patterns from examples more effectively than from prose. For each practice:

```
// RIGHT — description of correct pattern
<code example>

// WRONG — description of what to avoid
<code example>
```

One line explaining why.

Be specific about values. Don't say "configure timeouts" — say `connectionTimeoutMillis: 5000`. Agents copy what they see in the skill. Vague guidance produces vague output.
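As an illustration of what "specific values" means in practice, here is a hedged sketch for a hypothetical node-postgres pool section. The names and numbers are example choices, not recommendations from this document:

```javascript
// RIGHT — explicit limits with concrete values (illustrative numbers)
const poolConfig = {
  max: 10,                        // cap concurrent connections
  connectionTimeoutMillis: 5000,  // fail fast instead of hanging on a stalled DB
  idleTimeoutMillis: 30000,       // recycle idle connections
};

// WRONG — "configure timeouts" with no values; defaults can hang indefinitely
const vagueConfig = {};
```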
Trim low-uplift sections to their essentials. Don't remove them entirely — they may help in edge cases — but don't let them dominate the token budget.
Add a References section at the bottom with links to docs, issues, and posts.
Update the Checklist to cover every practice the skill teaches.
```markdown
# <Technology> Best Practices

One-line summary.

---

## <Highest-Uplift Topic>

<Code example with configuration>

### Key rules

- Bullet points with specific values and reasons

---

## <Next-Highest-Uplift Topic>

<RIGHT vs WRONG code examples>

One-line explanation.

---

## <Remaining topics, ordered by expected uplift>

...

---

## Checklist

- [ ] Item per practice, specific enough to verify

---

## References

- [Link text](url) — one-line description of what it covers
```

Exit criteria: The rewritten SKILL.md covers all high-uplift practices with specific code examples, trims low-uplift content, and includes references.
## Phase 5: Update the verifier checklist

Goal: Ensure the verifier checklist covers every practice the skill teaches.
Edit the verifier JSON in verifiers/. For each practice, add a checklist item:

```json
{
  "name": "short-identifier",
  "rule": "Agent does X (stated as positive assertion with specific details)",
  "relevant_when": "Agent is doing Y (scoping condition)"
}
```

Match each item to the skill's specifics: if the skill says `keepAlive: true`, the checklist item should check for `keepAlive`, not just "production configuration".

Exit criteria: Every practice in the SKILL.md checklist has a corresponding verifier checklist item.
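For instance, a concrete item for the hypothetical `keepAlive` practice above might look like this (the `name` and wording are illustrative, not from any real verifier):

```json
{
  "name": "pool-keepalive-enabled",
  "rule": "Agent sets keepAlive: true on the connection pool configuration",
  "relevant_when": "Agent is creating or configuring a database connection pool"
}
```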
## Phase 6: Check for regressions

Goal: Confirm no regression on what was already working.

```shell
tessl tile lint <path>
tessl tile publish --bump patch <path>
```

If scenarios aren't already downloaded locally:

```shell
tessl scenario list --workspace <workspace> --json
# Find the generation ID for this tile
tessl scenario download <generation-id> --output <path>/evals
```

Run evals:

```shell
tessl eval run --label "<version> - <brief description>" <path>
tessl eval view <eval-id>
```

Exit criteria: All previously-passing scenarios still pass. No regressions.
## Phase 7: Measure uplift on new practices

Goal: Measure uplift on the new practice areas.

```shell
tessl scenario generate --count <N> <path>
# Wait for completion
tessl scenario view <generation-id>
# Download when complete
tessl scenario download <generation-id> --output <path>/evals
# Run full eval (original + new scenarios)
tessl eval run --label "<version> - new scenarios" <path>
```

Request roughly one scenario per major new practice area.
Build the full uplift table across all scenarios (original + new):
| Scenario | Baseline | With Skill | Uplift |
|---|---|---|---|
| ... | ... | ... | ... |
Interpret the results:

- New scenarios with +10 or more uplift confirm the new practices are teaching agents something they didn't know. This is the primary success metric.
- New scenarios with low uplift mean the practice may not be as commonly missed as expected, or the skill's guidance isn't clear enough. But first check whether the task description is leaking the answer (see Phase 8).
- With-context scores below 95% warrant drilling into item-level scores. The skill is trying to teach this but the agent isn't learning it from the current wording. This is the input for Phase 9.
Exit criteria: You have uplift measurements for every practice area. You know which practices delivered value and which didn't.
## Phase 8: Audit scenarios for leakage

Goal: Ensure eval scenarios are measuring skill value, not task leakage.
After the first full eval run, read every task.md and check for answer leakage — implementation hints in the task description that inflate the baseline and hide the skill's real contribution.
For each scenario, compare the baseline score against what the task description reveals. A high baseline (85%+) on a practice the skill is meant to teach is a red flag. Read the task and ask: could an agent pass these criteria just by following the hints in the task description, without the skill?
Common leakage patterns:
A good task describes:
A good task does NOT describe:
Benchmark: Scenario 7 (bulk inserts). Task says "took 90 seconds for 2,000 items because of individual round trips, make it faster." Describes the problem without naming unnest. Baseline: 56%. Uplift: +44. This is what a well-written task looks like.
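As a sketch of the pattern that scenario rewards — assuming node-postgres, with an illustrative table and columns; `unnest` is the PostgreSQL function the benchmark alludes to:

```javascript
// WRONG — one round trip per item; ~2,000 sequential INSERTs:
// for (const item of items) {
//   await pool.query('INSERT INTO items (name, qty) VALUES ($1, $2)',
//                    [item.name, item.qty]);
// }

// RIGHT — a single round trip: unnest parallel arrays server-side
async function bulkInsert(pool, items) {
  const names = items.map((i) => i.name);
  const qtys = items.map((i) => i.qty);
  await pool.query(
    'INSERT INTO items (name, qty) SELECT * FROM unnest($1::text[], $2::int[])',
    [names, qtys]
  );
}
```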
The most valuable thing a best-practice skill can do is make an agent proactively apply a practice when the task doesn't ask for it. This is the difference between:
The best scenarios describe a business requirement and let the criteria check whether the agent applied the best practice without being told to. The skill's job is to make the agent think "this is a POST endpoint that creates resources — I should make it idempotent" or "this is behind a reverse proxy — I need ProxyFix."
How to write proactive scenarios:
How to detect instructed scenarios: If the task mentions the practice by name (e.g., "add rate limiting", "implement idempotency", "configure CORS"), the baseline agent will do it too. The scenario is testing implementation skill, not judgment. These scenarios have high baselines and low uplift — the skill can't add value because the task already told the agent what to do.
The litmus test: Read the task and ask: "Would a junior developer who has never heard of [practice X] know to add it based on this task description?" If yes, the task is instructing. If no, the task is testing proactive application — and that's where skills shine.
Example — idempotency: an instructed task says "implement idempotency keys on the payment endpoint"; a proactive task says "clients sometimes retry payment requests after network timeouts" and lets the criteria check that duplicates aren't double-charged.
Rewrite leaky task.md files, then re-run evals. Expect:
Also read every criteria.json and check for:
- Criteria that accept only one implementation — e.g. a criterion expecting `= ANY($1::text[])` when the correct pattern for a PostgreSQL array column is the `&&` (overlap) operator. Fix by broadening the description to accept all valid approaches.
- Criteria that can never apply — e.g. expecting `check_violation` error handling when the schema has no CHECK constraints. Fix by adding context to the task that makes the practice relevant.

Exit criteria: Task descriptions describe problems without prescribing solutions. Criteria accept all valid implementation approaches. Baselines reflect what agents actually know, not what the task told them.
## Phase 9: Close remaining gaps

Goal: Close remaining gaps where the with-context score is below 95%.

For each underperforming checklist item, diagnose which of these four root causes applies:
1. Coverage gap. The skill doesn't cover the practice, or covers it too vaguely. Fix by adding a specific code example to the SKILL.md.
2. Variant mismatch. The skill shows one version of a pattern but the scenario needs a different variant. Example: the skill shows `ANY($1::int[])` but the scenario involves a `text[]` array column needing the `&&` operator. Fix by adding variant examples to the skill.
3. Narrow criteria. The agent's output is correct but the criteria description is too narrow. Fix by broadening the criteria to accept alternative valid approaches. Don't change the skill to match a bad rubric.
4. Skill-induced confusion. The with-context score is lower than baseline (a regression). The skill's content led the agent away from the right answer. Check for:
Fix by clarifying, not adding more content. Sometimes removing or shortening a section helps more than expanding it.
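For the variant-mismatch case, the fix is to show both forms side by side. A hedged sketch, assuming node-postgres; `posts`, `id`, and `tags` are illustrative names:

```javascript
// Scalar column: is this row's id one of the given values?
const byIds = (pool, ids) =>
  pool.query('SELECT * FROM posts WHERE id = ANY($1::int[])', [ids]);

// Array column (tags text[]): do this row's tags overlap the given values?
const byTags = (pool, tags) =>
  pool.query('SELECT * FROM posts WHERE tags && $1::text[]', [tags]);
```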
Exit criteria: All scenarios score 95%+ with context, or remaining gaps are explained by criteria issues rather than skill deficiencies.
When adding examples, stay concrete:

- A rule like "check `err.code === '23505'` for unique violations" teaches a specific pattern.
- If a pattern has variants (`= ANY` for scalar columns vs `&&` for array columns), show both. Agents apply exactly what the skill demonstrates.
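A minimal sketch of the first point, assuming node-postgres error objects; '23505' is PostgreSQL's unique_violation code and '23503' is foreign_key_violation. The function name and return labels are illustrative:

```javascript
// Map PostgreSQL error codes to domain-level outcomes instead of
// catching insert errors generically.
function classifyInsertError(err) {
  if (err.code === '23505') return 'duplicate';         // unique_violation
  if (err.code === '23503') return 'missing-reference'; // foreign_key_violation
  return 'unknown';
}
```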