tessl-labs/best-practice-skill-improver

Eval-driven process for improving best-practice skills — analyse eval results, research what agents get wrong, rewrite for maximum uplift, and measure improvement with scenarios.

name: best-practice-skill-improver
description: Eval-driven process for improving tessl skills that teach best practices. Use when asked to improve, optimize, or iterate on a tessl tile or skill, or when creating a new best-practice skill from scratch. Covers analysing eval results, identifying high-uplift practices, rewriting skills, updating verifiers, and measuring improvement with scenarios.
keywords: tessl, skill, tile, eval, optimization, best practices, verifier, scenario, uplift, improvement
license: MIT

Improving Best-Practice Skills

An eval-driven process for improving tessl skills. Work through the phases in order; each phase has a clear goal and exit criteria to meet before moving to the next.


Phase 1: Install and Read

Goal: Understand what the skill currently teaches and how it's structured.

tessl install <workspace/tile>

Read three things:

  1. SKILL.md — what topics does it cover? How much space does each get?
  2. Verifiers (in verifiers/*.json) — how many checklist items? What do they cover vs what the skill teaches?
  3. tile.json — current version, metadata

Note which topics get the most space. This is where the author thought the value was — the evals will tell you if they were right.


Phase 2: Analyse Existing Evals

Goal: Identify where the skill helps, where it doesn't, and where it fails.

tessl eval list --tile <workspace/tile>
tessl eval view <most-recent-eval-id> --json

Build an uplift table from the results:

| Scenario | Baseline | With Skill | Uplift |
|----------|----------|------------|--------|
| ...      | ...      | ...        | ...    |

Categorize each scenario:

  • High uplift (10+ points) — the skill is providing real value. Protect and strengthen these sections.
  • Low/zero uplift — the skill covers this but agents already know it. This content costs tokens without changing behavior. Candidates for trimming.
  • Partial scores with context (<100%) — the skill is trying to help but failing. Needs clearer examples or more specific guidance.

Drill into item-level scores within each scenario. These reveal exactly which practices agents miss without the skill and which they still miss even with it.
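The categorization can be done mechanically once the scores are in hand. The sketch below assumes a simplified per-scenario result shape (name plus baseline and with-skill percentages); the actual `tessl eval view --json` output schema may differ, so treat the field names as placeholders.

```typescript
// Hypothetical result shape; the real `tessl eval view --json` schema may differ.
interface ScenarioResult {
  scenario: string;
  baseline: number;  // score without the skill (0-100)
  withSkill: number; // score with the skill in context (0-100)
}

function uplift(r: ScenarioResult): number {
  return r.withSkill - r.baseline;
}

// Mirror the three buckets above. Note a high-uplift scenario can still sit
// below 100% with context, so check item-level scores there too.
function categorize(r: ScenarioResult): string {
  if (uplift(r) >= 10) return "high-uplift";
  if (r.withSkill < 100) return "failing-with-context";
  return "low-uplift";
}
```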

Exit criteria

You can explain, for every scenario: where the skill helps, where it doesn't, and why.


Phase 3: Research High-Uplift Practices

Goal: Find practices that agents commonly get wrong that the skill doesn't yet cover.

Finding candidates

Think about what catches people in production with this technology:

  • Search GitHub issues on the library's repo — issues with many comments or reactions are pain points the community hits repeatedly
  • Search for "common mistakes" or "gotchas" blog posts
  • Check the library's FAQ or wiki — FAQ items exist because people keep asking
  • Think about type-system edge cases, configuration defaults that are wrong for production, and silent failures

Filtering candidates

For each candidate, answer three questions:

  1. Would an agent get this wrong without guidance? If agents already do it right, skip it.
  2. Does getting it wrong cause a real bug? Not a style issue — a bug, outage, or security vulnerability.
  3. Can it be taught in a concise code example? If it takes 500 words of prose to explain, it's too complex for a skill.

Keep only candidates that pass all three.

Gathering references

For each practice, find:

  • Official documentation link — the authoritative source
  • GitHub issues — where the community discusses the problem
  • Blog posts — that explain the "why" with data or examples

References validate that a practice is real (not invented) and give agents a path to dig deeper when needed.

Exit criteria

You have a prioritized list of 3-8 new practices, each with: a one-line description, why it matters, and at least one reference link.


Phase 4: Rewrite the Skill

Goal: Produce a SKILL.md that maximizes uplift per token.

Principles

Allocate space by uplift, not by topic breadth. If pool configuration is where all the uplift comes from, it gets the most space. If transactions already score 99% baseline, they get a short section.

Lead with code. Agents learn patterns from examples more effectively than from prose. For each practice:

// RIGHT — description of correct pattern
<code example>

// WRONG — description of what to avoid
<code example>

One line explaining why.

Be specific about values. Don't say "configure timeouts" — say connectionTimeoutMillis: 5000. Agents copy what they see in the skill. Vague guidance produces vague output.
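As an illustration, "specific" guidance for a node-postgres pool might look like the block below. The values are examples to copy and tune, not universal recommendations, and the option names assume the node-postgres `Pool` configuration.

```typescript
// Illustrative node-postgres pool settings. Tune the numbers for your workload.
const poolConfig = {
  max: 10,                       // cap concurrent connections per instance
  connectionTimeoutMillis: 5000, // fail fast instead of hanging on checkout
  idleTimeoutMillis: 30000,      // recycle idle connections
  keepAlive: true,               // survive idle-socket drops by load balancers
};
```

An agent copying this block gets concrete numbers it can adjust; "configure timeouts" gives it nothing to copy.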

Trim low-uplift sections to their essentials. Don't remove them entirely — they may help in edge cases — but don't let them dominate the token budget.

Add a References section at the bottom with links to docs, issues, and posts.

Update the Checklist to cover every practice the skill teaches.

Structure template

# <Technology> Best Practices

One-line summary.

---

## <Highest-Uplift Topic>
<Code example with configuration>
### Key rules
- Bullet points with specific values and reasons

---

## <Next-Highest-Uplift Topic>
<RIGHT vs WRONG code examples>
One-line explanation.

---

## <Remaining topics, ordered by expected uplift>
...

---

## Checklist
- [ ] Item per practice, specific enough to verify

---

## References
- [Link text](url) — one-line description of what it covers

Exit criteria

The rewritten SKILL.md covers all high-uplift practices with specific code examples, trims low-uplift content, and includes references.


Phase 5: Update Verifiers

Goal: Ensure the verifier checklist covers every practice the skill teaches.

Edit the verifier JSON in verifiers/. For each practice, add a checklist item:

{
  "name": "short-identifier",
  "rule": "Agent does X (stated as positive assertion with specific details)",
  "relevant_when": "Agent is doing Y (scoping condition)"
}

Rules for good checklist items

  • Be specific. "Agent uses parameterized queries" is testable. "Agent writes safe code" is not.
  • Include the relevant_when. This prevents the item from triggering on unrelated tasks.
  • Match the skill's actual guidance. If the skill says keepAlive: true, the checklist item should check for keepAlive, not just "production configuration".
  • One practice per item. Don't combine "SSL and graceful shutdown" into one item.
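Applying these rules, two hypothetical items for a database skill might look like this. The field names follow the schema above; the practices themselves are invented examples.

```json
[
  {
    "name": "parameterized-queries",
    "rule": "Agent passes user input via query parameters ($1, $2) rather than string interpolation",
    "relevant_when": "Agent writes SQL that includes user-supplied values"
  },
  {
    "name": "pool-keepalive",
    "rule": "Agent sets keepAlive: true in the pool configuration",
    "relevant_when": "Agent configures a connection pool for a long-running service"
  }
]
```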

Exit criteria

Every practice in the SKILL.md checklist has a corresponding verifier checklist item.


Phase 6: Publish and Eval Existing Scenarios

Goal: Confirm no regression on what was already working.

tessl tile lint <path>
tessl tile publish --bump patch <path>

If scenarios aren't already downloaded locally:

tessl scenario list --workspace <workspace> --json
# Find the generation ID for this tile
tessl scenario download <generation-id> --output <path>/evals

Run evals:

tessl eval run --label "<version> - <brief description>" <path>
tessl eval view <eval-id>

What to check

  • High-uplift scenarios still score 100% with context. If they regressed, something in the rewrite broke what was working.
  • Baseline scores haven't changed. They shouldn't — the baseline runs without the skill.
  • No new failures on previously-passing items.

Exit criteria

All previously-passing scenarios still pass. No regressions.


Phase 7: Generate New Scenarios and Eval

Goal: Measure uplift on the new practice areas.

tessl scenario generate --count <N> <path>
# Wait for completion
tessl scenario view <generation-id>
# Download when complete
tessl scenario download <generation-id> --output <path>/evals
# Run full eval (original + new scenarios)
tessl eval run --label "<version> - new scenarios" <path>

Request roughly one scenario per major new practice area.

Analysing results

Build the full uplift table across all scenarios (original + new):

| Scenario | Baseline | With Skill | Uplift |
|----------|----------|------------|--------|
| ...      | ...      | ...        | ...    |

New scenarios with +10 or more uplift — the new practices are teaching agents something they didn't know. This is the primary success metric.

New scenarios with low uplift — the practice may not be as commonly missed as expected, or the skill's guidance isn't clear enough. But first check whether the task description is leaking the answer (see Phase 8).

With-context scores below 95% — drill into item-level scores. The skill is trying to teach this but the agent isn't learning it from the current wording. This is the input for Phase 9.

Exit criteria

You have uplift measurements for every practice area. You know which practices delivered value and which didn't.


Phase 8: Audit Scenario Quality

Goal: Ensure eval scenarios are measuring skill value, not task leakage.

After the first full eval run, read every task.md and check for answer leakage — implementation hints in the task description that inflate the baseline and hide the skill's real contribution.

Detecting answer leakage

For each scenario, compare the baseline score against what the task description reveals. A high baseline (85%+) on a practice the skill is meant to teach is a red flag. Read the task and ask: could an agent pass these criteria just by following the hints in the task description, without the skill?

Common leakage patterns:

  • Naming the solution in the problem description. "idle connections are being dropped by the load balancer" → agent adds keepAlive. Instead say "the service becomes unresponsive during low-traffic periods."
  • Describing the root cause. "values read from the database were JavaScript strings rather than numbers" → agent adds type parsers. Instead describe the symptom: "balance calculations are sometimes wrong."
  • Naming the anti-pattern to avoid. "the junior engineer maps over arrays to build placeholder lists" → agent avoids that exact pattern. Instead say "code review flagged inconsistent query patterns."
  • Prescribing the architecture. "DDL must never run at application startup; use numbered SQL migration files" → agent does exactly that. Instead say "the team needs the database schema set up."

The good task formula

A good task describes:

  • What to build (the feature or module)
  • What's going wrong (symptoms, not causes)
  • What success looks like (output specification)

A good task does NOT describe:

  • The implementation pattern to use
  • The root cause of the problem
  • The specific configuration values needed

Benchmark: Scenario 7 (bulk inserts). Task says "took 90 seconds for 2,000 items because of individual round trips, make it faster." Describes the problem without naming unnest. Baseline: 56%. Uplift: +44. This is what a well-written task looks like.

Proactive application, not instructed implementation

The most valuable thing a best-practice skill can do is make an agent proactively apply a practice when the task doesn't ask for it. This is the difference between:

  • Instructed: "Build a checkout form with idempotency protection" → any agent can do this
  • Proactive: "Build a checkout form that submits orders" → only a skill-equipped agent adds idempotency

The best scenarios describe a business requirement and let the criteria check whether the agent applied the best practice without being told to. The skill's job is to make the agent think "this is a POST endpoint that creates resources — I should make it idempotent" or "this is behind a reverse proxy — I need ProxyFix."

How to write proactive scenarios:

  1. Describe the feature, not the practice. "Build an order submission page" not "Build an idempotent order submission page."
  2. Include context that makes the practice relevant, but don't name it. "The app runs behind nginx" (implies ProxyFix needed) not "add ProxyFix middleware."
  3. Criteria check for the practice the agent should have applied proactively.

How to detect instructed scenarios: If the task mentions the practice by name (e.g., "add rate limiting", "implement idempotency", "configure CORS"), the baseline agent will do it too. The scenario is testing implementation skill, not judgment. These scenarios have high baselines and low uplift — the skill can't add value because the task already told the agent what to do.

The litmus test: Read the task and ask: "Would a junior developer who has never heard of [practice X] know to add it based on this task description?" If yes, the task is instructing. If no, the task is testing proactive application — and that's where skills shine.

Example — idempotency:

  • BAD task: "Build a checkout form that prevents duplicate order submissions using idempotency keys"
  • GOOD task: "Build a checkout page where customers enter their details and place an order via POST /api/orders"
  • The criteria then check: Does the agent add an idempotency key header? Does it disable the submit button during submission? Does it handle retries?
  • A baseline agent builds a working form but doesn't think about duplicates. A skill-equipped agent proactively adds idempotency protection.
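A sketch of the extra step a skill-equipped agent might add to the GOOD task. The header name follows the common `Idempotency-Key` convention, and the helper is illustrative, not something the task prescribes.

```typescript
// Illustrative client-side idempotency: generate one key per checkout attempt
// and reuse it across retries so the server can deduplicate.
function buildOrderRequest(order: object, idempotencyKey: string) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Same key on retry, so the server creates the order at most once.
      "Idempotency-Key": idempotencyKey,
    },
    body: JSON.stringify(order),
  };
}
```

The criteria can then check for this header without the task ever mentioning idempotency.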

Fixing leaky tasks

Rewrite leaky task.md files, then re-run evals. Expect:

  • Baselines to drop on rewritten scenarios (the agent no longer gets free hints)
  • With-skill scores to stay high (the skill is doing the teaching, not the task)
  • Overall uplift to increase (the delta between baseline and with-skill widens)

Checking criteria quality

Also read every criteria.json and check for:

  • Criteria too narrow — only accepts one valid implementation when multiple are correct. Example: a criterion that requires = ANY($1::text[]) but the correct pattern for a PostgreSQL array column is the && (overlap) operator. Fix by broadening the description to accept all valid approaches.
  • Criteria with no trigger in the task — checks for a practice the task gives no reason to use. Example: checking for check_violation error handling when the schema has no CHECK constraints. Fix by adding context to the task that makes the practice relevant.
  • Double-counting — two criteria that reward the same underlying change. Merge into one.
  • Free points — criteria that score 100% on every baseline (e.g., "no unrelated changes"). Remove unless the scenario specifically tests scope discipline.

Exit criteria

Task descriptions describe problems without prescribing solutions. Criteria accept all valid implementation approaches. Baselines reflect what agents actually know, not what the task told them.


Phase 9: Iterate on Gaps

Goal: Close remaining gaps where with-context score is below 95%.

For each underperforming checklist item, diagnose which of these four root causes applies:

1. Skill gap — the skill doesn't teach it

The skill doesn't cover the practice, or covers it too vaguely. Fix by adding a specific code example to the SKILL.md.

2. Skill mismatch — the skill teaches a different variant

The skill shows one version of a pattern but the scenario needs a different variant. Example: skill shows ANY($1::int[]) but the scenario involves a text[] array column needing the && operator. Fix by adding variant examples to the skill.
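For example, both variants side by side (node-postgres-style placeholders; the table and column names are invented):

```typescript
// Scalar column matched against a list of values:
const byStatus = "SELECT * FROM orders WHERE status = ANY($1::text[])";

// Array column checked for overlap with a list of values.
// = ANY does not fit here; PostgreSQL's && overlap operator does.
const byTags = "SELECT * FROM posts WHERE tags && $1::text[]";
```

A skill that shows only the first variant invites agents to misapply it to array columns.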

3. Criteria problem — the rubric rejects valid code

The agent's output is correct but the criteria description is too narrow. Fix by broadening the criteria to accept alternative valid approaches. Don't change the skill to match a bad rubric.

4. Context interference — the skill distracts the agent

The with-context score is lower than baseline (regression). The skill's content led the agent away from the right answer. Check for:

  • Contradictory guidance between skill sections
  • An overly long skill where the relevant section gets buried
  • The skill emphasizing a different pattern that the agent applies instead

Fix by clarifying, not adding more content. Sometimes removing or shortening a section helps more than expanding it.

Process for each gap

  1. Read the failing criterion description
  2. Read what the skill currently teaches about this topic
  3. Classify as gap / mismatch / criteria problem / interference
  4. Apply the targeted fix
  5. Republish and re-eval to confirm

Exit criteria

All scenarios score 95%+ with context, or remaining gaps are explained by criteria issues rather than skill deficiencies.


Metrics

  • Uplift = with-context score minus baseline score. This is the value the skill provides.
  • Average uplift across all scenarios is the headline metric.
  • Per-scenario uplift reveals which topics help most.
  • Per-item scores reveal specific practices agents miss.

Anti-patterns

  • Adding content agents already know. If baseline is 100%, teaching that topic wastes tokens. Verify with evals before adding.
  • Being vague. "Use proper error handling" teaches nothing. "Check err.code === '23505' for unique violations" teaches a specific pattern.
  • Optimizing for rubric instead of real-world value. Don't distort the skill to game narrow rubric items.
  • Covering too many topics shallowly. Deep coverage of 5 high-impact practices beats shallow coverage of 15. Token budget is finite.
  • Skipping references. Without references, practices look like opinions. With links to official docs and community issues, they're verifiable facts.
  • Ignoring item-level scores. Scenario-level averages hide the signal. A scenario at 90% might have one item at 0% and the rest at 100% — that's a targeted fix, not a rewrite.
  • Trusting high baselines at face value. A baseline of 90% might mean agents know this practice — or it might mean the task description told them what to do. Always read the task.md before concluding a practice is already well-known.
  • Only showing one variant of a pattern. If a practice has multiple valid forms (e.g., = ANY for scalar columns vs && for array columns), show both. Agents apply exactly what the skill demonstrates.
  • Blaming the skill when the criteria are wrong. If the agent produces correct code that the rubric rejects, fix the criteria — don't warp the skill to match a narrow rubric.
  • Writing scenarios that instruct instead of test judgment. A task that says "add idempotency" tests implementation, not whether the agent knows to add it. Write tasks that describe business requirements and check if the agent proactively applies the practice. This is where best-practice skills provide their highest value.