Automated pipeline that takes a company name and produces a custom Tessl skill plus an eval report showing per-scenario lift (baseline agent vs with-skill agent). A1 MVP cell of the produce/consume × personalization 2x2.
Process steps in order. Do not skip ahead.
This skill consumes a selection.json (the output of the select-target skill) and runs the full downstream pipeline: skill scaffold → skill body → scenarios → bleeding/leaking audit → review gate → baseline+with-skill evals (one call, two variants) → lift analysis → report. Every step is deterministic from the prior step's output; no human input after Step 1.
The skill is mode-agnostic — it works for produce-mode (A1) and consume-mode (B1/B2) selections identically. The difference between modes lives upstream in discovery; from the selection forward, the pipeline is the same.
Note on phase ordering vs. the spec. The spec's workflow numbers evals (steps 4–5) before skill generation (step 6) because logically the eval task is independent of the implementation. Tessl's tooling, however, generates scenarios from an existing skill/tile (tessl scenario generate <tile-path>), so the implementation here scaffolds the skill first, then generates scenarios from it. The final deliverables (skill + scenarios + lift + report) match the spec contract; only the execution order differs.
Input is one of:
- a path to selection.json, or
- the most recent runs/<UTC-timestamp>/<slug>/selection.json, found by lex-sort descending of the timestamp directory name.

Load the file. The first action is to branch on selection_status BEFORE doing anything else. Do not scaffold a skill, do not invoke any tessl command, do not generate scenarios; read the status first.
Branch on selection_status:

- selected with a populated selected_target_id: proceed to Step 2 (the only branch that runs the pipeline).
- skipped: the human (or auto-pick gate) rejected all candidates. Output a brief explanation that surfaces:
  - selection_status: skipped from the selection file
  - the selection_rationale text from the selection file (or a faithful paraphrase if the field is unusually long)

  Do not run any tessl skill new, tessl scenario generate, tessl skill review, or tessl eval run command. The pipeline halts at this step by design; that halt IS the correct behavior, not a degraded one.
- defer: the human postponed the decision; no selection.json should normally exist in this state. If you encounter one, surface the situation and finish here without running any pipeline step.

If selection_status == "selected": resolve the linked discovery.json from discovery_path. Locate the selected target by selected_target_id in discovery.skill_targets[]. Cache in working memory:

- run_dir: the directory containing the selection.json (where all subsequent artifacts will be written)
- company: discovery.company
- mode: discovery.mode (defaults to "consume" for schema_version < 3)
- target: the selected skill_target entry
- domain_signal, product_surface, agentic_landscape: context for skill drafting

Proceed immediately to Step 2.
A pre-0.1.2 version of this skill had the branch buried mid-paragraph and the agent ignored it — given a skipped selection, the agent scaffolded a skill, generated scenarios, ran the review gate, attempted to run evals, and fabricated lift numbers in the run log. The published-time eval be-skipped-selection regressed from baseline 0.59 to with-skill 0.21 on exactly this failure mode. The branch is the first action in the step because the agent must commit to halting before any downstream tooling executes.
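The input resolution and status branch in this step can be sketched as below. This is a minimal illustration, not the skill's actual implementation: the function names `latest_selection` and `next_action` and the `runs_root` parameter are hypothetical, while the field names and the `runs/<UTC-timestamp>/<slug>/selection.json` layout come from the text above. It assumes timestamp directory names sort lexicographically in chronological order, and it halts on a `selected` status with no target id, which the text does not specify.

```python
import json
from pathlib import Path
from typing import Optional


def latest_selection(runs_root: Path) -> Optional[Path]:
    """Return a selection.json from the newest timestamp directory, or None."""
    # Lex-sort descending on the timestamp directory name.
    stamp_dirs = sorted(
        (d for d in runs_root.iterdir() if d.is_dir()),
        key=lambda d: d.name,
        reverse=True,
    )
    for stamp_dir in stamp_dirs:
        # Assumption: if several slugs exist, take the first alphabetically.
        hits = sorted(stamp_dir.glob("*/selection.json"))
        if hits:
            return hits[0]
    return None


def next_action(selection: dict) -> str:
    """Decide 'run_pipeline' or 'halt' before any tooling executes."""
    status = selection.get("selection_status")
    if status == "selected" and selection.get("selected_target_id"):
        return "run_pipeline"
    # skipped, defer, or anything unexpected: halt without running tessl.
    return "halt"


def load_selection(path: Path) -> dict:
    """Read the selection file as JSON."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Deciding the branch in a pure function, before any subprocess call, mirrors the requirement that the agent commits to halting first.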
Run tessl skill new to create the skill's tile scaffold:
tessl skill new \
--name "<slugified-target-title>" \
--description "<one-line description derived from target.title and target.rationale>" \
--path "<run_dir>/generated-skill"

This produces <run_dir>/generated-skill/ containing tile.json, a starter SKILL.md, and an empty evals/ directory.
If the path already exists from a prior run, remove it first — re-running on the same selection should produce a fresh scaffold, not a merge.
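A small sketch of the clean-then-scaffold behavior follows. The helper names `slugify` and `scaffold_command` are hypothetical, and the slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption, since the text only says the name is slugified. The flags are the ones shown in the command above.

```python
import re
import shutil
from pathlib import Path
from typing import List


def slugify(title: str) -> str:
    """Assumed slug rule: lowercase, collapse non-alphanumerics to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def scaffold_command(run_dir: Path, title: str, description: str) -> List[str]:
    """Remove any stale scaffold, then return the tessl argv to execute."""
    skill_dir = run_dir / "generated-skill"
    if skill_dir.exists():
        # Re-running must yield a fresh scaffold, not a merge.
        shutil.rmtree(skill_dir)
    return [
        "tessl", "skill", "new",
        "--name", slugify(title),
        "--description", description,
        "--path", str(skill_dir),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building the argv as a list avoids shell quoting problems in the description.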
Proceed immediately to Step 3.
Fill in <run_dir>/generated-skill/SKILL.md for the selected target. The draft must follow rules/skill-authoring.md:
- description includes trigger phrases derived from the target's daily-work language
- numbered step headings (## Step 1 — ...); one action per step; no decimals
- supporting material split into separate files (<run_dir>/generated-skill/<file>.md) and referenced from the skill, per the keep-skills-compact rule

If the target's kind is api_wrapper or workflow_skill, also seed <run_dir>/generated-skill/scripts/ with stub scripts the SKILL.md references (one stub per script; they can be expanded if eval runs surface gaps).
Proceed immediately to Step 4.
Run tessl scenario generate against the scaffolded tile:
tessl scenario generate --count 5 --json "<run_dir>/generated-skill"

The --count 5 matches the MVP requirement (5 scenarios per company). The command runs server-side; capture the generation id from the JSON output and download the scenarios to disk:

tessl scenario download --last --output "<run_dir>/generated-skill/evals"

The scenarios land under <run_dir>/generated-skill/evals/ in Tessl's canonical scenario format.
Generation skews to happy-path cases (per rules/plugin-evals.md). After download, hand-author at least one negative-case scenario (refuse bad input / produce silence when nothing actionable) and write it alongside the generated scenarios. Use an existing generated scenario as a structural template.
Proceed immediately to Step 5.
Run skills/build-and-evaluate/scripts/audit-scenarios.py <run_dir>/generated-skill/evals/scenarios.json. The script enforces the No-Bleeding rule from rules/plugin-evals.md (no criterion's expected literal appears verbatim in its task description).
If the audit fails, regenerate the offending scenarios (return to Step 4) with the violation list as context for tessl scenario generate. Cap regeneration attempts at 3 — if scenarios still fail, escalate to the user and finish here.
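The No-Bleeding check and the regeneration cap can be illustrated as follows. This is not the code of audit-scenarios.py: the scenario fields `id`, `task`, `criteria`, and `expected` are hypothetical stand-ins for Tessl's scenario format, and `load` and `regenerate` are injected callables so the loop can be exercised without the CLI.

```python
from typing import Callable, Dict, List

MAX_REGENERATIONS = 3  # cap stated in the text


def find_bleeding(scenarios: List[dict]) -> List[Dict[str, str]]:
    """List criteria whose expected literal appears verbatim in the task."""
    violations = []
    for scenario in scenarios:
        task = scenario.get("task", "")
        for criterion in scenario.get("criteria", []):
            expected = criterion.get("expected", "")
            if expected and expected in task:
                violations.append(
                    {"scenario": scenario.get("id", ""), "literal": expected}
                )
    return violations


def audit_with_regeneration(
    load: Callable[[], List[dict]],
    regenerate: Callable[[List[Dict[str, str]]], None],
) -> bool:
    """Return True once the audit passes, False if the cap is exhausted."""
    for attempt in range(MAX_REGENERATIONS + 1):
        violations = find_bleeding(load())
        if not violations:
            return True
        if attempt < MAX_REGENERATIONS:
            # Pass the violation list as context for the next generation.
            regenerate(violations)
    return False  # caller escalates to the user
```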
If audit passes, proceed immediately to Step 6.
Run the skill-review quality gate:
tessl skill review --threshold 85 --json "<run_dir>/generated-skill" > "<run_dir>/skill-review.json"

Branch on the result:
- Below threshold: revise the skill and re-run tessl skill review --threshold 85. Cap re-review attempts at 3.
- […] rules/context-artifacts.md) and do NOT run tessl skill review --optimize and ship verbatim (also forbidden by the same rule).

Proceed immediately to Step 7.
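The gate logic can be pictured as a small decision function. Everything here beyond the threshold of 85 and the cap of 3 is an assumption: the `score` key in skill-review.json is hypothetical, passing is taken to mean score >= threshold, and the `cap_reached` outcome is a placeholder because the text does not say what happens once the cap is hit.

```python
MAX_REREVIEWS = 3  # cap stated in the text


def gate_decision(review: dict, rereviews_done: int, threshold: int = 85) -> str:
    """Return 'proceed', 'revise_and_rereview', or 'cap_reached'."""
    score = review.get("score")  # hypothetical key name
    if isinstance(score, (int, float)) and score >= threshold:
        return "proceed"
    if rereviews_done < MAX_REREVIEWS:
        return "revise_and_rereview"
    return "cap_reached"
```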
Run baseline (skill not loaded) and with-skill (skill loaded) in a single Tessl call using its built-in variant mechanism:
tessl eval run \
--variant without-context \
--variant with-context \
--agent claude:claude-sonnet-4-6 \
--label "<run_dir>/generated-skill — A1 baseline+with-skill" \
--json \
"<run_dir>/generated-skill" > "<run_dir>/eval-run.json"

without-context is the baseline (no skill context loaded into the solver); with-context is the with-skill run (full tile context loaded). Tessl runs both against the same scenarios in one invocation; this is the canonical baseline-vs-with-skill comparison shape.
If the eval run fails (network, project-not-linked, missing workspace), run tessl doctor to diagnose, fix, and retry once. Do not skip the eval — the lift number is the report's load-bearing output.
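The diagnose-and-retry-once rule might look like the sketch below. The function names are hypothetical and `run` is an injected callable returning an exit code, so nothing here depends on how the tessl CLI behaves. The sketch only sequences the calls; applying whatever fix `tessl doctor` suggests, and redirecting output to eval-run.json, are left to the caller.

```python
from typing import Callable, List


def eval_command(skill_dir: str, label: str) -> List[str]:
    """Build the two-variant eval invocation shown in the text."""
    return [
        "tessl", "eval", "run",
        "--variant", "without-context",
        "--variant", "with-context",
        "--agent", "claude:claude-sonnet-4-6",
        "--label", label,
        "--json",
        skill_dir,
    ]


def run_eval_with_retry(
    run: Callable[[List[str]], int], skill_dir: str, label: str
) -> bool:
    """Run the eval; on failure run tessl doctor, then retry exactly once."""
    cmd = eval_command(skill_dir, label)
    if run(cmd) == 0:
        return True
    run(["tessl", "doctor"])  # diagnose before the single retry
    return run(cmd) == 0
```

In real use `run` could be `subprocess.call`, which returns the process exit code.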
Proceed immediately to Step 8.
The Tessl eval-run output contains per-scenario per-variant scores. Parse it into the canonical two-file shape that compute-lift.py consumes:
- <run_dir>/baseline-results.json: {"mode": "baseline", "scenarios": [{"id": ..., "score": ...}], "aggregate_score": ...} from the without-context variant
- <run_dir>/with-skill-results.json: same shape from the with-context variant

If the eval-run JSON shape isn't directly compatible, run tessl eval view --last --json > <run_dir>/eval-view.json to fetch a full structured result; reason over its variants key to build the two canonical files. Document the schema in a TODO comment if it doesn't match expectations; the script can be promoted to handle Tessl's format directly once the schema stabilizes.
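As an illustration of building the two canonical files, the sketch below assumes a hypothetical eval output shaped as `{"variants": {"<variant>": {"scenarios": [{"id", "score"}]}}}`; only the `variants` key is mentioned in the text, and Tessl's real schema may differ. The aggregate is taken as the mean of scenario scores and the with-skill `mode` value is `"with-skill"`, both of which are assumptions.

```python
from statistics import mean


def canonical_results(eval_view: dict, variant: str, mode: str) -> dict:
    """Project one variant into the shape compute-lift.py is said to consume."""
    rows = eval_view["variants"][variant]["scenarios"]  # hypothetical layout
    scenarios = [{"id": row["id"], "score": row["score"]} for row in rows]
    aggregate = mean(s["score"] for s in scenarios) if scenarios else 0.0
    return {"mode": mode, "scenarios": scenarios, "aggregate_score": aggregate}


def split_variants(eval_view: dict) -> tuple:
    """Return (baseline, with_skill) canonical dicts."""
    baseline = canonical_results(eval_view, "without-context", "baseline")
    with_skill = canonical_results(eval_view, "with-context", "with-skill")
    return baseline, with_skill
```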
Then:
python3 skills/build-and-evaluate/scripts/compute-lift.py \
"<run_dir>/baseline-results.json" \
"<run_dir>/with-skill-results.json" \
> "<run_dir>/lift.json"

The lift script flags near_zero_scenarios where |lift| < 0.05; these are surfaced in Step 9.
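To make the near-zero flag concrete, here is a rough sketch of a lift computation. It is not the code of compute-lift.py: it assumes lift is the with-skill score minus the baseline score, and the output keys other than `near_zero_scenarios` are hypothetical.

```python
def compute_lift(baseline: dict, with_skill: dict, near_zero: float = 0.05) -> dict:
    """Per-scenario and aggregate lift, assuming lift = with_skill - baseline."""
    base_scores = {s["id"]: s["score"] for s in baseline["scenarios"]}
    per_scenario = []
    for scenario in with_skill["scenarios"]:
        if scenario["id"] in base_scores:  # compare only shared scenarios
            lift = scenario["score"] - base_scores[scenario["id"]]
            per_scenario.append({"id": scenario["id"], "lift": lift})
    return {
        "scenarios": per_scenario,
        "aggregate_lift": with_skill["aggregate_score"] - baseline["aggregate_score"],
        # Flag scenarios where |lift| < 0.05, as described in the text.
        "near_zero_scenarios": [
            p["id"] for p in per_scenario if abs(p["lift"]) < near_zero
        ],
    }
```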
Proceed immediately to Step 9.
Reason over the lift data and the failed criteria from both eval variants to identify:
- For each scenario with lift < 0.10, classify the cause: leaked technique in the task description (re-audit needed), criterion grading universal competence (rewrite criteria), or the skill genuinely doesn't help (skill needs targeted improvement).
- […] high.

Write the analysis to <run_dir>/gap-analysis.md. Keep it under one page; the report compiles from this file.
Proceed immediately to Step 10.
Run skills/build-and-evaluate/scripts/render-report.py <run_dir> and output the absolute path of the rendered report. The script consumes:
- <run_dir>/discovery.json: sources, selected target, rationale
- <run_dir>/selection.json: human (or auto) pick + selection rationale
- <run_dir>/skill-review.json: final review score
- <run_dir>/lift.json: per-scenario + aggregate lift
- <run_dir>/gap-analysis.md: improvement suggestions

It emits <run_dir>/report.md with the structure required by the spec's Output section: research found, per-scenario lift table, aggregate lift, improvement suggestions, links to the generated skill / scenarios / results.
Output the absolute path of <run_dir>/report.md and finish here.
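Since the report depends on five artifacts from earlier steps, a preflight check before rendering is one way to catch a skipped step. The helper below is a hypothetical addition, not part of render-report.py; the file names are the ones listed in this step.

```python
from pathlib import Path
from typing import List

REQUIRED_INPUTS = (
    "discovery.json",
    "selection.json",
    "skill-review.json",
    "lift.json",
    "gap-analysis.md",
)


def missing_report_inputs(run_dir: Path) -> List[str]:
    """Return the names of report inputs that are absent from run_dir."""
    return [name for name in REQUIRED_INPUTS if not (run_dir / name).is_file()]


def report_path(run_dir: Path) -> str:
    """Absolute path of the rendered report, as the step asks to output."""
    return str((run_dir / "report.md").resolve())
```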