An automated pipeline that takes a company name and produces a custom Tessl skill, plus an eval report showing per-scenario lift (baseline agent vs. with-skill agent). This is the A1 MVP cell of the produce/consume × personalization 2x2.
88
86%: Does it follow best practices?
Impact: 89% (1.45x). Average score across 13 eval scenarios
Advisory: Suggest reviewing before use
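The summary above reports lift as a multiplier between a baseline agent and a with-skill agent. The following is a minimal sketch of how such a figure could be computed, assuming lift is the ratio of with-skill score to baseline score. The page does not state the actual formula used by compute-lift.py, so the function names, the input shape (a mapping from scenario name to score), and the sample numbers are all hypothetical.

```python
from statistics import mean


def per_scenario_lift(baseline, with_skill):
    """Return {scenario: with_skill / baseline} for scenarios present in both runs.

    ASSUMPTION: lift is a simple ratio; the real compute-lift.py may differ.
    A baseline score of 0 makes the ratio undefined, so it is reported as None.
    """
    lifts = {}
    for name, base in baseline.items():
        if name not in with_skill:
            continue  # skip scenarios that were not run in both variants
        lifts[name] = None if base == 0 else with_skill[name] / base
    return lifts


def average_lift(baseline, with_skill):
    """Ratio of the mean with-skill score to the mean baseline score."""
    shared = [name for name in baseline if name in with_skill]
    if not shared:
        return None
    base_avg = mean(baseline[name] for name in shared)
    skill_avg = mean(with_skill[name] for name in shared)
    return None if base_avg == 0 else skill_avg / base_avg


# Illustrative numbers only; these are NOT taken from the eval report.
baseline_scores = {"scenario-1": 50, "scenario-2": 80}
with_skill_scores = {"scenario-1": 100, "scenario-2": 80}

lifts = per_scenario_lift(baseline_scores, with_skill_scores)
overall = average_lift(baseline_scores, with_skill_scores)
```

Note that the ratio of averages and the average of per-scenario ratios generally differ, so a report would need to say which one its headline multiplier uses.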
{
  "context": "The agent must produce pipeline.sh that accurately encodes the build-and-evaluate skill's CLI commands with correct flags, ordering, and error-handling logic. The criteria check whether skill-specific flags and constraints (count, threshold, variant names, agent model, path handling) appear correctly.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "tessl skill new flags",
      "description": "pipeline.sh includes a 'tessl skill new' command with --name, --description, and --path flags (all three must be present)",
      "max_score": 8
    },
    {
      "name": "Remove before scaffold",
      "description": "pipeline.sh removes or checks for the existence of the generated-skill directory BEFORE running 'tessl skill new' (e.g., rm -rf, if [ -d ... ])",
      "max_score": 8
    },
    {
      "name": "Scenario generate flags",
      "description": "pipeline.sh includes 'tessl scenario generate' with '--count 5' (exactly 5) AND '--json' flags both present",
      "max_score": 12
    },
    {
      "name": "Scenario download last flag",
      "description": "pipeline.sh includes 'tessl scenario download' with the flag '--last'",
      "max_score": 8
    },
    {
      "name": "Negative case scenario",
      "description": "pipeline.sh or a comment in it explicitly mentions hand-authoring a negative-case scenario (refuse/silence/bad-input) after the download step",
      "max_score": 8
    },
    {
      "name": "Skill review threshold",
      "description": "pipeline.sh includes 'tessl skill review' with '--threshold 85' (exactly 85)",
      "max_score": 10
    },
    {
      "name": "Review retry cap",
      "description": "pipeline.sh includes logic that caps re-review attempts (e.g., a loop with a counter, or a comment stating max 3 re-runs) and does NOT lower the threshold or add '--optimize'",
      "max_score": 10
    },
    {
      "name": "Eval variant without-context",
      "description": "pipeline.sh includes 'tessl eval run' with '--variant without-context'",
      "max_score": 8
    },
    {
      "name": "Eval variant with-context",
      "description": "pipeline.sh includes 'tessl eval run' with '--variant with-context'",
      "max_score": 8
    },
    {
      "name": "Eval agent flag",
      "description": "pipeline.sh includes '--agent claude:claude-sonnet-4-6' on the 'tessl eval run' command",
      "max_score": 10
    },
    {
      "name": "compute-lift.py invocation",
      "description": "pipeline.sh invokes 'compute-lift.py' with baseline-results.json as the first argument and with-skill-results.json as the second argument",
      "max_score": 6
    },
    {
      "name": "render-report.py invocation",
      "description": "pipeline.sh invokes 'render-report.py' with run_dir as its argument",
      "max_score": 4
    }
  ]
}
evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10
scenario-11
scenario-12
scenario-13
skills
batch-driver
build-and-evaluate
company-list-filter
discovery
discovery-produce
select-target
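The criteria JSON above is a weighted checklist whose twelve max_score values sum to 100. As a rough sketch of how such a checklist might be totaled, the snippet below adds up the awarded points, clamps each item to its maximum, and reports a percentage. How the actual grader aggregates items is not stated on this page, so the function name score_checklist, the shape of the awarded mapping, and the sample awards are hypothetical.

```python
# Item names and weights copied from the criteria JSON above.
CRITERIA = {
    "type": "weighted_checklist",
    "checklist": [
        {"name": "tessl skill new flags", "max_score": 8},
        {"name": "Remove before scaffold", "max_score": 8},
        {"name": "Scenario generate flags", "max_score": 12},
        {"name": "Scenario download last flag", "max_score": 8},
        {"name": "Negative case scenario", "max_score": 8},
        {"name": "Skill review threshold", "max_score": 10},
        {"name": "Review retry cap", "max_score": 10},
        {"name": "Eval variant without-context", "max_score": 8},
        {"name": "Eval variant with-context", "max_score": 8},
        {"name": "Eval agent flag", "max_score": 10},
        {"name": "compute-lift.py invocation", "max_score": 6},
        {"name": "render-report.py invocation", "max_score": 4},
    ],
}


def score_checklist(criteria, awarded):
    """Return (earned, total, percent) for a weighted checklist.

    ASSUMPTION: the final score is the plain sum of per-item points,
    each clamped to [0, max_score]; items missing from `awarded` earn 0.
    """
    total = sum(item["max_score"] for item in criteria["checklist"])
    earned = 0
    for item in criteria["checklist"]:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    percent = 100.0 * earned / total if total else 0.0
    return earned, total, percent


# Hypothetical grading: full marks everywhere except the retry-cap item.
sample_awards = {item["name"]: item["max_score"] for item in CRITERIA["checklist"]}
sample_awards["Review retry cap"] = 0

earned, total, percent = score_checklist(CRITERIA, sample_awards)
```

Because the weights sum to 100, the earned points and the percentage coincide for this checklist; the percentage step only matters for checklists with a different total.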