Use when migrating, restructuring, publishing, or auditing an existing Claude skill into a Tessl tile; converting flat .md files or SKILL.md bundles; fixing Tessl Quality, Impact, Uplift, frontmatter, metadata, tile.json summary, README, markdown reference links, registry-vs-local Quality gaps, artifact anchors, auto-eval wait discipline, or pushing tile scores from 88-99% to 100%.
100
100%
Does it follow best practices?
Impact
100%
1.11x
Average score across 4 eval scenarios
Advisory
Suggest reviewing before use
{
"context": "Tests whether the agent applies the migrate-to-tessl eval-count aggregate rule in a noisier score-triage setting and can design a stronger future eval scenario for the same failure mode. The output must diagnose that only two eval scenarios cap aggregate at 0.9333 even though all score-bearing dimensions are perfect, reject security/manual verification/wording churn, and propose a non-obvious scenario repair.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Writes requested files",
"description": "score-triage.md and scenario-repair-plan.md both exist and cover the requested diagnosis and scenario-repair sections",
"max_score": 6
},
{
"name": "Identifies eval count as blocker",
"description": "score-triage.md states that the aggregate score is capped because the target tile has fewer than 3 eval scenarios, not because the skill content is still weak",
"max_score": 10
},
{
"name": "Cites exact count and aggregate",
"description": "score-triage.md cites `scores.evals.count: 2` and `scores.aggregate: 0.9333` (or 93%) for sharaf/release-note-architect from inputs/registry-search.json as key evidence",
"max_score": 8
},
{
"name": "Recognizes perfect score-bearing dimensions",
"description": "score-triage.md notes that the target tile's `scores.quality`, `scores.impact`, and `scores.evals.average` are all `1`, so those dimensions are already satisfied",
"max_score": 8
},
{
"name": "Explains three-scenario threshold",
"description": "score-triage.md explains that observed aggregate caps are 0.8667 for one scenario, 0.9333 for two scenarios, and 1.0 for three or more scenarios when other score-bearing fields are perfect",
"max_score": 8
},
{
"name": "Uses comparison evidence",
"description": "score-triage.md uses the comparison entries in registry-search.json (one-scenario 0.8667 and three-scenario 1.0 examples) to support the eval-count diagnosis",
"max_score": 8
},
{
"name": "Rejects wrong hypotheses",
"description": "score-triage.md explicitly rejects security advisory, missing manual verification, Quality, Impact, README/SKILL wording churn, and other broad content-polish passes as the next fix for this specific aggregate gap",
"max_score": 8
},
{
"name": "Prescribes adding a real third scenario",
"description": "score-triage.md recommends adding at least one real, quality-checked eval scenario so the target tile has at least 3 scenarios total",
"max_score": 6
},
{
"name": "Includes publish and auto-eval wait",
"description": "score-triage.md says to publish a new version with the added scenario and wait for the exact auto-eval run to complete before judging the score",
"max_score": 8
},
{
"name": "Verifies with registry search JSON",
"description": "score-triage.md uses `tessl search --json release-note-architect` or equivalent as the final verification source and says to check aggregate 1 plus eval count at least 3",
"max_score": 5
},
{
"name": "Uses tile info appropriately",
"description": "score-triage.md treats `tessl tile info` as useful summary evidence for Quality/Security status, but not as the source of the total aggregate score",
"max_score": 4
},
{
"name": "Identifies answer leakage risk",
"description": "scenario-repair-plan.md explains that a direct prompt asking why the score is 93% with an obvious `evals.count: 2` field is too easy and can inflate baseline performance",
"max_score": 5
},
{
"name": "Designs harder fixtures",
"description": "scenario-repair-plan.md proposes noisy fixture inputs such as registry search results with comparison tiles, tile-info output with security distractors, teammate hypotheses, and/or eval inventory rather than a single obvious score object",
"max_score": 7
},
{
"name": "Proposes concrete task framing",
"description": "scenario-repair-plan.md frames the future eval as score triage or migration incident response that requires diagnosing and planning the fix, rather than merely answering a leading question",
"max_score": 4
},
{
"name": "Proposes weighted rubric totaling 100",
"description": "scenario-repair-plan.md includes a proposed weighted rubric whose checks total 100 points and cover eval-count diagnosis, wrong-hypothesis rejection, add-scenario fix, publish/wait verification, and non-obvious scenario design; answer-leakage risk may be handled in the rubric or adjacent scenario-design guidance",
"max_score": 5
}
]
}
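
The eval-count diagnosis that this checklist rewards can be sketched in a few lines of Python. This is a minimal sketch, not Tessl's scoring formula: it only encodes the observed caps quoted in the checklist (0.8667 for one scenario, 0.9333 for two, 1.0 for three or more when the other score-bearing fields are perfect). The nested entry shape (`scores.quality`, `scores.impact`, `scores.aggregate`, `scores.evals.count`, `scores.evals.average`) is an assumption inferred from the dotted field names in the checklist, and the names `expected_cap` and `triage` are hypothetical helpers.

```python
import math

# Observed aggregate caps taken from the checklist. They apply only when
# quality, impact, and evals.average are all 1. These are observations,
# not a documented scoring formula.
OBSERVED_CAPS = {1: 0.8667, 2: 0.9333}
FULL_SCORE_MIN_SCENARIOS = 3
TOLERANCE = 1e-4  # registry values are quoted to four decimal places


def expected_cap(eval_count: int) -> float:
    """Return the observed aggregate cap for a given number of eval scenarios."""
    if eval_count < 1:
        raise ValueError("eval_count must be at least 1")
    if eval_count >= FULL_SCORE_MIN_SCENARIOS:
        return 1.0
    return OBSERVED_CAPS[eval_count]


def triage(entry: dict) -> str:
    """Classify one registry entry (assumed shape) by what limits its aggregate.

    Returns "content" if a score-bearing dimension is below 1, "ok" if the
    aggregate is already 1, "eval-count" if the aggregate matches the observed
    cap for the current scenario count, and "unknown" otherwise.
    """
    scores = entry["scores"]
    evals = scores["evals"]
    dimensions = (scores["quality"], scores["impact"], evals["average"])
    if not all(math.isclose(v, 1.0, abs_tol=TOLERANCE) for v in dimensions):
        return "content"
    if math.isclose(scores["aggregate"], 1.0, abs_tol=TOLERANCE):
        return "ok"
    cap = expected_cap(evals["count"])
    if math.isclose(scores["aggregate"], cap, abs_tol=TOLERANCE):
        return "eval-count"
    return "unknown"


# Values matching the target tile described in the checklist.
target = {
    "scores": {
        "quality": 1,
        "impact": 1,
        "aggregate": 0.9333,
        "evals": {"count": 2, "average": 1},
    }
}
print(triage(target))
```

A result of `eval-count` corresponds to the fix the checklist prescribes: add a real third scenario, publish, and wait for the auto-eval run before re-checking, rather than starting another content-polish pass.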