Convert skills to Tessl tiles and create eval scenarios to measure skill effectiveness.
Generate evaluation scenarios that measure how useful and effective a skill is. Together, the scenarios should cover every line-by-line instruction in the skill and provide evidence for whether each instruction is followed when it's relevant.
Good scenarios contain tasks where a general-purpose agent can produce some solution, but only some of the reasonable solutions will follow the skill's instructions. E.g. if the skill says to use a particular library, the task won't prescribe the library, but the criteria will.
The scenarios will be used in an eval harness with the following constraints:
Read SKILL.md thoroughly — this is the entry point. Then explore the rest of the skill folder, including subfolders like references/ and scripts/. Files referenced from SKILL.md often hold the most specific guidance.
Extract every instruction in the skill that directs an agent to do or not do a specific thing:
Write a file called `instructions.json`, including all the instructions you've found:

```json
{
  "instructions": [
    {
      "instruction": "<instruction from the skill>",
      "original_snippets": "<substring from the original text, including context. Separate with ...>",
      "relevant_when": "<description of the type of scenario where this would kick in>",
      "why_given": "<reminder|new knowledge|particular preference>"
    }
  ]
}
```

For `relevant_when`, describe the type of scenario that would make this instruction relevant (e.g. when writing TypeScript code, when setting up a new database). For `why_given`, give your best guess for why the skill writer included this instruction: is it a simple reminder to the agent of something it might know, is it new knowledge that the agent may not know, or is it expressing a preference among many options?
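The structure above can be sanity-checked before moving on. A minimal sketch, assuming the filename and field names follow the format above exactly:

```python
import json

REQUIRED_KEYS = {"instruction", "original_snippets", "relevant_when", "why_given"}
VALID_REASONS = {"reminder", "new knowledge", "particular preference"}

def validate_instructions(path="instructions.json"):
    """Return a list of problems with the extracted instructions file."""
    with open(path) as f:
        data = json.load(f)
    problems = []
    for i, item in enumerate(data.get("instructions", [])):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"instruction {i}: missing keys {sorted(missing)}")
        if item.get("why_given") not in VALID_REASONS:
            problems.append(f"instruction {i}: unexpected why_given {item.get('why_given')!r}")
    return problems
```

An empty list means every instruction has the four expected fields and a recognized `why_given` value.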
Plan 5 scenario ideas before writing any files. Group the instructions to design scenarios that will cover as many of the instructions as possible, by setting up conditions that match the "relevant_when" cases.
If a scenario idea turns out to be infeasible, record it in summary_infeasible.json instead. For each feasible scenario, create scenario-{idx} (0-indexed: scenario-0, scenario-1, ...) with three files.
capability.txt: A couple-word summary of the instructions being tested (e.g. "Correct directory structure") or the type of task being done.
criteria.json: The checklist that will be used to evaluate the final artifacts from the solution.
```json
{
  "context": "<2-3 sentence overview>",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "<short name (1-4 words)>",
      "description": "<what is being evaluated — keep conceptual, not exact names>",
      "max_score": "<number>"
    }
  ]
}
```

Writing binary criteria:
| Avoid (causes partial credit) | Use instead (binary) |
|---|---|
| "Repeatedly emphasizes X" | "Contains at least 3 of: 'term1', 'term2', 'term3'" |
| "Uses non-standard layout" | "Uses at least ONE of: asymmetric grid, overlapping elements, rotated content" |
| Subjective: "clear", "creative" | Presence checks: "Includes X", "Does NOT use Y" |
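Binary criteria like those in the right-hand column can be checked mechanically. A minimal sketch (term lists and thresholds are illustrative, not from any skill):

```python
def contains_at_least(text, terms, n):
    """Binary criterion: the text mentions at least n of the given terms."""
    found = [t for t in terms if t.lower() in text.lower()]
    return len(found) >= n
```

A grader can then award the item's full `max_score` when the check passes and zero otherwise, avoiding partial credit.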
Example:
```json
{
  "context": "Tests whether the agent uses the composition plan workflow for fine-grained music control, handles mutually exclusive parameters correctly, and avoids deprecated packages.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Correct client package",
      "description": "Uses the @elevenlabs/elevenlabs-js package, not the deprecated elevenlabs package",
      "max_score": 10
    },
    {
      "name": "Composition plan generation",
      "description": "Uses music.composition_plan.create() followed by music.compose() to generate a composition plan",
      "max_score": 10
    },
    {
      "name": "Mutually exclusive params",
      "description": "Does NOT pass both prompt and composition_plan to compose() — uses one or the other",
      "max_score": 10
    },
    ... // other rubrics directly testing what the skill suggests
  ]
}
```

(Abbreviated — real rubrics will likely have 10-12 items summing to 100.)
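Scoring a weighted_checklist then reduces to summing per-item scores. A minimal sketch, assuming binary criteria where the grader awards each item either 0 or its full max_score:

```python
def score_checklist(criteria, passed_names):
    """Sum max_score over checklist items whose name is in passed_names."""
    items = criteria["checklist"]
    total = sum(item["max_score"] for item in items)
    earned = sum(item["max_score"] for item in items if item["name"] in passed_names)
    return earned, total
```

With a rubric summing to 100, `earned` is directly the percentage score for the scenario.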
task.md: Describes a realistic problem that naturally makes the skill instructions relevant — but without revealing the instructions or hinting at them strongly. Make it so a competent agent could solve it in ways the skill didn't specify. Make it challenging and interesting rather than testing what's obvious.
# [Task Title]
## Problem/Feature Description
[1-2 paragraphs. Construct a believable business scenario around the skill
instructions being tested. Explain who needs this, what problem or gap exists,
what's already available. The scenario should make the skill's guidance
relevant and necessary. Write as a story, not a list of constraints.]
## Output Specification
[Give details on what should be produced.
If you ask the agent to produce a script to automate doing the task or to produce a log of the process, include those files as an output as well.
Name the expected output files and formats, unless these are part of the instructions and will give away the criteria.]
## Input Files (optional)
[Provide inlined input files that can create a starting state to work on if necessary, e.g. files before an edit.
These **must be fully generated** when the task file is written; no additional files will be available at run time.
Do not describe files to be provided later, and DO NOT mention that this is an eval]
The following files are provided as inputs. Extract them before beginning.
=============== FILE: inputs/example.txt ===============
[file contents]

The task must be self-contained and actionable. An agent should be able to read the task and immediately start working using only the task description plus the skill (or their own knowledge). If the task is too vague to act on, add more context — but focus on the problem, not the solution.
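The `FILE:` delimiter format above can be split out with a few lines. A minimal sketch, with the delimiter shape assumed from the template:

```python
import os
import re

# Matches header lines like: =============== FILE: inputs/example.txt ===============
FILE_HEADER = re.compile(r"^=+ FILE: (?P<path>\S+) =+$", re.MULTILINE)

def extract_inputs(task_text, dest="."):
    """Write each inlined input file from a task body out to disk."""
    matches = list(FILE_HEADER.finditer(task_text))
    written = []
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(task_text)
        body = task_text[m.end():end].strip("\n")
        path = os.path.join(dest, m.group("path"))
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w") as f:
            f.write(body + "\n")
        written.append(path)
    return written
```

This is how an agent (or a harness) would materialize the starting state before beginning work on the task.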
The task should not have a large number of prerequisites or require large input files.
Good vs bad tasks: Don't hint too heavily at the solution or give away details that will make it easy to pass.
| Skill Instruction | Good Task (problem-framed) | Bad Task (instruction-framed) |
|---|---|---|
| Use Batch API | "The ops team needs to migrate 500 users to a new email domain after a company rebrand" | "Update users using batch API" |
| Create A, B and C for each service | "Set up the initial cloud infrastructure for a new microservice" | "Create files A, B and C" |
| Use input validation | "Set up the form to collect user information" | "Add validation: name (string) and phone number (number)" |
| Use a valid model (one of X, Y and Z) | "The customer asks you to pick a reasonable configuration" | "The customer wants to use model X, and is based in region A" |
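One way to screen for the instruction-framed failure mode mechanically: a sketch that flags tasks quoting instruction text verbatim (the length threshold is arbitrary, and the field name follows instructions.json above):

```python
def leaked_snippets(task_text, instructions, min_len=20):
    """Flag instruction phrases that appear verbatim in the task description."""
    lowered = task_text.lower()
    leaks = []
    for item in instructions:
        phrase = item["instruction"].lower()
        if len(phrase) >= min_len and phrase in lowered:
            leaks.append(item["instruction"])
    return leaks
```

A non-empty result suggests the task is instruction-framed and should be rewritten as a problem, per the table above. Verbatim matching only catches the crudest leakage; paraphrased hints still need a manual read.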
Check for instruction leakage in each scenario. Check for eval feasibility. Run these checks across all scenarios and all three files: capability.txt, task.md, criteria.json.

Finally, write summary.json:
```json
{
  "total_scenarios": <number>,
  "instructions_coverage": {
    "total_instructions": <from instructions.json>,
    "instructions_tested": <unique instructions within scenarios>,
    "coverage_percentage": <percentage>
  },
  "reason_distribution": {
    "reminder": "<count>",
    "new knowledge": "<count>",
    "preference": "<count>"
  }
}
```

summary_infeasible.json:
```json
{
  "total_infeasible": <number>,
  "infeasible_scenarios": [
    {
      "scenario": "<scenario name>",
      "reasoning": "<why this cannot be evaluated>"
    }
  ]
}
```
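The instructions_coverage block in summary.json can be computed directly from the extracted instructions and the scenarios that test them. A minimal sketch, assuming instructions are identified by index:

```python
def coverage(total_instructions, tested_instruction_ids):
    """Build the instructions_coverage block for summary.json."""
    tested = len(set(tested_instruction_ids))  # de-duplicate across scenarios
    pct = round(100 * tested / total_instructions, 1) if total_instructions else 0.0
    return {
        "total_instructions": total_instructions,
        "instructions_tested": tested,
        "coverage_percentage": pct,
    }
```

A low coverage_percentage is a signal to add scenarios or record the untestable instructions in summary_infeasible.json.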