Use when the user asks to "create an evaluator", "create evals", "create a scenario", "write a test scenario", "design a test case", "test my agent", "build eval coverage", "plan a test suite", "create red team tests", "set up test profiles", "configure conditional actions", "write a conditional action evaluator", "build a deterministic test", "design an IVR test", "IVR navigation test", "write a unit test for a voice agent", "build a regression test", "scripted scenario", "scripted voice test", "structured evaluator", "exact flow test", "sequential conditions", "fixed sequence test", or "run evals". Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions (deterministic / unit test / regression / IVR navigation flows), and best practices for workflow / red-team / edge-case / deterministic test types.
64
76%: Does it follow best practices?
Impact: not scored; no eval scenarios have been run
Passed: no known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./cekura/skills/cekura-eval-design/SKILL.md
Quality
Discovery
82%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description excels at trigger term coverage with an exhaustive list of user phrases that would activate the skill, and it clearly addresses both 'what' and 'when'. However, the capability description reads more like a category list than concrete actions, and some trigger terms are generic enough to potentially conflict with general testing skills. The description is also quite verbose and front-loads a massive trigger list before explaining what the skill actually does.
Suggestions
Restructure to lead with concrete capability verbs (e.g., 'Designs evaluators, builds test suites, configures conditional actions for voice agents and IVR flows') before the trigger phrase list to improve specificity.
Narrow generic trigger terms like 'write a unit test' and 'build a regression test' by qualifying them (e.g., 'write a unit test for a voice agent') consistently to reduce conflict risk with general testing skills.
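The second suggestion can be checked mechanically. The following is a minimal sketch, not part of the Tessl tooling: it pulls the quoted trigger phrases out of a description and flags those that contain no domain qualifier. The `DOMAIN_WORDS` set and the function names are hypothetical, chosen only for illustration; a real check would need a qualifier list agreed on by the skill's maintainers.

```python
import re

# Hypothetical qualifier list (an assumption, not from the review):
# words taken to mark a trigger phrase as specific to this skill's domain.
DOMAIN_WORDS = frozenset({"voice", "ivr", "agent", "eval", "evals", "evaluator"})


def quoted_phrases(description: str) -> list[str]:
    """Return every double-quoted phrase in the description, in order."""
    return re.findall(r'"([^"]+)"', description)


def generic_triggers(description: str, domain_words=DOMAIN_WORDS) -> list[str]:
    """Return the quoted trigger phrases that contain no domain word."""
    flagged = []
    for phrase in quoted_phrases(description):
        words = set(re.findall(r"[a-z]+", phrase.lower()))
        if not words & domain_words:
            flagged.append(phrase)
    return flagged


if __name__ == "__main__":
    sample = (
        'Use when the user asks to "write a unit test", '
        '"write a unit test for a voice agent", '
        '"build a regression test", or "IVR navigation test".'
    )
    for phrase in generic_triggers(sample):
        print(f"generic trigger, consider qualifying: {phrase!r}")
```

Phrases flagged this way are candidates for qualification (for example, appending "for a voice agent"), which is the change the review recommends to reduce overlap with general testing skills.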
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names the domain (evaluator/test creation for voice agents) and lists several actions like 'individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions', but these are listed as categories rather than concrete actionable verbs. It's more of a topic list than specific capabilities. | 2 / 3 |
| Completeness | The description explicitly answers both 'when' (via the extensive 'Use when the user asks to...' clause with many trigger phrases) and 'what' (covers evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions, and best practices for various test types). Both dimensions are clearly addressed. | 3 / 3 |
| Trigger Term Quality | The description includes an extensive list of natural trigger phrases users would say, such as 'create an evaluator', 'test my agent', 'build eval coverage', 'write a unit test for a voice agent', 'IVR navigation test', 'run evals', and many more variations. This provides excellent coverage of how users would naturally phrase requests. | 3 / 3 |
| Distinctiveness Conflict Risk | While the description is specific to evaluator/test creation for voice agents and IVR flows, terms like 'write a unit test', 'build a regression test', and 'design a test case' are generic enough to potentially conflict with general testing skills. The voice agent and IVR-specific terms help distinguish it, but the overlap risk with general test-writing skills is notable. | 2 / 3 |
| Total | 10 / 12 Passed | |
Implementation
70%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a highly actionable and well-structured skill with excellent progressive disclosure and clear multi-step workflows with validation checkpoints. Its primary weakness is significant verbosity — the inline content is roughly 3x longer than necessary, with repeated guidance across sections (personality advice, mode selection, tool strategy), extensive decision tables that belong in reference files, and sections like the Pre-Creation Checkpoint that largely restate earlier content. The token cost is substantial for what could be a much leaner overview.
Suggestions
Move the 15-row 'Concrete examples (which mode for which scenario)' table and the personality distribution table to a reference file, keeping only the decision rules and 2-3 inline examples.
Collapse the Pre-Creation Checkpoint section to a numbered checklist of items to confirm, removing the explanatory paragraphs that restate guidance from earlier sections (tool strategy, personality selection, authoring mode).
Remove the 'Why This Matters' subsection under Pre-Creation Checkpoint — it explains consequences Claude can infer, and the checkpoint list itself is sufficient.
Deduplicate personality guidance: the 'Picking the Right Personality' subsection, the checkpoint's personality item, and the 'Checking Available Personalities' subsection all cover overlapping ground — consolidate into one concise block.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | This skill is extremely verbose at ~600+ lines. It over-explains concepts like what test profiles are, repeats guidance across sections (e.g., personality selection advice appears in multiple places), includes extensive tables that could be in reference files, and provides lengthy mode-selection decision trees inline. The 'Choosing Authoring Mode' section alone has a 15-row table that belongs in a reference file. The Pre-Creation Checkpoint section largely restates decisions already covered earlier. | 1 / 3 |
| Actionability | The skill provides fully concrete, executable guidance: specific API endpoints with HTTP methods, complete JSON payload examples with all required fields, exact field names and types, specific personality IDs (693, 362), concrete tag syntax with constraints, and copy-paste-ready payload skeletons. The worked examples for conditional actions and instructions are immediately usable. | 3 / 3 |
| Workflow Clarity | The 10-step 'Eval Design Workflow' is clearly sequenced with explicit validation checkpoints (Pre-Creation Checkpoint at step 4, review artifacts at step 6, validation checklist at step 7 of conditional actions authoring). The conditional actions authoring sequence has 7 ordered steps with a terminal validation checklist. Feedback loops are present (run → review transcripts → iterate, generate → check partial completion → regenerate remainder). | 3 / 3 |
| Progressive Disclosure | The skill has a clear overview structure with well-signaled one-level-deep references to 7 reference files and 3 example files. References are contextually placed (e.g., 'See references/conditional-actions.md for the full rule and worked examples') rather than dumped at the end. The Additional Resources section provides a clean index. Content is appropriately split between inline essentials and reference details. | 3 / 3 |
| Total | 10 / 12 Passed | |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (527 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |
24ad1d0