cekura-eval-design

Use when the user asks to "create an evaluator", "create evals", "create a scenario", "write a test scenario", "design a test case", "test my agent", "build eval coverage", "plan a test suite", "create red team tests", "set up test profiles", "configure conditional actions", "write a conditional action evaluator", "build a deterministic test", "design an IVR test", "IVR navigation test", "write a unit test for a voice agent", "build a regression test", "scripted scenario", "scripted voice test", "structured evaluator", "exact flow test", "sequential conditions", "fixed sequence test", or "run evals". Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions (deterministic / unit test / regression / IVR navigation flows), and best practices for workflow / red-team / edge-case / deterministic test types.

64

Quality

76%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./cekura/skills/cekura-eval-design/SKILL.md

Quality

Discovery

82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description excels at trigger term coverage with an exhaustive list of user phrases that would activate the skill, and it clearly addresses both 'what' and 'when'. However, the capability description reads more like a category list than concrete actions, and some trigger terms are generic enough to potentially conflict with general testing skills. The description is also quite verbose and front-loads a massive trigger list before explaining what the skill actually does.

Suggestions

Restructure to lead with concrete capability verbs (e.g., 'Designs evaluators, builds test suites, configures conditional actions for voice agents and IVR flows') before the trigger phrase list to improve specificity.

Qualify the remaining generic trigger terms, such as 'build a regression test' and 'design a test case', with the voice-agent domain (as 'write a unit test for a voice agent' already does) to reduce the risk of conflict with general testing skills.
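The qualification suggestion above can be sanity-checked with a small script. This is a minimal sketch, not part of Tessl or the skill: the helper `unqualified_triggers` and the qualifier list are hypothetical, chosen only for illustration, and the trigger phrases are a sample copied from the skill description.

```python
# Hypothetical helper: flag trigger phrases that carry no domain qualifier.
# The qualifier list is an assumption made for this illustration.
DOMAIN_QUALIFIERS = ("voice", "ivr", "agent", "eval")


def unqualified_triggers(triggers, qualifiers=DOMAIN_QUALIFIERS):
    """Return the triggers that mention none of the domain qualifiers."""
    flagged = []
    for phrase in triggers:
        lowered = phrase.lower()
        if not any(q in lowered for q in qualifiers):
            flagged.append(phrase)
    return flagged


# A sample of trigger phrases taken from the skill description.
triggers = [
    "create an evaluator",
    "design a test case",
    "build a regression test",
    "write a unit test for a voice agent",
    "IVR navigation test",
    "build a deterministic test",
]

generic = unqualified_triggers(triggers)
print(generic)
# ['design a test case', 'build a regression test', 'build a deterministic test']
```

The flagged phrases are the ones most likely to collide with general testing skills. A plain substring match is crude, so treat the result as a prompt for manual review rather than a verdict.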

Dimension | Reasoning | Score

Specificity

The description names the domain (evaluator/test creation for voice agents) and lists several actions like 'individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions', but these are listed as categories rather than concrete actionable verbs. It's more of a topic list than specific capabilities.

2 / 3

Completeness

The description explicitly answers both 'when' (via the extensive 'Use when the user asks to...' clause with many trigger phrases) and 'what' (covers evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions, and best practices for various test types). Both dimensions are clearly addressed.

3 / 3

Trigger Term Quality

The description includes an extensive list of natural trigger phrases users would say, such as 'create an evaluator', 'test my agent', 'build eval coverage', 'write a unit test for a voice agent', 'IVR navigation test', 'run evals', and many more variations. This provides excellent coverage of how users would naturally phrase requests.

3 / 3

Distinctiveness Conflict Risk

While the description is specific to evaluator/test creation for voice agents and IVR flows, unqualified terms like 'build a regression test', 'build a deterministic test', and 'design a test case' are generic enough to potentially conflict with general testing skills. The voice agent and IVR-specific terms help distinguish it, but the overlap risk with general test-writing skills is notable.

2 / 3

Total

10 / 12

Passed

Implementation

70%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a highly actionable and well-structured skill with excellent progressive disclosure and clear multi-step workflows with validation checkpoints. Its primary weakness is significant verbosity — the inline content is roughly 3x longer than necessary, with repeated guidance across sections (personality advice, mode selection, tool strategy), extensive decision tables that belong in reference files, and sections like the Pre-Creation Checkpoint that largely restate earlier content. The token cost is substantial for what could be a much leaner overview.

Suggestions

Move the 15-row 'Concrete examples (which mode for which scenario)' table and the personality distribution table to a reference file, keeping only the decision rules and 2-3 inline examples.

Collapse the Pre-Creation Checkpoint section to a numbered checklist of items to confirm, removing the explanatory paragraphs that restate guidance from earlier sections (tool strategy, personality selection, authoring mode).

Remove the 'Why This Matters' subsection under Pre-Creation Checkpoint — it explains consequences Claude can infer, and the checkpoint list itself is sufficient.

Deduplicate personality guidance: the 'Picking the Right Personality' subsection, the checkpoint's personality item, and the 'Checking Available Personalities' subsection all cover overlapping ground — consolidate into one concise block.

Dimension | Reasoning | Score

Conciseness

This skill is extremely verbose at over 500 lines. It over-explains concepts like what test profiles are, repeats guidance across sections (e.g., personality selection advice appears in multiple places), includes extensive tables that could be in reference files, and provides lengthy mode-selection decision trees inline. The 'Choosing Authoring Mode' section alone has a 15-row table that belongs in a reference file. The Pre-Creation Checkpoint section largely restates decisions already covered earlier.

1 / 3

Actionability

The skill provides fully concrete, executable guidance: specific API endpoints with HTTP methods, complete JSON payload examples with all required fields, exact field names and types, specific personality IDs (693, 362), concrete tag syntax with constraints, and copy-paste-ready payload skeletons. The worked examples for conditional actions and instructions are immediately usable.

3 / 3

Workflow Clarity

The 10-step 'Eval Design Workflow' is clearly sequenced with explicit validation checkpoints (Pre-Creation Checkpoint at step 4, review artifacts at step 6, validation checklist at step 7 of conditional actions authoring). The conditional actions authoring sequence has 7 ordered steps with a terminal validation checklist. Feedback loops are present (run → review transcripts → iterate, generate → check partial completion → regenerate remainder).

3 / 3

Progressive Disclosure

The skill has a clear overview structure with well-signaled one-level-deep references to 7 reference files and 3 example files. References are contextually placed (e.g., 'See references/conditional-actions.md for the full rule and worked examples') rather than dumped at the end. The Additional Resources section provides a clean index. Content is appropriately split between inline essentials and reference details.

3 / 3

Total

10 / 12

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

Criteria | Description | Result

skill_md_line_count

SKILL.md is long (527 lines); consider splitting into references/ and linking

Warning

Total

10 / 11

Passed
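The `skill_md_line_count` warning can be approximated locally before a review is run. The sketch below is illustrative only: the 500-line threshold and the helper names are assumptions, since this page does not state the validator's actual limit, and the demo writes a temporary 527-line file instead of reading the real SKILL.md.

```python
import tempfile
from pathlib import Path

# Assumed threshold for illustration; the validator's real limit is not stated here.
MAX_LINES = 500


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def line_count_status(n_lines, max_lines=MAX_LINES):
    """Return 'Warning' when the file exceeds the assumed threshold, else 'Passed'."""
    return "Warning" if n_lines > max_lines else "Passed"


# Demo on a temporary stand-in file with the line count reported above.
with tempfile.TemporaryDirectory() as tmp:
    skill_path = Path(tmp) / "SKILL.md"
    skill_path.write_text("line\n" * 527, encoding="utf-8")
    n = count_lines(skill_path)
    status = line_count_status(n)

print(n, status)  # 527 Warning
```

To check a real skill, point `count_lines` at the actual SKILL.md path. Moving long tables into references/ files, as the warning suggests, is what brings the count down.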

Repository: cekura-ai/cekura-skills
Reviewed
