cekura-eval-design

Use when the user asks to "create an evaluator", "create evals", "create a scenario", "write a test scenario", "design a test case", "test my agent", "build eval coverage", "plan a test suite", "create red team tests", "set up test profiles", "configure conditional actions", "write a conditional action evaluator", "build a deterministic test", "design an IVR test", "IVR navigation test", "write a unit test for a voice agent", "build a regression test", "scripted scenario", "scripted voice test", "structured evaluator", "exact flow test", "sequential conditions", "fixed sequence test", or "run evals". Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions (deterministic / unit test / regression / IVR navigation flows), and best practices for workflow / red-team / edge-case / deterministic test types.

64

Quality

76%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./cekura/skills/cekura-eval-design/SKILL.md

Quality

Content

70%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a comprehensive, highly actionable skill for a complex domain (AI voice agent test design). Its greatest strength is actionability — concrete API payloads, specific field values, worked examples, and clear decision trees. Its primary weakness is extreme verbosity: the document is far too long, with repeated explanations, exhaustive trigger-phrase lists, and extensive good/bad comparisons that could be condensed significantly without losing clarity. The workflow structure and progressive disclosure to reference files are well-designed.

Suggestions

Condense the 'Choosing Authoring Mode' section dramatically — the prose explanation, the trigger phrase lists, and the 15-row examples table all convey the same decision logic. Replace with a compact decision tree or flowchart-style format.

Cut the 'Common Instruction Mistakes' section by ~60% — each mistake can be a single line (mistake → fix) rather than a multi-sentence explanation. Claude understands why hardcoding data is bad without a paragraph explaining it.

Remove or drastically shorten the 'Why This Matters' subsection under Pre-Creation Checkpoint — listing 6 consequences of skipping checkpoints is unnecessary padding when the checkpoint itself is already well-defined.

Dimension | Reasoning | Score

Conciseness

This skill is extremely verbose at over 500 lines. While it covers a complex domain, it includes extensive tables of trigger phrases, repeated cross-references, lengthy good-versus-bad examples, and detailed explanations that could be condensed significantly. Several sections restate the same concept: the authoring mode decision logic, for instance, appears in prose, then in a table, then in examples. The 'Common Instruction Mistakes' section explains obvious anti-patterns at length; Claude doesn't need four sentences on why filler steps are bad.

1 / 3

Actionability

The skill provides highly actionable guidance: concrete API endpoints with full payload schemas, executable JSON examples, specific field names and values, exact tool IDs (TOOL_END_CALL, TOOL_DTMF), step-by-step workflows, and copy-paste-ready payload skeletons. The conditional actions worked example is complete and directly usable.

3 / 3

Workflow Clarity

The 10-step 'Eval Design Workflow' is clearly sequenced with explicit validation checkpoints (Pre-Creation Checkpoint, post-generation review steps, validation checklist reference). The authoring sequence for conditional actions has 7 ordered steps and an explicit warning that 'skipping any of them is the most common cause of avoidable rework'. Feedback loops are present (run → review → iterate, generate → check → patch → supplement).

3 / 3

Progressive Disclosure

The skill has excellent progressive disclosure, with a clear overview in the main file and well-signaled one-level-deep references to 7 reference files and 3 example files. References are contextually placed (e.g., 'See references/conditional-actions.md for the full rule and worked examples') and the Additional Resources section provides a clean index. However, the bundle files were not provided, so the accuracy of the references could not be verified.

3 / 3

Total: 10 / 12

Passed

Description

82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at providing explicit trigger guidance with an extensive list of user phrases, ensuring Claude knows when to select this skill. However, the 'what it does' portion is more of a category listing than concrete action descriptions, and some trigger terms are generic enough to risk overlap with general testing skills. The description is also quite long and front-loads trigger terms rather than leading with a clear capability summary.

Suggestions

Lead with a concise capability summary using active verbs (e.g., 'Designs evaluators and test scenarios for voice agents, generates conditional action configurations, and plans test suite coverage') before the trigger term list.

Add domain-specific qualifiers to generic trigger terms to reduce conflict risk (e.g., instead of 'write a unit test' use context like 'for voice agents' or 'for conversational flows').
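
The second suggestion can be checked mechanically. The following is a minimal sketch, not part of the reviewed skill or of Tessl's tooling: it flags trigger phrases that contain no domain-specific qualifier. The function name `flag_generic_triggers` and the qualifier list are hypothetical, chosen for illustration; the sample phrases are taken from the skill description.

```python
# Hypothetical qualifier list (an assumption, not from the skill); adjust for your domain.
DOMAIN_QUALIFIERS = ("voice", "ivr", "agent", "eval")


def flag_generic_triggers(phrases, qualifiers=DOMAIN_QUALIFIERS):
    """Return the phrases that contain none of the domain qualifiers.

    Matching is a case-insensitive substring check, so "eval" also
    matches "evals" and "evaluator".
    """
    flagged = []
    for phrase in phrases:
        lowered = phrase.lower()
        if not any(q in lowered for q in qualifiers):
            flagged.append(phrase)
    return flagged


# Sample trigger phrases copied from the skill description.
triggers = [
    "write a unit test for a voice agent",
    "design an IVR test",
    "design a test case",
    "build a regression test",
    "run evals",
]

generic = flag_generic_triggers(triggers)
print(generic)
```

With this qualifier list, "design a test case" and "build a regression test" are flagged, which matches the phrases the review singles out as overlapping with general testing skills.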

Dimension | Reasoning | Score

Specificity

The description names the domain (evaluator/test creation for voice agents) and lists several actions like 'individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions,' but these read more like a category list than concrete specific actions. It lacks verb-driven specificity like 'creates X', 'generates Y'.

2 / 3

Completeness

The description explicitly answers both 'when' (via the extensive 'Use when the user asks to...' clause with many trigger phrases) and 'what' (covers evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions, and various test types). Both dimensions are clearly addressed.

3 / 3

Trigger Term Quality

The description includes an extensive list of natural trigger phrases users would say, such as 'create an evaluator', 'test my agent', 'write a unit test for a voice agent', 'IVR navigation test', 'run evals', 'scripted scenario', and many more variations. This provides excellent coverage of natural user language.

3 / 3

Distinctiveness / Conflict Risk

While the description is specific to evaluator/test creation for voice agents and IVR flows, some trigger terms like 'write a unit test', 'build a regression test', 'design a test case' are generic enough to potentially conflict with general testing skills. The voice agent and IVR-specific terms help distinguish it, but the overlap risk with general test-writing skills is notable.

2 / 3

Total: 10 / 12

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

Criteria | Description | Result

skill_md_line_count | SKILL.md is long (502 lines); consider splitting into references/ and linking | Warning

Total: 10 / 11

Passed
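
A check like skill_md_line_count can be approximated locally before submitting a skill for review. This is a rough sketch under stated assumptions: the review only reports that 502 lines counts as "long", so the 500-line threshold (`MAX_LINES`) is a guess, and `line_count_warning` is a hypothetical helper, not the validator's actual implementation.

```python
from pathlib import Path

# Assumed threshold: the review does not state the real limit,
# only that a 502-line SKILL.md triggers the warning.
MAX_LINES = 500


def line_count_warning(text, max_lines=MAX_LINES):
    """Return a warning string if the text exceeds max_lines, else None."""
    n = len(text.splitlines())
    if n > max_lines:
        return (
            f"SKILL.md is long ({n} lines); "
            "consider splitting into references/ and linking"
        )
    return None


if __name__ == "__main__":
    # Path taken from the optimize command shown on this page.
    skill_path = Path("./cekura/skills/cekura-eval-design/SKILL.md")
    if skill_path.exists():
        warning = line_count_warning(skill_path.read_text(encoding="utf-8"))
        print(warning or "Line count OK")
```

Running it from the repository root prints the warning when the file is over the assumed limit, which gives a quick signal for when to move detail into references/ files.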

Repository
cekura-ai/cekura-skills
Reviewed
