cekura-eval-design

Use when the user asks to "create an evaluator", "create evals", "create a scenario", "write a test scenario", "design a test case", "test my agent", "build eval coverage", "plan a test suite", "create red team tests", "set up test profiles", "configure conditional actions", "write a conditional action evaluator", "build a deterministic test", "design an IVR test", "IVR navigation test", "write a unit test for a voice agent", "build a regression test", "scripted scenario", "scripted voice test", "structured evaluator", "exact flow test", "sequential conditions", "fixed sequence test", or "run evals". Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions (deterministic / unit test / regression / IVR navigation flows), and best practices for workflow / red-team / edge-case / deterministic test types.

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable with concrete endpoints, schemas, and worked payloads, and its workflows carry explicit validation checkpoints and feedback loops. Its main weaknesses are mild internal repetition that bloats an already long document and three cited example files that do not exist in the bundle.

Suggestions

Create the missing examples/ directory with csv-eval-creation.md, workflow-eval.md, and red-team-eval.md, or remove those entries from 'Additional Resources' to eliminate dangling references.

De-duplicate the conditional-actions trigger phrases and the authoring-mode guidance so they appear authoritatively once (in 'Choosing Authoring Mode') and the 'Pre-Creation Checkpoint' references rather than restates them.

Tighten the 'Concrete examples (which mode for which scenario)' table or fold its most load-bearing rows into the prose to reduce length without losing the mode-selection guidance.
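The dangling-reference problem in the first suggestion can be caught mechanically before a review. The sketch below is illustrative only: the function name `find_dangling_refs` and the regular expression are assumptions made for this example, not part of the skill or of any reviewer tooling. It scans the SKILL.md text for relative `examples/*.md` and `references/*.md` paths and reports those that are missing on disk.

```python
import re
from pathlib import Path

# Hypothetical pattern: relative paths such as examples/workflow-eval.md
# or references/some-file.md cited anywhere in the SKILL.md body.
REF_PATTERN = re.compile(r"\b((?:examples|references)/[\w.-]+\.md)\b")


def find_dangling_refs(skill_text: str, skill_dir: Path) -> list[str]:
    """Return cited paths that do not exist under skill_dir (sorted, de-duplicated)."""
    cited = set(REF_PATTERN.findall(skill_text))
    return sorted(p for p in cited if not (skill_dir / p).is_file())


if __name__ == "__main__":
    # Assumes the script is run from the skill's root directory.
    root = Path(".")
    skill_file = root / "SKILL.md"
    if skill_file.is_file():
        for missing in find_dangling_refs(skill_file.read_text(encoding="utf-8"), root):
            print(f"dangling reference: {missing}")
```

Run against the bundle described in this review, a check of this kind would be expected to list the three examples/ files and none of the references/ files, since the review states that all six cited references exist.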

Dimension	Reasoning	Score
Conciseness	The body is domain-dense and assumes Claude's intelligence (no generic API/TTS primers), but at ~480 lines it repeats material: the conditional-actions trigger phrase list appears both in 'Choosing Authoring Mode' and again in the 'Pre-Creation Checkpoint' authoring-mode item, and the worked examples table largely restates the surrounding prose. Matches the score-2 'mostly efficient but could be tightened'; not score 3 because the duplication is a genuine tightening opportunity, and not score 1 because most tokens convey non-obvious Cekura specifics.	2 / 3
Actionability	Provides exact endpoints ('POST /test_framework/v1/scenarios/generate-bg/'), full schema field tables, copy-paste JSON payloads, concrete tool IDs ('TOOL_END_CALL', 'TOOL_DTMF'), template syntax ('{{test_profile.field_name}}'), and a worked verification-flow example. Matches the score-3 anchor for fully executable, copy-paste-ready guidance.	3 / 3
Workflow Clarity	The 10-step 'Eval Design Workflow' and the 7-step conditional-actions authoring sequence are clearly numbered, with explicit validation steps ('Run the validation checklist') and a 'Pre-Creation Checkpoint'. Feedback loops are present for risky/batch ops (partial generation -> smaller batch; validate -> fix -> re-validate). Matches the score-3 anchor for clear sequence with explicit validation and error-recovery loops.	3 / 3
Progressive Disclosure	The SKILL.md is a well-signaled overview pointing one level deep to real reference files (all six cited references/*.md exist, mapped in 'Additional Resources'), but the body also cites examples/csv-eval-creation.md, examples/workflow-eval.md, and examples/red-team-eval.md in a directory that does not exist. Matches the score-2 anchor of structure with an organization defect; not score 3 because the dangling examples/ references break navigation, and not score 1 because references are single-level and clearly signaled rather than deeply nested.	2 / 3
	Total	10 / 12 Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is a strong, trigger-rich statement of both what the skill covers and when to use it, written in third person. It names many concrete actions and natural user phrasings with a clear, non-overlapping niche.

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions such as 'create an evaluator', 'design a test case', 'configure conditional actions', 'build a deterministic test', and 'design an IVR test', plus domain coverage of 'evaluator design, suite coverage strategy, test profiles, mock-tool data design'. Matches the score-3 anchor for several specific concrete actions; not score 2 because the action list is comprehensive rather than partial.	3 / 3
Completeness	Explicitly answers 'what' ('Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions...') and 'when' via the 'Use when the user asks to...' trigger clause. Matches the score-3 anchor answering both what AND when with explicit triggers.	3 / 3
Trigger Term Quality	Packed with natural phrasings users would say — 'create evals', 'test my agent', 'build eval coverage', 'plan a test suite', 'create red team tests', 'IVR navigation test', 'run evals'. Matches the score-3 anchor for good coverage of natural terms; not score 2 because both common and varied phrasings are present.	3 / 3
Distinctiveness / Conflict Risk	Occupies a clear Cekura-eval-design niche with highly specific triggers (conditional actions, IVR navigation, deterministic/regression tests) and explicit boundaries against sibling skills. Matches the score-3 anchor for a distinct niche unlikely to conflict.	3 / 3
	Total	12 / 12 Passed

Validation

93%


Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
skill_md_line_count	SKILL.md is long (502 lines); consider splitting into references/ and linking	Warning
	Total	15 / 16 Passed
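The skill_md_line_count warning can be reproduced locally with a simple length check. This is a minimal sketch: the function name `check_line_count` is hypothetical, and the default limit of 500 lines is an assumption chosen for illustration, since the review does not state the validator's actual threshold.

```python
def check_line_count(text: str, limit: int = 500) -> str:
    """Classify a SKILL.md body by line count.

    The limit is an assumed value, not the validator's documented threshold.
    """
    n = len(text.splitlines())
    if n > limit:
        return f"Warning: {n} lines (limit {limit}); consider splitting into references/"
    return f"Passed: {n} lines"
```

Running a check like this after moving repeated material into references/ gives quick feedback on whether the file is likely to clear the warning.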

Repository: cekura-ai/cekura-skills
Commit: f0854af

Reviewed: about 23 hours ago
