
cekura-metric-design

Use when the user asks to "create a metric", "write a metric", "design a metric", "build a metric for", "evaluate agent performance", "measure call quality", "track a KPI", "add a workflow metric", "improve my metric", "fix a metric", "debug metric results", "set up quality scoring", or "what metrics do I need". Also relevant when discussing LLM judge prompts, custom code metrics, evaluation triggers, VALID_SKIP patterns, section extraction, or metric best practices for Cekura voice AI agents. Covers both creating new metrics and reviewing, iterating on, or troubleshooting existing ones.

81

1.38x
Quality: 71%. Does it follow best practices?

Impact: 98% (1.38x). Average score across 3 eval scenarios.
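The page does not state how these headline figures are derived. The sketch below only checks two readings that are consistent with the numbers shown, and both are assumptions: that the Quality percentage is the mean of the Content (70%) and Description (72%) scores reported further down, and that the "1.38x" multiplier is the ratio of the with-skill eval score to a without-skill baseline. The function names are hypothetical and are not part of any Tessl tooling.

```python
def mean_percent(scores):
    """Arithmetic mean of percentage scores, rounded to a whole percent."""
    return round(sum(scores) / len(scores))


def implied_baseline(with_skill_percent, multiplier):
    """Baseline percent implied by a with-skill score and an 'Nx' multiplier.

    Assumes multiplier = with_skill / baseline, which the page does not confirm.
    """
    return with_skill_percent / multiplier


# Content and Description percentages as reported on this page.
quality = mean_percent([70, 72])

# Impact score and multiplier as reported on this page.
baseline = implied_baseline(98, 1.38)

print(f"quality (assumed mean of sub-scores): {quality}%")
print(f"implied baseline (assumed ratio reading): {baseline:.1f}%")
```

Under these assumptions the mean reproduces the 71% Quality figure, and the implied baseline also comes out at roughly 71%; neither reading is confirmed by the source.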

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./cekura/skills/cekura-metric-design/SKILL.md

Quality

Content

70%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, domain-specific skill that covers a complex topic (Cekura metric design) with clear workflows, good progressive disclosure, and practical patterns. Its main weakness is that actionable, copy-paste-ready examples are deferred to reference files rather than included inline, and some sections could be tightened for conciseness. The workflow clarity is strong with explicit validation steps, cost guards, and iteration loops.

Suggestions

Include at least one complete, minimal llm_judge metric creation example inline (with the actual description field content and API call) so the skill is actionable even without loading reference files.

Tighten the 'Core Terminology' and 'Metric Types' sections by removing explanatory prose Claude can infer (e.g., 'Custom_code seems appealing for objective checks but is brittle in practice' — just state the rule).

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is comprehensive but includes some unnecessary explanation (e.g., explaining what metrics are; the 'Spirit vs Letter' concept is well explained but verbose). Some sections, such as 'Core Terminology', explain things Claude could infer. However, most of the content is domain-specific knowledge Claude wouldn't have, so the verbosity is moderate rather than severe. | 2 / 3 |
| Actionability | The skill provides good conceptual guidance with specific patterns (trigger templates, prompt structures, VALID_SKIP pattern) and concrete tables, but lacks executable code examples or copy-paste-ready metric definitions inline. It defers most concrete examples to reference files (prompt-patterns.md, examples/), which aren't provided. The trigger prompt template is one of the few concrete, usable artifacts. | 2 / 3 |
| Workflow Clarity | The metric creation workflow is clearly sequenced (6 steps) with explicit validation checkpoints (step 5: deploy and test, step 6: iterate). The 'Manual Fix First, Then Labs' section provides a clear feedback loop with specific sample sizes. The two-layer N/A strategy and cost guard (>100 calls confirmation) serve as validation checkpoints. The two-step activation requirement is explicitly called out to prevent silent failures. | 3 / 3 |
| Progressive Disclosure | Excellent progressive disclosure structure. The SKILL.md serves as a comprehensive overview with well-signaled, one-level-deep references to specific files: references/prompt-patterns.md, references/advanced-patterns.md, references/pythonic-patterns.md, references/api-reference.md, plus four example files. Each reference is contextually placed where the reader would need it, with clear descriptions of what each contains. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Description

72%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at trigger term coverage and distinctiveness, providing an extensive list of natural user phrases and domain-specific terminology for Cekura voice AI agent metrics. However, it is heavily skewed toward 'when to use' at the expense of clearly explaining 'what it does' — the actual capabilities and outputs of the skill are only vaguely referenced. The description reads more like a trigger list than a balanced skill description.

Suggestions

Add a clear opening sentence describing what the skill concretely does, e.g., 'Designs, writes, and debugs quality metrics for Cekura voice AI agents, including LLM judge prompts, custom code metrics, and evaluation scoring configurations.'

Restructure to lead with specific capabilities (what it produces/outputs) before the 'Use when...' trigger list to ensure the 'what' is as strong as the 'when'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description mentions some actions like 'creating new metrics', 'reviewing', 'iterating on', 'troubleshooting existing ones', and references domain concepts like 'LLM judge prompts', 'custom code metrics', 'evaluation triggers', 'VALID_SKIP patterns', 'section extraction'. However, it doesn't clearly list the concrete actions the skill performs; it is more focused on trigger terms than on describing capabilities. | 2 / 3 |
| Completeness | The 'when' is extremely well covered with explicit trigger phrases. However, the 'what does this do' is weak: the description never clearly states what the skill actually does or produces. It lists when to use it but not what concrete outputs or actions it performs beyond vague references to 'creating new metrics' and 'troubleshooting existing ones'. | 2 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'create a metric', 'write a metric', 'evaluate agent performance', 'measure call quality', 'track a KPI', 'fix a metric', 'debug metric results', 'set up quality scoring', 'what metrics do I need'. These are highly natural phrases a user would actually type. | 3 / 3 |
| Distinctiveness Conflict Risk | The description is highly specific to a clear niche: metrics for Cekura voice AI agents, including domain-specific concepts like 'VALID_SKIP patterns', 'LLM judge prompts', and 'section extraction'. This is very unlikely to conflict with other skills. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
cekura-ai/cekura-skills