Use when the user asks to "improve a metric", "run labs", "leave feedback on a metric", "add to labs", "fix metric accuracy", "review metric results", "find misaligned metrics", or "iterate on metric quality". Covers the metric improvement cycle, the feedback workflow, and the labs pipeline used to refine metric accuracy over time.
Guide the metric improvement cycle: identify misaligned metric results, leave structured feedback, run the labs improvement pipeline, and validate changes. This workflow transforms metric quality from initial draft to production-ready through systematic iteration.
When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing. Labs is for nuanced edge cases, not systemic prompt design flaws.
For edge case refinement after manual fixes are validated:
Review recent call evaluations to find suspicious results:
List recent calls (with agent filters) and retrieve specific calls to get evaluation results — see "API Endpoints Reference" below.
Look for:
To systematically find misalignment:
evaluate_calls or list_calls
Use the mark_metric_vote endpoint to leave structured feedback. First retrieve the call to find the metric result, then POST the feedback (see "API Endpoints Reference" below).
Collect at least 6 feedback instances before running auto-improve. This gives labs enough signal to identify patterns in the feedback and make meaningful prompt adjustments.
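As a rough sketch of the threshold check above, the helper below counts feedback per metric from a locally kept log and reports which metrics have enough entries to run auto-improve. The record shape (a `metric_id` key) and the function name are hypothetical and used only for illustration; only the minimum of 6 comes from this guide.

```python
from collections import Counter

MIN_FEEDBACK = 6  # threshold stated in this guide


def metrics_ready_for_labs(feedback_records, minimum=MIN_FEEDBACK):
    """Return metric IDs that have at least `minimum` feedback entries.

    `feedback_records` is a hypothetical local log: one dict per piece of
    feedback left, each carrying a "metric_id" key.
    """
    counts = Counter(record["metric_id"] for record in feedback_records)
    return sorted(metric_id for metric_id, n in counts.items() if n >= minimum)
```

Keeping the log locally is only one option; the same count could come from wherever feedback is already tracked.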
Track feedback progress:
Once 6+ feedback instances are accumulated:
Trigger auto-improvement via POST /test_framework/metric-reviews/process_feedbacks/ with the metric ID.
Labs analyzes the feedback and suggests changes to the metric prompt. Review the suggested changes carefully:
Each evaluation costs the client real money. Before triggering any bulk evaluation:
page_size=1 and read the response count)
Use the page_size parameter (up to 200) instead of paginating through multiple pages. Use server-side filters (agent_id, project, timestamp__gte/timestamp__lte) to scope calls before evaluating.
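The cost check above can be sketched as a pair of URL builders plus a simple estimate. This is a minimal sketch, not the platform's client: the function names are hypothetical, the list path is taken from the endpoint reference in this guide, and the estimate assumes one billable evaluation per metric per call, which may not match actual pricing.

```python
from urllib.parse import urlencode

CALL_LOGS_PATH = "/observability/v1/call-logs-external/"
MAX_PAGE_SIZE = 200  # upper bound stated in this guide


def call_logs_url(filters, page_size):
    """Build a call-log listing URL (path only) with server-side filters.

    `filters` holds names such as agent_id, project, timestamp__gte and
    timestamp__lte; entries set to None are dropped.
    """
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    params = {key: value for key, value in filters.items() if value is not None}
    params["page_size"] = page_size
    return f"{CALL_LOGS_PATH}?{urlencode(params)}"


def count_probe_url(filters):
    """URL for a cheap probe: request one row, then read `count` in the response."""
    return call_logs_url(filters, page_size=1)


def estimated_evaluations(call_count, metric_count):
    """Assumed upper bound: every metric runs once on every call."""
    return call_count * metric_count
```

For example, a probe that reports 150 matching calls with 3 metrics selected suggests up to 450 evaluations, a number worth confirming with the user before triggering the run.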
Re-run the improved metric on the same calls that had misaligned results:
Use POST /observability/v1/call-logs/rerun_evaluation/ with the call IDs and metric ID.
Check:
If validation fails, leave additional feedback and iterate.
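One way to decide whether validation passed is to compare the re-run results against the outcomes reviewers marked as correct. The sketch below assumes both are available as plain dictionaries keyed by call ID; that shape, the outcome labels, and the function name are hypothetical and not part of the platform's response format.

```python
def still_misaligned(expected, rerun_results):
    """Return call IDs whose re-run result still disagrees with the reviewer.

    `expected` maps call ID -> the outcome the reviewer said was correct.
    `rerun_results` maps call ID -> the outcome from the improved metric.
    Calls missing from the re-run are treated as unresolved.
    """
    return sorted(
        call_id
        for call_id, wanted in expected.items()
        if call_id not in rerun_results or rerun_results[call_id] != wanted
    )
```

An empty list means every previously misaligned call now matches the reviewer; any IDs returned are candidates for another round of feedback.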
Once the metric prompt is validated through labs, consider converting to a Pythonic custom_code metric for production:
description field
description (the prompt) and custom_code (the Python wrapper)
This gives the benefit of the labs-refined prompt with the performance advantage of targeted context extraction.
When the user wants to simulate the labs workflow interactively:
After improving a metric, the user typically needs:
| Endpoint | Purpose |
|---|---|
| GET /observability/v1/call-logs-external/?agent=ID | List calls |
| GET /observability/v1/call-logs-external/{id}/ | Get call details + evaluation results |
| POST /observability/v1/call-logs-external/{id}/mark_metric_vote/ | Leave feedback |
| POST /test_framework/metric-reviews/process_feedbacks/ | Run labs auto-improve (see below) |
| GET /test_framework/metric-reviews/process_feedbacks_progress/ | Poll improvement progress |
| POST /observability/v1/call-logs/evaluate_metrics/ | Evaluate specific metrics on calls |
| POST /observability/v1/call-logs/rerun_evaluation/ | Re-run evaluation on calls |
| POST /test_framework/test-sets/create_from_call_log/ | Create test set from call log |
POST /test_framework/metric-reviews/process_feedbacks/
{
  "metric_id": 123,
  "test_set_ids": [456, 789]
}
Optional fields: metric_type (default "llm_judge"), skip_evaluation (bool), simplified_prompt (string).
Returns {"progress_id": "<uuid>"}. Poll at GET /test_framework/metric-reviews/process_feedbacks_progress/?progress_id=<uuid>.
The response includes improved description and evaluation_trigger when complete — you must PATCH the metric to apply changes (they are not auto-applied).
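The start, poll, and apply steps described above can be sketched as three small builders. This is a hedged illustration only: the function names are hypothetical, no HTTP client or authentication is shown, and `metric_patch_body` assumes the improved values arrive as top-level `description` and `evaluation_trigger` keys, which the real response may nest differently.

```python
from urllib.parse import urlencode

PROGRESS_PATH = "/test_framework/metric-reviews/process_feedbacks_progress/"
OPTIONAL_FIELDS = {"metric_type", "skip_evaluation", "simplified_prompt"}


def process_feedbacks_body(metric_id, test_set_ids, **optional):
    """Build the POST body, allowing only the optional fields listed above."""
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unsupported optional fields: {sorted(unknown)}")
    return {"metric_id": metric_id, "test_set_ids": list(test_set_ids), **optional}


def progress_url(start_response):
    """Build the polling URL from the {"progress_id": ...} start response."""
    query = urlencode({"progress_id": start_response["progress_id"]})
    return f"{PROGRESS_PATH}?{query}"


def metric_patch_body(progress_response):
    """Pick the fields to PATCH onto the metric (assumed top-level keys)."""
    wanted = ("description", "evaluation_trigger")
    return {key: progress_response[key] for key in wanted if key in progress_response}
```

Because changes are not auto-applied, the PATCH body is built as a separate step so it can be reviewed before it is sent.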
POST /test_framework/test-sets/create_from_call_log/
{
  "call_log_id": 3358270,
  "metrics": [{"metric": 123, "feedback": "The metric incorrectly failed this call because..."}]
}
Note: metrics must be an array of objects [{"metric": <id>, "feedback": "<text>"}], NOT bare metric IDs. Passing bare IDs returns 500.
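To avoid the bare-ID mistake noted above, the body can be built from a mapping of metric ID to feedback text and checked before sending. A minimal sketch, with hypothetical helper names and no HTTP call:

```python
def create_test_set_body(call_log_id, metric_feedback):
    """Build the create_from_call_log body from {metric_id: feedback_text}."""
    metrics = [
        {"metric": metric_id, "feedback": feedback}
        for metric_id, feedback in metric_feedback.items()
    ]
    return {"call_log_id": call_log_id, "metrics": metrics}


def has_valid_metrics_shape(body):
    """True when "metrics" is a list of objects with "metric" and "feedback"."""
    metrics = body.get("metrics")
    return isinstance(metrics, list) and all(
        isinstance(entry, dict) and {"metric", "feedback"}.issubset(entry)
        for entry in metrics
    )
```

Running `has_valid_metrics_shape` on a hand-written body before the POST catches the bare-ID form locally instead of through a 500 response.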
references/feedback-examples.md - Examples of good feedback for different metric types