
online-evals

Attach judges to config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.

62

Quality

72%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Risky

Do not use without reviewing

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/agentcontrol/online-evals/SKILL.md

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a comprehensive, highly actionable skill with excellent workflow sequencing and executable code examples. Its main weakness is length — the full Python class, two SDK examples, and detailed API calls make it verbose for a single file. Splitting the SDK examples and the Python helper class into referenced bundle files would significantly improve token efficiency and progressive disclosure.

Suggestions

Extract the Python AIConfigJudges class and SDK examples into separate bundle files (e.g., judges_helper.py, sdk_auto_eval.py, sdk_direct_eval.py) and reference them from the main skill.

Trim the 'Core Concepts' section — the 'What Are Judges?' explanation and the built-in judges table could be condensed into 2-3 lines since Claude can reference the linked documentation for details.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly long (~350 lines) with some sections that could be tightened. The 'Core Concepts' section explains things like what judges are and how they work, which adds bulk. The Python class implementation is extensive and could be trimmed. However, most content is genuinely informative and not padded with basic concept explanations. | 2 / 3 |
| Actionability | Provides fully executable curl commands, complete Python class implementations, and working SDK examples with proper imports and error handling. The code is copy-paste ready with clear parameter documentation and real API endpoints. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (Step 1: Create judges → Step 2: Attach to variations → Step 3: Set fallthrough) with explicit validation notes like the important callout that the judges array replaces all existing attachments, and the critical note about turnTargetingOn not working. Error handling table and next steps provide good checkpoints. | 3 / 3 |
| Progressive Disclosure | The skill is quite long and monolithic: the full Python class implementation, two complete SDK examples, and the API reference could be split into separate files. References to related skills and external docs are well-signaled at the bottom, but the inline content is heavy for a single SKILL.md with no bundle files to offload to. | 2 / 3 |
| Total | Passed | 10 / 12 |
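
The Workflow Clarity row notes that the judges array replaces all existing attachments. One way to avoid dropping judges by accident is to merge the current list with the new entries before sending the full array. The following is a minimal sketch of that merge step only; the field names `judgeConfigKey` and `samplingRate` are assumptions for illustration and are not confirmed by this review, and no API call is made.

```python
from typing import Any

Judge = dict[str, Any]


def merge_judge_attachments(
    existing: list[Judge],
    new: list[Judge],
    key: str = "judgeConfigKey",  # hypothetical identifier field
) -> list[Judge]:
    """Return the full judges array to send when attaching judges.

    Because the array replaces all existing attachments, the payload
    must carry the judges that are already attached plus the new ones.
    Entries in `new` override entries in `existing` with the same key.
    """
    merged: dict[Any, Judge] = {judge[key]: judge for judge in existing}
    for judge in new:
        merged[judge[key]] = judge
    return list(merged.values())


# Hypothetical data: one judge already attached, two in the update.
current = [{"judgeConfigKey": "accuracy", "samplingRate": 1.0}]
update = [
    {"judgeConfigKey": "relevance", "samplingRate": 0.5},
    {"judgeConfigKey": "accuracy", "samplingRate": 0.25},
]
payload = merge_judge_attachments(current, update)
```

Here `payload` keeps `accuracy` (with its updated sampling rate) and adds `relevance`, so sending it as the complete array would not silently detach the existing judge.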

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong in specificity and distinctiveness, clearly articulating concrete actions within a well-defined niche of LLM-as-a-judge evaluation. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know exactly when to select this skill. The trigger terms are somewhat specialized and could benefit from more natural user-facing keywords.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to set up automated evaluation, add judges to experiments, or configure LLM-as-a-judge scoring.'

Include more natural trigger term variations such as 'evals', 'automated evaluation', 'grading', 'scoring responses', or 'quality monitoring' to improve discoverability.
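
The two suggestions above can be checked mechanically before re-running the review. The sketch below is an illustrative helper, not part of Tessl or its rubric: it looks for a "Use when" clause and reports which of the suggested trigger terms are absent from a description. The function name `review_description` is hypothetical, and plain substring matching is an assumption about what counts as a match.

```python
DESCRIPTION = (
    "Attach judges to config variations for automatic LLM-as-a-judge "
    "evaluation. Create custom judges, configure sampling rates, and "
    "monitor quality scores."
)

# Trigger terms taken from the suggestion above.
SUGGESTED_TERMS = [
    "evals",
    "automated evaluation",
    "grading",
    "scoring responses",
    "quality monitoring",
]


def review_description(description: str, terms: list[str]) -> dict:
    """Report whether a 'Use when' clause exists and which terms are missing.

    Matching is a case-insensitive substring test, so 'evaluation'
    does not satisfy 'evals'.
    """
    lowered = description.lower()
    return {
        "has_use_when": "use when" in lowered,
        "missing_terms": [t for t in terms if t.lower() not in lowered],
    }


report = review_description(DESCRIPTION, SUGGESTED_TERMS)

# The same check after appending the example clause from the first suggestion.
revised = DESCRIPTION + (
    " Use when the user wants to set up automated evaluation, add judges "
    "to experiments, or configure LLM-as-a-judge scoring."
)
revised_report = review_description(revised, SUGGESTED_TERMS)
```

With the current description, the check finds no "Use when" clause and none of the five suggested terms. Appending the example clause adds the clause and covers "automated evaluation", while the other four terms would still need to be worked in.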

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'Attach judges to config variations', 'automatic LLM-as-a-judge evaluation', 'Create custom judges', 'configure sampling rates', 'monitor quality scores'. | 3 / 3 |
| Completeness | Clearly answers 'what does this do' with specific actions, but lacks an explicit 'Use when...' clause or equivalent trigger guidance, which caps this at 2 per the rubric guidelines. | 2 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'judges', 'LLM-as-a-judge', 'sampling rates', 'quality scores', and 'config variations', but misses common user-facing variations like 'evaluation', 'evals', 'grading', 'scoring', or 'automated review'. The terminology is somewhat specialized. | 2 / 3 |
| Distinctiveness Conflict Risk | The description targets a very specific niche (LLM-as-a-judge evaluation with config variations, custom judges, and sampling rates), which is unlikely to conflict with other skills. | 3 / 3 |
| Total | Passed | 10 / 12 |

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
launchdarkly/ai-tooling
Reviewed
