online-evals

Attach judges to config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.

Quality

72%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Risky

Do not use without reviewing

Fix and improve this skill with Tessl

tessl review fix ./skills/agentcontrol/online-evals/SKILL.md

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with clear workflows and executable examples covering both REST API and SDK approaches. Its main weakness is length — the full Python class and two complete SDK examples make it verbose for a single file, and some content could be offloaded to supporting bundle files. The error handling table and important callouts (replace semantics, fallthrough requirement) are valuable additions.

Suggestions

Move the full AIConfigJudges Python class and SDK examples into separate bundle files (e.g., examples/judges_manager.py, examples/auto_eval.py) and reference them from SKILL.md to improve progressive disclosure and reduce token cost.

Trim the 'Core Concepts' section — the 'What Are Judges?' explanation and the restrictions list could be condensed into a few bullet points since Claude can infer most of this from the API examples.

Dimension	Reasoning	Score
Conciseness	The skill is fairly comprehensive but includes some unnecessary verbosity — the full Python class implementation (~80 lines) could be trimmed or moved to a reference file, and sections like 'Core Concepts' explain things that could be more concise. The SDK examples are lengthy but mostly justified given the complexity.	2 / 3
Actionability	Provides fully executable curl commands, complete Python class implementation, and working SDK examples with proper imports and error handling. The code is copy-paste ready with clear parameter descriptions and real API endpoints.	3 / 3
Workflow Clarity	The workflow is clearly sequenced (Step 1: Create judges → Step 2: Attach to variations → Step 3: Set fallthrough), with important callouts like the warning that the judges array replaces all existing attachments, the note about turnTargetingOn not working, and explicit error handling table. The 'Next Steps' section provides a clear post-workflow checklist.	3 / 3
Progressive Disclosure	The content is quite long and monolithic — the full Python class, both SDK examples, and the API reference could be split into separate files. While it has good section headers and references to related skills and external docs, the inline content is heavy for a single SKILL.md with no bundle files to offload to.	2 / 3
	Total	10 / 12 Passed
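
The replace semantics noted under Workflow Clarity (the judges array replaces all existing attachments) are easy to get wrong. One way to guard against this is to merge the current attachments with the new ones before sending the full list. The following is only a minimal sketch under stated assumptions: the field names `judgeConfigKey` and `samplingRate` and the helper `merge_judges` are hypothetical illustrations, not taken from the skill or from any confirmed API.

```python
from typing import Dict, List

# Hypothetical shape of one judge attachment; field names are assumptions.
Judge = Dict[str, object]


def merge_judges(existing: List[Judge], new: List[Judge],
                 key: str = "judgeConfigKey") -> List[Judge]:
    """Build the complete judges list to send when the API replaces the whole array.

    Entries in `new` override entries in `existing` that share the same key.
    Each judge keeps the position of its first appearance.
    """
    merged: Dict[object, Judge] = {}
    for judge in existing + new:
        # A later entry with the same key overwrites the earlier one in place.
        merged[judge[key]] = dict(judge)
    return list(merged.values())


# Example: keep the judge that is already attached and add a second one.
existing = [{"judgeConfigKey": "accuracy", "samplingRate": 0.5}]
new = [{"judgeConfigKey": "relevance", "samplingRate": 1.0}]
payload_judges = merge_judges(existing, new)
```

Sending `payload_judges` rather than `new` alone would avoid silently dropping the attachment that was already in place, assuming the existing list is fetched first.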

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong in specificity and distinctiveness, clearly articulating concrete actions within a well-defined niche of LLM-as-a-judge evaluation. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know exactly when to select this skill. The trigger terms are somewhat specialized and could benefit from more natural user-facing language.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to set up automated evaluation, judge prompts, or score LLM outputs against quality criteria.'

Include more natural trigger term variations such as 'evaluation', 'auto-eval', 'grading', 'prompt scoring', or 'output quality' to improve matching with user requests.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: 'Attach judges to config variations', 'automatic LLM-as-a-judge evaluation', 'Create custom judges', 'configure sampling rates', 'monitor quality scores'.	3 / 3
Completeness	Clearly answers 'what does this do' with specific actions, but lacks an explicit 'Use when...' clause or equivalent trigger guidance, which caps this dimension at 2 per the rubric guidelines.	2 / 3
Trigger Term Quality	Includes some relevant terms like 'judges', 'LLM-as-a-judge', 'sampling rates', 'quality scores', and 'config variations', but misses common user-facing variations like 'evaluation', 'grading', 'scoring', 'auto-eval', or 'prompt evaluation'. The terminology is somewhat specialized.	2 / 3
Distinctiveness / Conflict Risk	The description targets a very specific niche (LLM-as-a-judge evaluation with config variations, custom judges, and sampling rates), which is unlikely to conflict with other skills.	3 / 3
	Total	10 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: launchdarkly/ai-tooling
Commit: 913b745

Reviewed: 10 days ago
