This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
41%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation/SKILL.md
Quality
Discovery
54%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at providing trigger terms and establishing a clear, distinctive niche around LLM-based evaluation, but critically fails to describe what the skill actually does. It reads entirely as a 'Use when...' clause without any preceding capability statement, making it incomplete despite strong trigger coverage.
Suggestions
Add a clear 'what it does' statement at the beginning listing concrete capabilities, e.g., 'Implements LLM-as-judge evaluation pipelines, creates scoring rubrics, performs pairwise comparisons of model outputs, and applies bias mitigation techniques.'
Restructure to follow the pattern: capabilities first, then 'Use when...' triggers, e.g., 'Builds automated evaluation systems for LLM outputs using direct scoring and pairwise comparison methods. Use when the user asks to...'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions some actions like 'implement LLM-as-judge', 'compare model outputs', 'create evaluation rubrics', and 'mitigate evaluation bias', but these are embedded within trigger phrases rather than stated as concrete capabilities the skill performs. There's no clear 'what it does' statement listing specific actions. | 2 / 3 |
| Completeness | The description only addresses 'when' (trigger conditions) but completely lacks a 'what does this do' section. There is no explanation of the skill's capabilities or concrete actions it performs. Per the rubric, missing 'what' qualifies as a score of 1. | 1 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'LLM-as-judge', 'compare model outputs', 'evaluation rubrics', 'evaluation bias', 'direct scoring', 'pairwise comparison', 'position bias', 'evaluation pipelines', 'automated quality assessment'. These are terms practitioners in this domain would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description targets a very specific niche, LLM-based evaluation and judging, with highly distinctive trigger terms like 'LLM-as-judge', 'pairwise comparison', 'position bias', and 'evaluation rubrics' that are unlikely to conflict with other skills. | 3 / 3 |
| Total | | 9 / 12 Passed |
Implementation
27%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is comprehensive in coverage but severely over-engineered for a SKILL.md file. It reads more like a tutorial or survey paper than an actionable skill, explaining many concepts Claude already understands (evaluation taxonomy, statistical metrics, bias types) at length. The lack of executable code and the monolithic structure significantly reduce its practical utility as a skill file.
Suggestions
Cut the content by 60-70%: remove the bias landscape descriptions (Claude knows these), the metric selection table, the 'When to Use' section, and the ASCII pipeline diagram. Focus on the novel patterns: prompt templates, position-swap protocol, and rubric generation format.
Extract detailed content into bundle files: move the full examples, anti-patterns catalog, metric selection guide, and scaling strategies into separate referenced files (e.g., EXAMPLES.md, ANTI_PATTERNS.md, METRICS.md).
Add executable implementation code: provide actual Python functions that implement the evaluation pipeline (e.g., a `run_pairwise_comparison()` function that calls an LLM API, swaps positions, and aggregates results) rather than just prompt templates with placeholders.
Remove the metadata section, limitations boilerplate, and 'When to Use' trigger list—these belong in YAML frontmatter, not the skill body.
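The `run_pairwise_comparison()` function named in the suggestions above might look roughly like the following. This is a minimal sketch, not the skill's own code: the `judge` callable, its `"first"`/`"second"`/`"tie"` return values, the `aggregate()` helper, and the two stub judges are all assumptions made for illustration. A real judge would wrap an LLM API call.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge signature (an assumption, not from the skill):
# (prompt, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def run_pairwise_comparison(prompt: str, response_a: str, response_b: str,
                            judge: Judge) -> dict:
    """Judge A vs. B in both orders; keep a winner only if both passes agree."""
    # Pass 1 shows A first. Pass 2 swaps positions, so B is shown first.
    pass_1 = judge(prompt, response_a, response_b)
    pass_2 = judge(prompt, response_b, response_a)

    # Map positional verdicts back to the underlying responses.
    verdict_1 = {"first": "A", "second": "B", "tie": "tie"}[pass_1]
    verdict_2 = {"first": "B", "second": "A", "tie": "tie"}[pass_2]

    # Consistency check: disagreement across orders is treated as a tie.
    consistent = verdict_1 == verdict_2
    return {
        "winner": verdict_1 if consistent else "tie",
        "consistent": consistent,
        "verdicts": [verdict_1, verdict_2],
    }


def aggregate(results: list[dict]) -> dict:
    """Summarize many comparisons: win counts and share of consistent verdicts."""
    wins = Counter(r["winner"] for r in results)
    rate = sum(r["consistent"] for r in results) / len(results) if results else 0.0
    return {"wins": dict(wins), "consistency_rate": rate}


# Stub judges for demonstration only.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased judge


def prefers_longer(prompt: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the position-biased stub, the two passes disagree and the result falls back to a tie, which is the behavior the position-swap step is meant to expose. A low `consistency_rate` over many comparisons would be one signal that the judge's verdicts depend on ordering.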
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~350+ lines. It explains concepts Claude already knows well (what position bias is, what Spearman's ρ measures, basic evaluation taxonomy). The 'Key insight' callout, the 'When to Use' section listing 7 bullet points, the bias landscape descriptions, and the metric selection table all add significant token overhead for information Claude can infer or already possesses. The ASCII pipeline diagram is particularly wasteful. | 1 / 3 |
| Actionability | The skill provides prompt templates and JSON output examples, which is useful. However, none of the code is truly executable: the prompt templates use placeholder syntax ({prompt}, {response}) without a surrounding implementation, and there's no actual Python/code that could be run to build an evaluation pipeline. The examples show input/output pairs but not the code that produces them. | 2 / 3 |
| Workflow Clarity | The pairwise comparison position-swap protocol is well-sequenced with clear steps (1-4) and includes a consistency check. However, the overall evaluation pipeline is described as an ASCII diagram rather than actionable steps, and there are no explicit validation checkpoints for the broader pipeline design (e.g., how to verify rubric quality, how to validate that bias mitigation is working). The decision framework is helpful but the pipeline lacks error recovery guidance. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no bundle files to reference. All content (taxonomy, bias landscape, metric selection, implementation details, examples, anti-patterns, scaling guidance, references) is crammed into a single file. The 'Integration' and 'References' sections mention other skills and external links but don't offload any content. Content like the full rubric generation example, the metric selection table, and the bias catalog should be in separate reference files. | 1 / 3 |
| Total | | 6 / 12 Passed |
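The Actionability and Workflow Clarity rows note that the templates' `{prompt}` and `{response}` placeholders have no surrounding implementation and that the pipeline lacks error recovery. One way that gap could be closed is sketched below, under stated assumptions: the template wording, the 1-5 scale, the `call_llm` callable, and the shape of the returned dict are all hypothetical and do not come from the reviewed skill.

```python
import json
from typing import Callable

# Hypothetical template. Only the {prompt}/{response} placeholder syntax comes
# from the review; the wording and the 1-5 scale are assumptions.
DIRECT_SCORING_TEMPLATE = (
    "Rate the response to the prompt on a 1-5 scale.\n"
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    'Reply with JSON only: {{"score": <int>, "justification": "<text>"}}'
)


def score_response(prompt: str, response: str,
                   call_llm: Callable[[str], str]) -> dict:
    """Fill the template, call the judge, and validate its JSON reply."""
    filled = DIRECT_SCORING_TEMPLATE.format(prompt=prompt, response=response)
    raw = call_llm(filled)  # call_llm is a stand-in for a real LLM API call

    # Error recovery: never trust the judge to return well-formed output.
    try:
        parsed = json.loads(raw)
        score = int(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"ok": False, "error": "unparseable judge output", "raw": raw}

    if not 1 <= score <= 5:
        return {"ok": False, "error": "score out of range", "raw": raw}

    return {
        "ok": True,
        "score": score,
        "justification": str(parsed.get("justification", "")),
    }
```

Returning a structured failure instead of raising lets a caller decide whether to retry, skip the item, or flag it for review, which is one possible form of the validation checkpoint the review asks for.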
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
a4d3e3a