advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Quality

47%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation/SKILL.md

Quality

Content

39%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like an academic survey paper than an actionable skill file. While the workflow clarity is strong with good sequencing and decision frameworks, the content is severely bloated with conceptual explanations Claude doesn't need (bias definitions, metric descriptions, scale theory) and lacks executable code implementations. The monolithic structure with no supporting bundle files means everything is crammed into one oversized document.

Suggestions

Cut the 'Core Concepts' section by 70%—remove explanations of what biases are and instead provide only the mitigation patterns as terse bullet points

Replace prompt templates with actual executable Python code showing a complete evaluation function using an LLM API (e.g., a working direct_score() and pairwise_compare() function)

Split detailed content into bundle files: move the full examples into EXAMPLES.md, the bias mitigation details into BIAS_MITIGATION.md, and the metric selection table into METRICS.md, keeping only a concise overview with links in SKILL.md

Remove the metadata section, limitations boilerplate, and 'When to Use' trigger list—these belong in frontmatter, not the body content
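
As a rough illustration of the second suggestion, a minimal Python sketch of `direct_score()` and `pairwise_compare()` might look like the following. It is not taken from the skill under review: the `Judge` callable, the prompt wording, and the JSON reply shape are all assumptions made for illustration, and a real version would wrap an actual LLM API client. The pairwise function runs the comparison twice with the responses swapped and only declares a winner when both orderings agree, which is one common way to mitigate position bias.

```python
import json
from typing import Callable

# Hypothetical judge interface (an assumption, not part of the skill):
# takes a prompt string and returns the model's raw text reply.
Judge = Callable[[str], str]


def direct_score(judge: Judge, response: str, criterion: str, scale_max: int = 5) -> dict:
    """Score one response on one criterion; expects a JSON reply from the judge."""
    prompt = (
        f"Rate the response on '{criterion}' from 1 to {scale_max}.\n"
        'Reply with JSON: {"score": <int>, "justification": "<text>"}\n\n'
        f"Response:\n{response}"
    )
    result = json.loads(judge(prompt))
    score = int(result["score"])
    if not 1 <= score <= scale_max:
        raise ValueError(f"score {score} is outside 1..{scale_max}")
    return {"score": score, "justification": result.get("justification", "")}


def _compare_once(judge: Judge, first: str, second: str, criterion: str) -> str:
    """Ask the judge once; 'A' refers to `first`, 'B' to `second`."""
    prompt = (
        f"Which response is better on '{criterion}'?\n"
        'Reply with JSON: {"winner": "A" | "B" | "TIE"}\n\n'
        f"Response A:\n{first}\n\nResponse B:\n{second}"
    )
    winner = json.loads(judge(prompt))["winner"]
    if winner not in ("A", "B", "TIE"):
        raise ValueError(f"unexpected winner value: {winner!r}")
    return winner


def pairwise_compare(judge: Judge, response_a: str, response_b: str, criterion: str) -> dict:
    """Compare in both orders; report a winner only if the two verdicts agree."""
    forward = _compare_once(judge, response_a, response_b, criterion)
    swapped = _compare_once(judge, response_b, response_a, criterion)
    # Map the swapped-order verdict back onto the original A/B labels.
    swapped_mapped = {"A": "B", "B": "A", "TIE": "TIE"}[swapped]
    consistent = forward == swapped_mapped
    return {"winner": forward if consistent else "TIE", "consistent": consistent}
```

With this shape, a judge that always picks whichever response is shown first yields `consistent: False` and a tie, so the disagreement is surfaced instead of silently favoring one position.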

Dimension	Reasoning	Score
Conciseness	The skill is extremely verbose at ~350+ lines, explaining many concepts Claude already knows (what position bias is, what Likert scales are, basic metric definitions). The 'Core Concepts' section reads like a textbook chapter rather than actionable instructions. The bias landscape, metric selection table, and extensive conceptual framing add significant token cost without proportional value.	1 / 3
Actionability	The skill provides prompt templates and JSON output examples which are somewhat actionable, but the code is mostly pseudocode/templates with placeholders rather than executable code. There are no actual implementation examples in a programming language—no Python functions, no API calls, no runnable evaluation pipeline code. The prompt structures are useful but not copy-paste ready for any specific framework.	2 / 3
Workflow Clarity	The evaluation pipeline is clearly sequenced with an ASCII diagram showing the flow from input through criteria loading, scoring, bias mitigation, confidence scoring, to output. The pairwise comparison protocol has explicit numbered steps with a consistency check/feedback loop. The decision tree for choosing between approaches is well-structured. Anti-patterns section provides clear problem-solution pairs.	3 / 3
Progressive Disclosure	The content is a monolithic wall of text with no bundle files to reference. Everything is inline in a single massive document—the bias landscape, metric selection framework, all implementation patterns, all examples, and references. The 'Integration' and 'References' sections mention other files but none are provided. Content like the full rubric generation example and detailed bias descriptions could easily be split into separate reference files.	1 / 3
	Total	7 / 12 Passed

Description

54%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is essentially all trigger terms with no capability statement, making it a 'when without what' case. While the trigger terms are excellent and highly distinctive for the LLM evaluation domain, the complete absence of a description of what the skill actually does (e.g., what outputs it produces, what techniques it teaches) is a critical gap that would leave Claude unable to fully understand the skill's purpose.

Suggestions

Add a capability statement before the trigger clause, e.g., 'Guides implementation of LLM-as-judge evaluation systems, including designing scoring rubrics, building pairwise comparison pipelines, and applying debiasing techniques for automated quality assessment.'

Restructure to follow the 'what + when' pattern: lead with concrete actions the skill performs (creates evaluation pipelines, generates rubrics, implements bias mitigation strategies), then follow with the existing 'Use when...' trigger clause.

Dimension	Reasoning	Score
Specificity	The description mentions some actions like 'implement LLM-as-judge', 'compare model outputs', 'create evaluation rubrics', and 'mitigate evaluation bias', but these are embedded within trigger phrases rather than stated as concrete capabilities the skill performs. There's no clear 'what it does' statement listing specific actions.	2 / 3
Completeness	The description only addresses 'when' (trigger conditions) but completely lacks a 'what does this do' section. There is no explanation of the skill's capabilities or concrete actions it performs. The entire description is a 'Use when...' clause with no preceding capability statement.	1 / 3
Trigger Term Quality	Excellent coverage of natural trigger terms users would say: 'LLM-as-judge', 'compare model outputs', 'evaluation rubrics', 'evaluation bias', 'direct scoring', 'pairwise comparison', 'position bias', 'evaluation pipelines', 'automated quality assessment'. These are terms practitioners in this domain would naturally use.	3 / 3
Distinctiveness / Conflict Risk	The description targets a very specific niche, LLM-as-judge evaluation patterns, with highly distinctive trigger terms like 'position bias', 'pairwise comparison', and 'LLM-as-judge' that are unlikely to conflict with other skills.	3 / 3
	Total	9 / 12 Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	10 / 11 Passed

Repository: sickn33/antigravity-awesome-skills
Commit: be20b37

Reviewed: 1 day ago
