advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Quality

47%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Securityby

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/advanced-evaluation/SKILL.md

Quality

Discovery

54%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at trigger term coverage and distinctiveness, clearly carving out a niche around LLM-based evaluation. However, it critically lacks a 'what does this do' component — it's entirely structured as a 'Use when...' clause without first describing the skill's concrete capabilities. This inverted structure means Claude knows when to select it but not what it actually does.

Suggestions

Add a leading capability statement before the trigger clause, e.g., 'Implements LLM-as-judge evaluation pipelines, creates scoring rubrics, performs pairwise model comparisons, and applies bias mitigation techniques for automated quality assessment.'

Restructure to follow the 'what + when' pattern: start with concrete actions the skill performs (e.g., 'Designs evaluation rubrics, builds direct scoring and pairwise comparison systems, mitigates position bias in LLM evaluations'), then follow with the 'Use when...' clause.

Dimension	Reasoning	Score
Specificity	The description mentions some actions like 'implement LLM-as-judge', 'compare model outputs', 'create evaluation rubrics', and 'mitigate evaluation bias', but these are embedded in a 'Use when' clause rather than stated as concrete capabilities the skill performs. There's no clear 'what does this do' section listing specific actions.	2 / 3
Completeness	The description is entirely a 'when to use' clause with no explicit 'what does this do' statement. It lacks a clear description of the skill's capabilities and concrete actions it performs, only listing trigger scenarios.	1 / 3
Trigger Term Quality	Excellent coverage of natural trigger terms users would say: 'LLM-as-judge', 'compare model outputs', 'evaluation rubrics', 'evaluation bias', 'direct scoring', 'pairwise comparison', 'position bias', 'evaluation pipelines', 'automated quality assessment'. These are terms practitioners in this domain would naturally use.	3 / 3
Distinctiveness Conflict Risk	The description targets a very specific niche — LLM-as-judge evaluation patterns — with highly distinctive trigger terms like 'position bias', 'pairwise comparison', and 'LLM-as-judge' that are unlikely to conflict with other skills.	3 / 3
	Total	9 / 12 Passed

Implementation

39%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is comprehensive in coverage but severely over-engineered for a SKILL.md file. It reads more like an academic survey or tutorial than a concise, actionable reference—explaining concepts Claude already understands (bias types, statistical metrics, Likert scales) at length. The workflow clarity is genuinely strong with good sequencing and validation steps, but the content desperately needs to be split into an overview with linked reference files rather than presented as a single monolithic document.

Suggestions

Cut the content by 60-70%: remove explanations of concepts Claude already knows (what position bias is, what F1 score means, what Likert scales are) and keep only the actionable patterns and templates.

Split into SKILL.md (concise overview + decision framework + quick-start templates) with linked files like BIAS-MITIGATION.md, RUBRIC-PATTERNS.md, and METRICS-REFERENCE.md for detailed content.

Replace prompt template placeholders with fully executable Python code showing actual API calls (e.g., using the Anthropic SDK) to make the skill truly copy-paste actionable.

Remove the 'When to Use' section entirely—the YAML frontmatter triggers handle this, and listing obvious activation phrases wastes tokens.

Dimension	Reasoning	Score
Conciseness	The skill is extremely verbose at ~350+ lines. It explains concepts Claude already knows well (what position bias is, what Likert scales are, basic metric definitions like F1/precision/recall). The 'Key insight' callout, the 'When to Use' section listing obvious triggers, and extensive conceptual explanations of bias types all consume tokens without adding actionable value. The evaluation taxonomy section reads like a textbook rather than a reference.	1 / 3
Actionability	The skill provides prompt templates and JSON output examples which are somewhat actionable, but they are pseudocode-level templates with placeholders rather than fully executable code. There are no actual runnable implementations—no Python functions, no API calls, no concrete evaluation pipeline code. The prompt structures are useful but remain templates rather than copy-paste ready solutions.	2 / 3
Workflow Clarity	The pairwise comparison position bias mitigation protocol is clearly sequenced with explicit validation (consistency check between passes). The evaluation pipeline diagram shows clear flow. The decision tree for choosing direct vs. pairwise is well-structured. The anti-patterns section effectively highlights failure modes with solutions. The position swap example demonstrates a complete feedback loop with error recovery (disagreement → TIE).	3 / 3
Progressive Disclosure	The skill is a monolithic wall of text with no content split into separate files. Despite its length (~350+ lines of dense content), everything is inline. The 'Integration' and 'References' sections mention other files but the core content—rubric generation details, bias mitigation techniques, metric selection—could all be separate reference documents linked from a concise overview. The skill would benefit enormously from splitting into overview + detailed reference files.	1 / 3
	Total	7 / 12 Passed

Validation

90%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	10 / 11 Passed

Repository: sickn33/antigravity-awesome-skills
Commit: 1a9f5ac

Reviewed: 2 days ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.