
advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

45

Quality

47%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation/SKILL.md

Quality

Content

39%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like an academic survey paper than an actionable skill file. While the workflow clarity is strong with good sequencing and decision frameworks, the content is severely bloated with conceptual explanations Claude doesn't need (bias definitions, metric descriptions, scale theory) and lacks executable code implementations. The monolithic structure with no supporting bundle files means everything is crammed into one oversized document.

Suggestions

Cut the 'Core Concepts' section by 70%: remove the explanations of what each bias is and keep only the mitigation patterns, as terse bullet points

Replace prompt templates with actual executable Python code showing a complete evaluation function using an LLM API (e.g., a working direct_score() and pairwise_compare() function)

Split detailed content into bundle files: move the full examples into EXAMPLES.md, the bias mitigation details into BIAS_MITIGATION.md, and the metric selection table into METRICS.md, keeping only a concise overview with links in SKILL.md

Remove the metadata section, limitations boilerplate, and 'When to Use' trigger list—these belong in frontmatter, not the body content
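The second suggestion above asks for working `direct_score()` and `pairwise_compare()` functions. The following is a minimal sketch of what those might look like, not the skill's actual implementation. It assumes a hypothetical `judge` callable that sends a prompt to some LLM API and returns the raw text reply; the prompt wording, the JSON reply shape, and the swap-and-recheck step for position bias are all illustrative assumptions.

```python
import json
from typing import Callable

# Hypothetical judge interface (an assumption, not a real library API):
# takes a prompt string and returns the model's raw text reply.
Judge = Callable[[str], str]


def direct_score(judge: Judge, response: str, criterion: str, scale: int = 5) -> dict:
    """Score one response on one criterion.

    Assumes the judge replies with JSON such as {"score": 4, "reason": "..."}.
    """
    prompt = (
        f"Rate the response on '{criterion}' from 1 to {scale}.\n"
        'Reply with JSON: {"score": <int>, "reason": "<short justification>"}\n\n'
        f"Response:\n{response}"
    )
    result = json.loads(judge(prompt))
    score = int(result["score"])
    if not 1 <= score <= scale:
        raise ValueError(f"score {score} is outside 1..{scale}")
    return {"score": score, "reason": result.get("reason", "")}


def pairwise_compare(judge: Judge, first: str, second: str, criterion: str) -> str:
    """Compare two responses in both orders; return 'first', 'second', or 'tie'.

    A verdict counts only if it survives swapping the presentation order,
    which is one common way to guard against position bias.
    """

    def ask(a: str, b: str) -> str:
        prompt = (
            f"Which response is better on '{criterion}'? Reply with exactly A or B.\n\n"
            f"Response A:\n{a}\n\nResponse B:\n{b}"
        )
        verdict = judge(prompt).strip().upper()
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        return verdict

    pass_1 = ask(first, second)  # `first` is shown in position A
    pass_2 = ask(second, first)  # `first` is shown in position B
    if pass_1 == "A" and pass_2 == "B":
        return "first"
    if pass_1 == "B" and pass_2 == "A":
        return "second"
    return "tie"  # the two passes disagree, so no winner is declared
```

Passing the judge in as a callable keeps the sketch independent of any particular provider SDK and makes it easy to exercise with a stub: a stub that always answers "A" yields `"tie"`, because its preference does not survive the swap.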

Dimension | Reasoning | Score

Conciseness

The skill is extremely verbose at more than 350 lines, explaining many concepts Claude already knows (what position bias is, what Likert scales are, basic metric definitions). The 'Core Concepts' section reads like a textbook chapter rather than actionable instructions. The bias landscape, metric selection table, and extensive conceptual framing add significant token cost without proportional value.

1 / 3

Actionability

The skill provides prompt templates and JSON output examples, which are somewhat actionable, but these are mostly pseudocode templates with placeholders rather than executable code. There are no implementation examples in a programming language: no Python functions, no API calls, no runnable evaluation pipeline code. The prompt structures are useful but not copy-paste ready for any specific framework.

2 / 3

Workflow Clarity

The evaluation pipeline is clearly sequenced with an ASCII diagram showing the flow from input through criteria loading, scoring, bias mitigation, confidence scoring, to output. The pairwise comparison protocol has explicit numbered steps with a consistency check/feedback loop. The decision tree for choosing between approaches is well-structured. Anti-patterns section provides clear problem-solution pairs.

3 / 3

Progressive Disclosure

The content is a monolithic wall of text with no bundle files to reference. Everything is inline in a single massive document: the bias landscape, metric selection framework, all implementation patterns, all examples, and references. The 'Integration' and 'References' sections mention other files, but none are provided. Content like the full rubric generation example and detailed bias descriptions could easily be split into separate reference files.

1 / 3

Total: 7 / 12

Passed

Description

54%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is essentially all trigger terms with no capability statement, making it a 'when without what' case. While the trigger terms are excellent and highly distinctive for the LLM evaluation domain, the complete absence of a description of what the skill actually does (e.g., what outputs it produces, what techniques it teaches) is a critical gap that would leave Claude unable to fully understand the skill's purpose.

Suggestions

Add a capability statement before the trigger clause, e.g., 'Guides implementation of LLM-as-judge evaluation systems, including designing scoring rubrics, building pairwise comparison pipelines, and applying debiasing techniques for automated quality assessment.'

Restructure to follow the 'what + when' pattern: lead with concrete actions the skill performs (creates evaluation pipelines, generates rubrics, implements bias mitigation strategies), then follow with the existing 'Use when...' trigger clause.

Dimension | Reasoning | Score

Specificity

The description mentions some actions like 'implement LLM-as-judge', 'compare model outputs', 'create evaluation rubrics', and 'mitigate evaluation bias', but these are embedded within trigger phrases rather than stated as concrete capabilities the skill performs. There's no clear 'what it does' statement listing specific actions.

2 / 3

Completeness

The description only addresses 'when' (trigger conditions) but completely lacks a 'what does this do' section. There is no explanation of the skill's capabilities or concrete actions it performs. The entire description is a 'Use when...' clause with no preceding capability statement.

1 / 3

Trigger Term Quality

Excellent coverage of natural trigger terms users would say: 'LLM-as-judge', 'compare model outputs', 'evaluation rubrics', 'evaluation bias', 'direct scoring', 'pairwise comparison', 'position bias', 'evaluation pipelines', 'automated quality assessment'. These are terms practitioners in this domain would naturally use.

3 / 3

Distinctiveness / Conflict Risk

The description targets a very specific niche, LLM-as-judge evaluation patterns, with highly distinctive trigger terms like 'position bias', 'pairwise comparison', and 'LLM-as-judge' that are unlikely to conflict with other skills.

3 / 3

Total: 9 / 12

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

Criteria | Description | Result

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

Total: 10 / 11

Passed
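The `frontmatter_unknown_keys` warning above flags frontmatter keys that fall outside the spec. A rough sketch of such a check follows. The `allowed` set is supplied by the caller and is an assumption here, since this review does not list which keys the spec accepts; the function name is hypothetical, and the parsing handles only simple top-level `key: value` lines rather than full YAML.

```python
def unknown_frontmatter_keys(skill_text: str, allowed: set[str]) -> list[str]:
    """Return top-level frontmatter keys that are not in `allowed`.

    Assumes the frontmatter is delimited by '---' lines at the top of the file.
    The allowed set is a caller-provided assumption, not the real spec.
    """
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter: stop before the document body
        # Keep only unindented `key: value` lines; skip comments,
        # list items, and nested (indented) content.
        is_top_level = bool(line) and not line[0].isspace()
        if is_top_level and not line.startswith(("#", "-")) and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return [key for key in keys if key not in allowed]
```

For example, with a hypothetical allowed set of `{"name", "description", "metadata"}`, a frontmatter block that also contains a top-level `risk:` key would return `["risk"]`.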

Repository
sickn33/antigravity-awesome-skills
Reviewed
