
advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.


Quality

41%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation/SKILL.md

Quality

Discovery

54%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at providing trigger terms and establishing a clear, distinctive niche around LLM-based evaluation, but critically fails to describe what the skill actually does. It reads entirely as a 'Use when...' clause without any preceding capability statement, making it incomplete despite strong trigger coverage.

Suggestions

Add a clear 'what it does' statement at the beginning listing concrete capabilities, e.g., 'Implements LLM-as-judge evaluation pipelines, creates scoring rubrics, performs pairwise comparisons of model outputs, and applies bias mitigation techniques.'

Restructure to follow the pattern: capabilities first, then 'Use when...' triggers, e.g., 'Builds automated evaluation systems for LLM outputs using direct scoring and pairwise comparison methods. Use when the user asks to...'

Dimension | Reasoning | Score

Specificity | The description mentions some actions like 'implement LLM-as-judge', 'compare model outputs', 'create evaluation rubrics', and 'mitigate evaluation bias', but these are embedded within trigger phrases rather than stated as concrete capabilities the skill performs. There's no clear 'what it does' statement listing specific actions. | 2 / 3

Completeness | The description only addresses 'when' (trigger conditions) but completely lacks a 'what does this do' section. There is no explanation of the skill's capabilities or concrete actions it performs. Per the rubric, missing 'what' qualifies as a score of 1. | 1 / 3

Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'LLM-as-judge', 'compare model outputs', 'evaluation rubrics', 'evaluation bias', 'direct scoring', 'pairwise comparison', 'position bias', 'evaluation pipelines', 'automated quality assessment'. These are terms practitioners in this domain would naturally use. | 3 / 3

Distinctiveness Conflict Risk | The description targets a very specific niche (LLM-based evaluation and judging) with highly distinctive trigger terms like 'LLM-as-judge', 'pairwise comparison', 'position bias', and 'evaluation rubrics' that are unlikely to conflict with other skills. | 3 / 3

Total | | 9 / 12

Passed

Implementation

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is comprehensive in coverage but severely over-engineered for a SKILL.md file. It reads more like a tutorial or survey paper than an actionable skill, explaining many concepts Claude already understands (evaluation taxonomy, statistical metrics, bias types) at length. The lack of executable code and the monolithic structure significantly reduce its practical utility as a skill file.

Suggestions

Cut the content by 60-70%: remove the bias landscape descriptions (Claude knows these), the metric selection table, the 'When to Use' section, and the ASCII pipeline diagram. Focus on the novel patterns: prompt templates, position-swap protocol, and rubric generation format.

Extract detailed content into bundle files: move the full examples, anti-patterns catalog, metric selection guide, and scaling strategies into separate referenced files (e.g., EXAMPLES.md, ANTI_PATTERNS.md, METRICS.md).

Add executable implementation code: provide actual Python functions that implement the evaluation pipeline (e.g., a `run_pairwise_comparison()` function that calls an LLM API, swaps positions, and aggregates results) rather than just prompt templates with placeholders.

Remove the metadata section, limitations boilerplate, and 'When to Use' trigger list—these belong in YAML frontmatter, not the skill body.
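
To make the third suggestion concrete, here is a minimal sketch of what a `run_pairwise_comparison()` function could look like. It is not taken from the reviewed skill. The `judge` callable, the `"A"`/`"B"`/`"tie"` verdict labels, and the rule of falling back to a tie when the two passes disagree are all assumptions chosen for illustration; in practice `judge` would wrap a call to whichever LLM API is in use.

```python
from typing import Callable, Dict, List, Literal, Union

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature (an assumption, not the skill's API):
# (prompt, first_shown_response, second_shown_response) -> verdict,
# where "A" means the first-shown response won and "B" the second-shown.
Judge = Callable[[str, str, str], Verdict]


def run_pairwise_comparison(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Judge,
) -> Dict[str, Union[str, bool, List[str]]]:
    """Compare two responses twice, swapping their positions between passes."""
    # Pass 1: response_a is shown first, so the verdict needs no remapping.
    first_pass = judge(prompt, response_a, response_b)

    # Pass 2: positions are swapped, so the judge's "A" now refers to
    # response_b. Map the verdict back onto the original labels.
    swapped = judge(prompt, response_b, response_a)
    second_pass = {"A": "B", "B": "A", "tie": "tie"}[swapped]

    # Consistency check: a position-independent judge agrees with itself.
    consistent = first_pass == second_pass

    # Assumed aggregation rule: disagreement between passes becomes a tie.
    return {
        "winner": first_pass if consistent else "tie",
        "consistent": consistent,
        "passes": [first_pass, second_pass],
    }
```

With a stub judge that always prefers whichever response is shown first, the two passes disagree and the function reports an inconsistent result with a tie, which is the position-bias signal the swap is meant to surface.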

Dimension | Reasoning | Score

Conciseness | The skill is extremely verbose at ~350+ lines. It explains concepts Claude already knows well (what position bias is, what Spearman's ρ measures, basic evaluation taxonomy). The 'Key insight' callout, the 'When to Use' section listing 7 bullet points, the bias landscape descriptions, and the metric selection table all add significant token overhead for information Claude can infer or already possesses. The ASCII pipeline diagram is particularly wasteful. | 1 / 3

Actionability | The skill provides prompt templates and JSON output examples, which is useful. However, none of the code is truly executable: the prompt templates use placeholder syntax ({prompt}, {response}) without a surrounding implementation, and there's no actual Python/code that could be run to build an evaluation pipeline. The examples show input/output pairs but not the code that produces them. | 2 / 3

Workflow Clarity | The pairwise comparison position-swap protocol is well-sequenced with clear steps (1-4) and includes a consistency check. However, the overall evaluation pipeline is described as an ASCII diagram rather than actionable steps, and there are no explicit validation checkpoints for the broader pipeline design (e.g., how to verify rubric quality, how to validate that bias mitigation is working). The decision framework is helpful but the pipeline lacks error recovery guidance. | 2 / 3

Progressive Disclosure | This is a monolithic wall of text with no bundle files to reference. All content (taxonomy, bias landscape, metric selection, implementation details, examples, anti-patterns, scaling guidance, references) is crammed into a single file. The 'Integration' and 'References' sections mention other skills and external links but don't offload any content. Content like the full rubric generation example, the metric selection table, and the bias catalog should be in separate reference files. | 1 / 3

Total | | 6 / 12

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

Criteria | Description | Result

frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning

Total | | 10 / 11

Passed
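
The review does not say which frontmatter keys triggered the `frontmatter_unknown_keys` warning. As a rough sketch of how such a check might work, the function below scans the top-level keys of a SKILL.md frontmatter block and reports any that fall outside an allowed set. The `ALLOWED_KEYS` set, the function name, and the sample key `risk` are assumptions for illustration only, not the validator's actual rules, and the parsing is deliberately naive (top-level `key: value` lines only, no YAML library).

```python
from typing import List, Set

# Assumed allow-list for illustration; the real validator's list may differ.
ALLOWED_KEYS: Set[str] = {"name", "description", "license", "metadata"}


def find_unknown_frontmatter_keys(
    skill_md: str, allowed: Set[str] = ALLOWED_KEYS
) -> List[str]:
    """Return top-level frontmatter keys that are not in the allowed set."""
    lines = skill_md.splitlines()
    # Frontmatter must open with a '---' line; otherwise there is nothing to check.
    if not lines or lines[0].strip() != "---":
        return []

    unknown: List[str] = []
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the frontmatter
            break
        # Skip blank lines, comments, and indented (nested) entries.
        if not line or line[0].isspace() or line.startswith("#"):
            continue
        if ":" in line:
            key = line.split(":", 1)[0].strip()
            if key not in allowed:
                unknown.append(key)
    return unknown


# Hypothetical SKILL.md content; 'risk' is a made-up key for the example.
sample = """---
name: advanced-evaluation
description: Example description.
risk: safe
metadata:
  version: 1
---
# Body
"""
print(find_unknown_frontmatter_keys(sample))
```

Following the warning's own advice, a key flagged this way would either be removed or nested under `metadata`, where the indented-line rule above no longer treats it as top-level.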

Repository
sickn33/antigravity-awesome-skills
Reviewed
