evaluation

tessl i github:muratcankoylan/Agent-Skills-for-Context-Engineering --skill evaluation

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.

Overall: 58%

Validation: 88%

Implementation: 57%

Activation: 37%

Validation

88%

Warnings & errors only

Criteria	Description	Result
metadata_version	'metadata' field is not a dictionary	Warning
license_field	'license' field is missing	Warning

	Total	14 / 16 Passed
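
The two warnings above can be reproduced locally before publishing. The following is a minimal sketch, assuming the SKILL.md frontmatter has already been parsed into a Python dictionary; `check_frontmatter` is a hypothetical helper written for illustration, not the registry's actual validator, and it assumes the `metadata` check applies only when the field is present.

```python
from typing import Any


def check_frontmatter(frontmatter: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (criterion, message) pairs for the two checks flagged in the table.

    Hypothetical helper: the real validator's rules may differ.
    """
    warnings: list[tuple[str, str]] = []

    # metadata_version: assume 'metadata', when present, must be a mapping.
    if "metadata" in frontmatter and not isinstance(frontmatter["metadata"], dict):
        warnings.append(("metadata_version", "'metadata' field is not a dictionary"))

    # license_field: assume an absent or empty 'license' counts as missing.
    if not frontmatter.get("license"):
        warnings.append(("license_field", "'license' field is missing"))

    return warnings


if __name__ == "__main__":
    for criterion, message in check_frontmatter({"name": "evaluation", "metadata": "1.0"}):
        print(f"{criterion}\t{message}\tWarning")
```

Returning the criterion name alongside the message keeps the output aligned with the table columns, so a fixed frontmatter should yield an empty list.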

Implementation

57%

This skill provides a comprehensive overview of agent evaluation concepts with good organization and structure. However, it leans too heavily on explanation rather than actionable, executable guidance. The code examples are incomplete pseudocode, and the workflow lacks explicit validation checkpoints that would be critical for building reliable evaluation pipelines.

Suggestions

Replace pseudocode with complete, executable evaluation code including actual rubric definitions and scoring logic

Add explicit validation checkpoints to the evaluation workflow (e.g., 'Verify test set coverage before running', 'Check for score distribution anomalies after evaluation')

Remove explanatory paragraphs about why evaluation is hard; focus on concrete solutions Claude can implement

Add a concrete example of an LLM-as-judge prompt template that can be directly used
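
To illustrate the first and last suggestions, here is a minimal sketch of an explicit rubric, an LLM-as-judge prompt template, and weighted scoring logic. The dimension names, weights, level descriptions, and prompt wording are all assumptions made for this example and are not taken from the skill; the call to the judge model is deliberately left out, so `parse_scores` simply expects the judge's raw JSON reply as a string.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float
    levels: dict[int, str]  # integer score -> what that score means


# Hypothetical rubric; replace dimensions, weights, and levels with your own.
RUBRIC = [
    Dimension("accuracy", 0.5, {1: "Major factual errors", 2: "Minor errors", 3: "Fully correct"}),
    Dimension("completeness", 0.3, {1: "Misses the task", 2: "Partially covers it", 3: "Covers every requirement"}),
    Dimension("clarity", 0.2, {1: "Hard to follow", 2: "Understandable", 3: "Clear and well organized"}),
]

JUDGE_PROMPT = """You are grading an agent's response against a rubric.

Task:
{task}

Agent response:
{response}

Rubric:
{rubric}

Score each dimension using only the integer levels listed for it.
Reply with a JSON object only, mapping each dimension name to its score,
for example: {example}
"""


def render_rubric(rubric: list[Dimension]) -> str:
    """Format the rubric as plain text for inclusion in the judge prompt."""
    lines = []
    for dim in rubric:
        lines.append(f"- {dim.name} (weight {dim.weight}):")
        lines.extend(f"    {score}: {text}" for score, text in sorted(dim.levels.items()))
    return "\n".join(lines)


def build_judge_prompt(task: str, response: str, rubric: list[Dimension] = RUBRIC) -> str:
    example = json.dumps({dim.name: max(dim.levels) for dim in rubric})
    return JUDGE_PROMPT.format(
        task=task, response=response, rubric=render_rubric(rubric), example=example
    )


def parse_scores(raw: str, rubric: list[Dimension] = RUBRIC) -> dict[str, int]:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for dim in rubric:
        value = data.get(dim.name)
        if not isinstance(value, int) or value not in dim.levels:
            raise ValueError(f"invalid or missing score for {dim.name!r}")
    return {dim.name: data[dim.name] for dim in rubric}


def weighted_score(scores: dict[str, int], rubric: list[Dimension] = RUBRIC) -> float:
    """Combine per-dimension scores into a single value between 0 and 1."""
    total_weight = sum(dim.weight for dim in rubric)
    return sum(dim.weight * scores[dim.name] / max(dim.levels) for dim in rubric) / total_weight
```

Keeping the rubric as data means the same definition drives both the prompt text and the score validation, so the two cannot drift apart.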

Dimension	Reasoning	Score
Conciseness	The content includes some unnecessary explanatory text (e.g., explaining what non-determinism means, general descriptions of why evaluation matters) that Claude would already understand. The 95% finding table and core concepts are valuable, but sections like 'Evaluation Challenges' explain concepts at length rather than providing actionable guidance.	2 / 3
Actionability	The skill provides some concrete examples (test set structure, simple evaluation function) but much of the content is descriptive rather than executable. The code examples are incomplete pseudocode (e.g., 'load_rubric()', 'assess_dimension()' are undefined). The rubric descriptions mention levels but don't provide actual scoring implementations.	2 / 3
Workflow Clarity	The 'Building Evaluation Frameworks' section provides a numbered sequence, but lacks validation checkpoints and feedback loops. For a skill involving quality gates and continuous evaluation, there's no explicit guidance on what to do when evaluations fail or how to iterate on failing tests.	2 / 3
Progressive Disclosure	The content is well-organized with clear sections, a single reference to detailed metrics documentation, and appropriate use of headers. The structure moves from concepts to practical guidance to examples without deeply nested references.	3 / 3
	Total	9 / 12 Passed
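
The Workflow Clarity finding asks for validation checkpoints and guidance on what happens when evaluations fail. One way this might look is sketched below, assuming test cases are dictionaries with a `category` key and that final scores are normalized to the range 0 to 1. The function names, the 0.05 spread threshold, and the 0.7 pass threshold are illustrative assumptions, not values from the skill.

```python
from collections import Counter
from statistics import pstdev


def check_coverage(test_cases: list[dict], required_categories: set[str],
                   min_per_category: int = 1) -> list[str]:
    """Checkpoint before running: every required category has enough cases."""
    counts = Counter(case["category"] for case in test_cases)
    return [
        f"category {name!r} has {counts[name]} case(s), need {min_per_category}"
        for name in sorted(required_categories)
        if counts[name] < min_per_category
    ]


def check_distribution(scores: list[float], min_stdev: float = 0.05) -> list[str]:
    """Checkpoint after running: flag score distributions that look anomalous."""
    if not scores:
        return ["no scores produced"]
    problems = []
    if any(not 0.0 <= s <= 1.0 for s in scores):
        problems.append("score outside [0, 1]")
    # Near-identical scores often mean the judge is not discriminating.
    if len(scores) > 1 and pstdev(scores) < min_stdev:
        problems.append("scores are nearly identical")
    return problems


def quality_gate(scores: list[float], threshold: float = 0.7) -> tuple[bool, list[int]]:
    """Return (passed, indices of failing cases) so failures can be re-examined."""
    failing = [i for i, s in enumerate(scores) if s < threshold]
    return (not failing, failing)


if __name__ == "__main__":
    cases = [{"category": "simple"}, {"category": "simple"}]
    print(check_coverage(cases, {"simple", "complex"}))
    print(check_distribution([0.8, 0.8, 0.8]))
    print(quality_gate([0.9, 0.5, 0.75]))
```

Returning the failing indices rather than a bare boolean gives the feedback loop something to act on: those cases can be inspected, fixed, and re-run before the gate is tried again.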

Activation

37%

This description is essentially a list of trigger phrases without any explanation of what the skill actually does. While it excels at providing natural keywords users might say, it completely fails to describe the skill's capabilities, making it impossible for Claude to understand what actions this skill enables.

Suggestions

Add a clear 'what' statement at the beginning describing concrete capabilities (e.g., 'Creates evaluation rubrics, implements LLM-as-judge patterns, builds multi-dimensional scoring frameworks, and sets up quality gates for agent pipelines.')

Restructure to lead with capabilities, then follow with 'Use when...' clause containing the trigger terms

Include specific outputs or deliverables the skill produces (e.g., 'generates test suites', 'produces quality metrics', 'outputs evaluation reports')
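
The restructuring suggested above (capabilities first, then a "Use when..." clause) can be expressed as a small template. This is a sketch only: `build_description` is a hypothetical helper, and the capability and trigger strings in the usage example are taken from the suggestions and the current description rather than from any approved wording.

```python
def join_items(items: list[str], conjunction: str) -> str:
    """Join items as natural-language prose: 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return f" {conjunction} ".join(items)
    return ", ".join(items[:-1]) + f", {conjunction} {items[-1]}"


def build_description(capabilities: list[str], triggers: list[str]) -> str:
    """Lead with what the skill does, then state when it should be used."""
    if not capabilities or not triggers:
        raise ValueError("need at least one capability and one trigger")
    what = join_items(capabilities, "and")
    when = join_items([f'"{t}"' for t in triggers], "or")
    return f"{what[0].upper()}{what[1:]}. Use when the user asks to {when}."


if __name__ == "__main__":
    print(build_description(
        ["creates evaluation rubrics", "implements LLM-as-judge patterns",
         "sets up quality gates for agent pipelines"],
        ["evaluate agent performance", "measure agent quality"],
    ))
```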

Dimension	Reasoning	Score
Specificity	The description lacks concrete actions: it only lists trigger phrases without explaining what the skill actually does. There are no specific capabilities like 'creates rubrics', 'runs evaluations', or 'generates reports'.	1 / 3
Completeness	The description only addresses 'when' (trigger conditions) but completely omits 'what' - there is no explanation of what capabilities or actions this skill provides.	1 / 3
Trigger Term Quality	Excellent coverage of natural trigger terms users would say: 'evaluate agent performance', 'build test framework', 'measure agent quality', 'create evaluation rubrics', 'LLM-as-judge', 'agent testing', 'quality gates'.	3 / 3
Distinctiveness Conflict Risk	The trigger terms are fairly specific to the agent evaluation domain, but without a description of actual capabilities, it's unclear how this skill differs from general testing or evaluation skills. Terms like 'test framework' could overlap with other testing skills.	2 / 3
	Total	7 / 12 Passed

Reviewed 16 days ago
