`tessl i github:muratcankoylan/Agent-Skills-for-Context-Engineering --skill evaluation`

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.
Validation: 88%

| Criteria | Description | Result |
|---|---|---|
| metadata_version | 'metadata' field is not a dictionary | Warning |
| license_field | 'license' field is missing | Warning |
| Total | 14 / 16 Passed | |
Implementation: 57%

This skill provides a comprehensive overview of agent evaluation concepts with good organization and structure. However, it leans too heavily on explanation rather than actionable, executable guidance. The code examples are incomplete pseudocode, and the workflow lacks explicit validation checkpoints that would be critical for building reliable evaluation pipelines.
Suggestions
- Replace pseudocode with complete, executable evaluation code including actual rubric definitions and scoring logic
- Add explicit validation checkpoints to the evaluation workflow (e.g., 'Verify test set coverage before running', 'Check for score distribution anomalies after evaluation'); a checkpoint sketch appears after the dimension table below
- Remove explanatory paragraphs about why evaluation is hard; focus on concrete solutions Claude can implement
- Add a concrete example of an LLM-as-judge prompt template that can be directly used (see the sketch directly below this list)
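The first and last suggestions can be made concrete with a small amount of code: a rubric becomes executable once its level descriptions are written down as data, and the LLM-as-judge prompt is then just a template over that data. The sketch below is a minimal illustration only; `RUBRIC`, `JUDGE_PROMPT`, `score_dimension`, and `evaluate` are hypothetical names, the dimensions and weights are invented for the example, and the judge is assumed to be any callable that maps a prompt string to a model response string. It is not code taken from the skill under review.

```python
import json
from typing import Callable

# Illustrative rubric: the dimensions, level descriptions, and weights below
# are invented for this sketch, not taken from the skill under review.
RUBRIC = {
    "correctness": {
        "weight": 0.5,
        "levels": {
            1: "Output contradicts the reference or ignores the task.",
            2: "Output is partially correct but misses key requirements.",
            3: "Output fully satisfies the task and matches the reference intent.",
        },
    },
    "groundedness": {
        "weight": 0.5,
        "levels": {
            1: "Claims are unsupported by the provided context.",
            2: "Most claims are supported; a few details are not.",
            3: "Every claim is traceable to the provided context.",
        },
    },
}

# LLM-as-judge prompt template; the JSON-only instruction keeps parsing simple.
JUDGE_PROMPT = """You are grading an agent's output on one dimension.

Dimension: {dimension}
Scoring levels:
{levels}

Task given to the agent:
{task}

Agent output:
{output}

Respond with JSON only: {{"score": <1-3>, "reasoning": "<one sentence>"}}"""


def score_dimension(judge: Callable[[str], str], dimension: str,
                    task: str, output: str) -> dict:
    """Ask the judge model to score one rubric dimension; expects JSON back."""
    spec = RUBRIC[dimension]
    levels = "\n".join(f"{n}: {text}" for n, text in spec["levels"].items())
    prompt = JUDGE_PROMPT.format(dimension=dimension, levels=levels,
                                 task=task, output=output)
    return json.loads(judge(prompt))


def evaluate(judge: Callable[[str], str], task: str, output: str) -> dict:
    """Combine per-dimension judge scores into one weighted total."""
    results = {d: score_dimension(judge, d, task, output) for d in RUBRIC}
    total = sum(RUBRIC[d]["weight"] * results[d]["score"] for d in RUBRIC)
    return {"dimensions": results, "weighted_score": total}
```

Requiring JSON with a one-sentence reasoning keeps the scores easy to aggregate and, later, to sanity-check for distribution anomalies.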
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content includes some unnecessary explanatory text (e.g., explaining what non-determinism means, general descriptions of why evaluation matters) that Claude would already understand. The 95% finding table and core concepts are valuable, but sections like 'Evaluation Challenges' explain concepts at length rather than providing actionable guidance. | 2 / 3 |
| Actionability | The skill provides some concrete examples (test set structure, simple evaluation function) but much of the content is descriptive rather than executable. The code examples are incomplete pseudocode (e.g., 'load_rubric()', 'assess_dimension()' are undefined). The rubric descriptions mention levels but don't provide actual scoring implementations. | 2 / 3 |
| Workflow Clarity | The 'Building Evaluation Frameworks' section provides a numbered sequence, but lacks validation checkpoints and feedback loops. For a skill involving quality gates and continuous evaluation, there's no explicit guidance on what to do when evaluations fail or how to iterate on failing tests. | 2 / 3 |
| Progressive Disclosure | The content is well-organized with clear sections, a single reference to detailed metrics documentation, and appropriate use of headers. The structure moves from concepts to practical guidance to examples without deeply nested references. | 3 / 3 |
| Total | | 9 / 12 Passed |
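The 'Workflow Clarity' row and the second suggestion above both ask for explicit gates around an evaluation run: one before it (does the test set actually cover the scenarios that matter?) and one after it (does the score distribution look plausible?). The sketch below is a minimal illustration under assumed names; `check_coverage`, `check_score_distribution`, the `tags` field on test cases, and the 0.15 spread threshold are all hypothetical, not part of the skill.

```python
from collections import Counter
from statistics import pstdev


def check_coverage(test_set: list[dict], required_tags: set[str]) -> list[str]:
    """Pre-run gate: every required scenario tag must appear in the test set.

    Returns the missing tags; an empty list means the run may proceed."""
    seen = {tag for case in test_set for tag in case.get("tags", [])}
    return sorted(required_tags - seen)


def check_score_distribution(scores: list[float],
                             min_spread: float = 0.15) -> list[str]:
    """Post-run gate: flag distributions that suggest a broken judge or rubric."""
    if not scores:
        return ["no scores were produced"]
    problems = []
    if len(Counter(scores)) == 1:
        problems.append("all scores are identical; the judge may be ignoring the rubric")
    elif pstdev(scores) < min_spread:
        problems.append(
            f"score spread {pstdev(scores):.2f} is below {min_spread}; "
            "rubric levels may not be discriminating"
        )
    return problems
```

A pipeline would call `check_coverage` before running the judge and `check_score_distribution` on the collected scores afterwards, failing the run or routing it to human review whenever either returns problems; that addresses the "what to do when evaluations fail" gap noted above.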
Activation: 37%

This description is essentially a list of trigger phrases without any explanation of what the skill actually does. While it excels at providing natural keywords users might say, it completely fails to describe the skill's capabilities, making it impossible for Claude to understand what actions this skill enables.
Suggestions
- Add a clear 'what' statement at the beginning describing concrete capabilities (e.g., 'Creates evaluation rubrics, implements LLM-as-judge patterns, builds multi-dimensional scoring frameworks, and sets up quality gates for agent pipelines.')
- Restructure to lead with capabilities, then follow with a 'Use when...' clause containing the trigger terms (an illustrative combined wording follows this list)
- Include specific outputs or deliverables the skill produces (e.g., 'generates test suites', 'produces quality metrics', 'outputs evaluation reports')
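Taken together, and reusing the trigger phrases from the original description, the restructured version might read something like this (illustrative wording only, not a proposed final text): "Creates evaluation rubrics, implements LLM-as-judge patterns, builds multi-dimensional scoring frameworks, generates test suites, and sets up quality gates for agent pipelines. Use when the user asks to 'evaluate agent performance', 'build test framework', 'measure agent quality', 'create evaluation rubrics', or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates."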
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description lacks concrete actions - it only lists trigger phrases without explaining what the skill actually does. There are no specific capabilities like 'creates rubrics', 'runs evaluations', or 'generates reports'. | 1 / 3 |
| Completeness | The description only addresses 'when' (trigger conditions) but completely omits 'what' - there is no explanation of what capabilities or actions this skill provides. | 1 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'evaluate agent performance', 'build test framework', 'measure agent quality', 'create evaluation rubrics', 'LLM-as-judge', 'agent testing', 'quality gates'. | 3 / 3 |
| Distinctiveness / Conflict Risk | The trigger terms are fairly specific to the agent evaluation domain, but without describing actual capabilities, it's unclear how this differs from general testing or evaluation skills. Terms like 'test framework' could overlap with other testing skills. | 2 / 3 |
| Total | | 7 / 12 Passed |
Reviewed