
promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
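To make the artifacts this description names concrete, a minimal promptfooconfig.yaml combining a plain assertion with an llm-rubric (LLM-as-judge) check might look like the sketch below. The prompt text, model id, and test values are illustrative assumptions, not taken from the skill itself:

```yaml
# Illustrative minimal promptfooconfig.yaml (values are assumptions)
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o-mini   # any supported provider id works here

tests:
  - vars:
      text: "Large language models can be evaluated automatically with test suites."
    assert:
      - type: contains          # deterministic string check
        value: "evaluated"
      - type: llm-rubric        # LLM-as-judge grading
        value: "The summary is concise and factually accurate."
```

Running `promptfoo eval` in the same directory picks this file up by default.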

Overall score: 91 (1.59x)

- Quality: 88% (does it follow best practices?)
- Impact: 97% (1.59x); average score across 3 eval scenarios
- Security (by Snyk): Advisory. Suggest reviewing before use.


Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that hits all the marks. It provides specific concrete actions, includes explicit trigger keywords, clearly states both what the skill does and when to use it, and has highly distinctive terminology that minimizes conflict risk with other skills.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'setting up prompt testing', 'creating evaluation configs (promptfooconfig.yaml)', 'writing Python custom assertions', 'implementing llm-rubric for LLM-as-judge', and 'managing few-shot examples in prompts'. | 3 / 3 |
| Completeness | Clearly answers both what ('Configures and runs LLM evaluation using Promptfoo framework' plus specific actions) and when ('Use when...' clause with explicit triggers, plus 'Triggers on keywords like...' section). | 3 / 3 |
| Trigger Term Quality | Explicitly lists natural trigger keywords users would say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison'. Also includes technical but relevant terms like 'promptfooconfig.yaml' and 'llm-rubric'. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with specific tool name 'Promptfoo', specific file format 'promptfooconfig.yaml', and domain-specific terms like 'llm-rubric' and 'LLM-as-judge' that clearly distinguish it from generic testing or evaluation skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable skill with excellent executable examples and clear workflows. The main weakness is length: it tries to be both a quick reference and a comprehensive guide in one file, which hurts conciseness and progressive disclosure. The troubleshooting section, with specific gotchas such as maxConcurrency placement and relay API 401 errors, adds significant practical value.
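The maxConcurrency gotcha noted above concerns where the option lives in the config: Promptfoo reads it from the `evaluateOptions` block rather than the top level. A hedged sketch of the correct placement (the value 4 is an arbitrary example):

```yaml
# maxConcurrency goes under evaluateOptions, not at the config root
evaluateOptions:
  maxConcurrency: 4   # run at most 4 provider calls in parallel
```

The same limit can also be set per run with the `-j` flag on `promptfoo eval`.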

Suggestions

- Extract the 'Common Assertion Types' table, 'Advanced Few-Shot Implementation', and 'Long Text Handling' sections into separate reference files to improve progressive disclosure.
- Remove the overview sentence explaining what Promptfoo is: Claude knows this and it wastes tokens.
- Consider condensing the project structure example and the inline comments that explain obvious concepts.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is comprehensive but includes some unnecessary explanations (e.g., 'an open-source CLI tool for testing and comparing LLM outputs') and could be tightened in places. The content is mostly efficient but has verbose sections, like the full project structure, that could be condensed. | 2 / 3 |
| Actionability | Excellent executable examples throughout: complete YAML configs, working Python assertion functions, bash commands, and JSON prompt formats. All code is copy-paste ready with specific file paths and function signatures. | 3 / 3 |
| Workflow Clarity | Clear sequential workflows with explicit validation steps. The Quick Start provides a clear 3-step process, the troubleshooting section addresses common errors with solutions, and the echo provider section explicitly describes a preview-before-execute pattern. | 3 / 3 |
| Progressive Disclosure | Good structure with clear sections, but the document is quite long (~400 lines) with detailed content that could be split into separate reference files. Only one external reference (references/promptfoo_api.md) is mentioned at the end. The assertion types table and advanced patterns could be separate files. | 2 / 3 |
| Total | | 10 / 12 |

Passed
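The 'working Python assertion functions' credited in the Actionability row follow Promptfoo's Python assertion convention: the config references a file (`type: python`, `value: file://custom_assert.py`), and Promptfoo calls a `get_assert(output, context)` function that returns a bool, a float score, or a grading-result dict. A minimal sketch, where the required terms and file name are illustrative assumptions:

```python
# custom_assert.py -- referenced from promptfooconfig.yaml as:
#   assert:
#     - type: python
#       value: file://custom_assert.py
def get_assert(output, context):
    """Pass if the model output mentions both illustrative required terms."""
    required = ["summary", "score"]
    missing = [term for term in required if term not in output.lower()]
    if not missing:
        return True  # plain bool is accepted as pass/fail
    # A dict gives Promptfoo a score and a human-readable reason
    return {
        "pass": False,
        "score": 0.0,
        "reason": f"Output missing expected terms: {missing}",
    }
```

Returning the dict form is useful when you want the failure reason to show up in the eval report rather than a bare False.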

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: daymade/claude-code-skills (reviewed)
