Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
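For context on those terms, a minimal promptfooconfig.yaml pairing a test case with an llm-rubric (LLM-as-judge) assertion looks roughly like this. This is a sketch only; the provider id, prompt, and rubric wording are illustrative placeholders, not taken from the skill itself:

```yaml
# Minimal sketch of a promptfooconfig.yaml; provider, prompt, and
# rubric text are illustrative placeholders.
description: "One-sentence summarization eval"
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Promptfoo runs test cases against prompts and grades the outputs."
    assert:
      - type: llm-rubric
        value: "Is a single, accurate sentence that captures the main point"
```

Running `npx promptfoo@latest eval` against a file like this executes the test case and has a grader model score the output against the rubric.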
Install with Tessl CLI
npx tessl i github:fernandezbaptiste/claude-code-skills --skill promptfoo-evaluation
Does it follow best practices?
If you maintain this skill, you can automatically optimize it using the tessl CLI to improve its score:
npx tessl skill review --optimize ./path/to/skill
Discovery: 100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that hits all the marks. It provides specific concrete actions, includes explicit trigger keywords, clearly answers both what and when, and uses distinctive terminology that minimizes conflict risk with other skills. The description uses proper third-person voice throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'setting up prompt testing', 'creating evaluation configs (promptfooconfig.yaml)', 'writing Python custom assertions', 'implementing llm-rubric for LLM-as-judge', 'managing few-shot examples in prompts'. | 3 / 3 |
| Completeness | Clearly answers both what ('Configures and runs LLM evaluation using Promptfoo framework') and when ('Use when setting up prompt testing...') with explicit trigger guidance, including a 'Triggers on keywords' clause. | 3 / 3 |
| Trigger Term Quality | Explicitly lists natural trigger keywords users would say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison'. Also includes technical terms like 'promptfooconfig.yaml' and 'llm-rubric' that users familiar with the tool would use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive, with the specific tool name 'Promptfoo', the specific file format 'promptfooconfig.yaml', and domain-specific terms like 'llm-rubric' and 'LLM-as-judge' that clearly distinguish it from generic testing or evaluation skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation: 72%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, actionable skill with excellent executable examples covering the full Promptfoo workflow. The main weaknesses are some unnecessary explanatory content (the overview paragraph and the real-world example with local paths) and a lack of validation checkpoints in the workflows, which matters because LLM evaluations can be expensive API operations.
Suggestions
- Remove the Overview paragraph: Claude already knows what Promptfoo is, so start directly with Quick Start.
- Add a validation step before running the eval: point the config at the echo provider first to verify it, or run 'npx promptfoo@latest validate' if that subcommand is available (see the sketch below).
- Remove or generalize the 'Real-World Example' section: the local path (/Users/tiansheng/...) is not useful, and the structure is already shown elsewhere.
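On the validation-checkpoint suggestion, one cheap dry run is to point the config at Promptfoo's echo provider, which returns the rendered prompt verbatim. A sketch, assuming the installed Promptfoo version ships the echo provider:

```yaml
# Dry-run sketch: with the echo provider, `npx promptfoo@latest eval`
# exercises prompt templating, test vars, and deterministic assertions
# without any paid model calls.
providers:
  - echo  # swap back to the real provider once the config runs cleanly
```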
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Mostly efficient with good code examples, but includes some unnecessary sections, such as the 'Overview' paragraph explaining what Promptfoo is (Claude knows this) and the 'Real-World Example' section with a specific local path that adds little value. | 2 / 3 |
| Actionability | Excellent executable code throughout: complete YAML configs, Python assertion functions with proper return types (see the sketch after this table), bash commands, and JSON prompt formats. All examples are copy-paste ready with realistic patterns. | 3 / 3 |
| Workflow Clarity | The Quick Start provides a clear sequence, but the skill lacks explicit validation checkpoints. For example, there is no guidance on verifying config syntax before running expensive API calls, or on validating that Python assertions work before full evaluation runs. | 2 / 3 |
| Progressive Disclosure | Well structured, with clear sections progressing from Quick Start to Core Configuration to advanced patterns. References an external file (references/promptfoo_api.md) appropriately for detailed API docs, keeping the main skill focused. | 3 / 3 |
| Total | | 10 / 12 (Passed) |
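As a reference point for the Actionability row, a Promptfoo Python custom assertion is a file exposing get_assert(output, context), which may return a bool, a float score, or a GradingResult-style dict. A sketch, assuming Promptfoo's documented Python assertion interface; the file name and thresholds are illustrative:

```python
# Sketch of a Promptfoo Python assertion file (e.g. length_check.py).
# Promptfoo calls get_assert(output, context) and accepts a bool, a float
# score in [0, 1], or a GradingResult-style dict like the one returned here.
def get_assert(output: str, context) -> dict:
    """Pass if the model returned a single sentence under 200 characters."""
    ok = len(output) < 200 and output.count(".") <= 1
    reason = "single short sentence" if ok else f"too long or multi-sentence ({len(output)} chars)"
    return {"pass": ok, "score": 1.0 if ok else 0.0, "reason": reason}
```

Wire it into a test with an assertion of `type: python` whose `value` points at the file, e.g. `file://length_check.py`.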
Validation: 100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation for skill structure: 11 / 11 checks passed. No warnings or errors.
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.