promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

Quality

82%

Does it follow best practices?

Impact

97%

1.59x

Average score across 3 eval scenarios

Security

Advisory

Suggest reviewing before use

Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly identifies the tool (Promptfoo), lists specific concrete actions, provides explicit 'Use when' guidance with multiple trigger scenarios, and enumerates natural keywords. It uses proper third-person voice throughout and is concise without being vague. The description would allow Claude to confidently select this skill from a large pool without ambiguity.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: configuring and running LLM evaluation, setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, and managing few-shot examples in prompts.	3 / 3
Completeness	Clearly answers both 'what' (configures and runs LLM evaluation using Promptfoo framework, with specific sub-tasks listed) and 'when' (explicit 'Use when...' clause with trigger scenarios and a 'Triggers on keywords' clause listing specific terms).	3 / 3
Trigger Term Quality	Includes a strong set of natural trigger terms users would actually say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison', plus specific artifacts like 'promptfooconfig.yaml', 'llm-rubric', and 'custom assertions'.	3 / 3
Distinctiveness / Conflict Risk	Highly distinctive with a clear niche around the Promptfoo framework specifically. Terms like 'promptfoo', 'promptfooconfig.yaml', 'llm-rubric', and 'LLM-as-judge' are very specific and unlikely to conflict with other skills.	3 / 3
	Total	12 / 12 Passed

Implementation

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a solid, highly actionable skill with excellent concrete examples covering Promptfoo configuration, custom assertions, and common gotchas. Its main weaknesses are its length (several advanced sections, namely long text handling, the real-world example, and advanced few-shot, belong in referenced files rather than inline) and the lack of explicit validation checkpoints in the evaluation workflow. The troubleshooting section captures real pain points (maxConcurrency placement, relay 401 errors), which adds genuine value.

Suggestions

Move the 'Long Text Handling', 'Advanced Few-Shot Implementation', and 'Real-World Example' sections into separate referenced files to reduce the main SKILL.md to a concise overview with pointers

Add an explicit workflow section with validation checkpoints: e.g., '1. Write config → 2. Preview with echo provider → 3. Run single test case → 4. Check results → 5. Run full eval'

Remove the overview sentence explaining what Promptfoo is ('an open-source CLI tool for testing and comparing LLM outputs'); Claude does not need this context
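The checkpointed workflow proposed in the second suggestion can be pictured as a runner that executes each step in order and stops at the first failure. The sketch below is only an illustration: `run_checkpoints` and the stub checks are hypothetical names, and in a real setup each check would wrap the corresponding Promptfoo step (not shown here).

```python
from typing import Callable, List, Optional, Tuple

# A checkpoint is a (name, check) pair; check() returns True when the step passed.
Checkpoint = Tuple[str, Callable[[], bool]]


def run_checkpoints(checkpoints: List[Checkpoint]) -> Tuple[List[str], Optional[str]]:
    """Run checkpoints in order and stop at the first one that fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, check in checkpoints:
        if not check():
            # Stop early so later, more expensive steps never run on a bad config.
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stubs standing in for the five steps named in the suggestion.
workflow: List[Checkpoint] = [
    ("write config", lambda: True),
    ("preview with echo provider", lambda: True),
    ("run single test case", lambda: False),  # simulate a failing smoke test
    ("check results", lambda: True),
    ("run full eval", lambda: True),
]

completed, failed_at = run_checkpoints(workflow)
print("completed:", completed)
print("failed at:", failed_at)
```

Stopping at the first failed checkpoint is the point of the design: the full eval, which is the costly step, only runs once the cheaper checks have passed.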

Dimension	Reasoning	Score
Conciseness	The skill is fairly comprehensive but includes some unnecessary verbosity: sections like 'Long Text Handling' with a Chinese content curation example and the 'Real-World Example' project structure add bulk that could live in referenced files. Some explanations (e.g., 'open-source CLI tool for testing and comparing LLM outputs') are unnecessary for Claude. However, most content is practical configuration and code rather than conceptual padding.	2 / 3
Actionability	The skill provides fully executable YAML configs, Python assertion code, bash commands, and JSON prompt templates that are copy-paste ready. Every section includes concrete, working examples with specific syntax rather than pseudocode or vague descriptions.	3 / 3
Workflow Clarity	The Quick Start provides a clear 3-step sequence, and the echo provider preview-before-run pattern is a good validation checkpoint. However, the overall evaluation workflow lacks explicit validation steps: there is no 'validate your config before running' step and no feedback loop for when an eval fails, and the troubleshooting section is disconnected from the workflow rather than integrated as checkpoints.	2 / 3
Progressive Disclosure	The skill references `references/promptfoo_api.md` and a `./tiaogaoren/` example project, showing awareness of progressive disclosure. However, the bundle has no files, so these references are unverifiable. More importantly, the SKILL.md itself is quite long (300+ lines), with sections like the full real-world example, long text handling, and advanced few-shot patterns that could be split into separate referenced files rather than inlined.	2 / 3
	Total	9 / 12 Passed
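
For readers unfamiliar with the Python custom assertions credited in the Actionability row, here is a minimal sketch of what such a file might look like. It assumes Promptfoo's documented convention that a Python assertion file exposes `get_assert(output, context)` and may return a dict with `pass`, `score`, and `reason`; check this against the Promptfoo documentation for your version. The `required_keywords` test variable is hypothetical and is not taken from the reviewed skill.

```python
from typing import Any, Dict, List


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the model output mentions every required keyword."""
    # 'required_keywords' is a hypothetical variable supplied by the test case.
    keywords: List[str] = context.get("vars", {}).get("required_keywords", [])
    lowered = output.lower()
    missing = [k for k in keywords if k.lower() not in lowered]

    # Score is the fraction of keywords found; an empty keyword list scores 1.0.
    score = 1.0 if not keywords else (len(keywords) - len(missing)) / len(keywords)
    reason = "all keywords present" if not missing else "missing: " + ", ".join(missing)
    return {"pass": not missing, "score": score, "reason": reason}


if __name__ == "__main__":
    # Local smoke test, independent of Promptfoo itself.
    sample_context = {"vars": {"required_keywords": ["Paris", "France"]}}
    print(get_assert("Paris is the capital of France.", sample_context))
```

Returning a `reason` alongside the pass/fail flag supports the feedback loop the Workflow Clarity row asks for, since a failed eval then says what was missing.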

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: daymade/claude-code-skills
Commit: bbf87f3

Reviewed: 1 day ago
