Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
88
82%
Does it follow best practices?
Impact
97%
1.59x
Average score across 3 eval scenarios
Advisory
Suggest reviewing before use
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly identifies the tool (Promptfoo), lists specific concrete actions, provides explicit 'Use when' guidance with multiple trigger scenarios, and enumerates natural keywords. It uses proper third-person voice throughout and is concise without being vague. The description would allow Claude to confidently select this skill from a large pool without ambiguity.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: configuring and running LLM evaluation, setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, and managing few-shot examples in prompts. | 3 / 3 |
| Completeness | Clearly answers both 'what' (configures and runs LLM evaluation using Promptfoo framework, with specific sub-tasks listed) and 'when' (explicit 'Use when...' clause with trigger scenarios and a 'Triggers on keywords' clause listing specific terms). | 3 / 3 |
| Trigger Term Quality | Includes a strong set of natural trigger terms users would actually say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison', plus specific artifacts like 'promptfooconfig.yaml', 'llm-rubric', and 'custom assertions'. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with a clear niche around the Promptfoo framework specifically. Terms like 'promptfoo', 'promptfooconfig.yaml', 'llm-rubric', and 'LLM-as-judge' are very specific and unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
64%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, highly actionable skill with excellent concrete examples covering Promptfoo configuration, custom assertions, and common gotchas. Its main weaknesses are length—several advanced sections (long text handling, real-world example, advanced few-shot) should be in referenced files rather than inline—and the lack of explicit validation checkpoints in the evaluation workflow. The troubleshooting section captures real pain points (maxConcurrency placement, relay 401 errors) which adds genuine value.
Suggestions
Move the 'Long Text Handling', 'Advanced Few-Shot Implementation', and 'Real-World Example' sections into separate referenced files to reduce the main SKILL.md to a concise overview with pointers
Add an explicit workflow section with validation checkpoints: e.g., '1. Write config → 2. Preview with echo provider → 3. Run single test case → 4. Check results → 5. Run full eval'
Remove the overview sentence explaining what Promptfoo is ('an open-source CLI tool for testing and comparing LLM outputs')—Claude doesn't need this context
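The checkpointed workflow suggested above could be sketched as a small Python driver that runs each stage in order and stops at the first failure. This is only an illustration under stated assumptions: the helper names (`build_steps`, `run_steps`) and the preview config file name `promptfooconfig.preview.yaml` (a copy of the config whose provider is `echo`) are hypothetical, and a nonzero exit code from `promptfoo eval` is assumed to mean the checkpoint failed.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, List[str]]


def build_steps(
    config: str = "promptfooconfig.yaml",
    preview_config: str = "promptfooconfig.preview.yaml",  # hypothetical file name
) -> List[Step]:
    """Return the checkpoint commands in order; nothing is executed here."""
    return [
        # Checkpoint 1: render prompts through the echo provider (no model calls).
        ("preview with echo provider", ["promptfoo", "eval", "-c", preview_config]),
        # Checkpoint 2: run only the first test case against the real provider.
        ("run single test case", ["promptfoo", "eval", "-c", config, "--filter-first-n", "1"]),
        # Checkpoint 3: run the full evaluation.
        ("run full eval", ["promptfoo", "eval", "-c", config]),
    ]


def run_steps(steps: List[Step], runner: Callable = subprocess.run) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, cmd in steps:
        result = runner(cmd)
        # Assumption: a nonzero exit code means this checkpoint did not pass.
        if result.returncode != 0:
            return name
    return None


if __name__ == "__main__":
    failed = run_steps(build_steps())
    print("all checkpoints passed" if failed is None else f"stopped at: {failed}")
```

Passing `runner` as a parameter keeps the sketch testable without the CLI installed; in real use the default `subprocess.run` is enough.
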
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is fairly comprehensive but includes some unnecessary verbosity: sections like 'Long Text Handling' with a Chinese content curation example and the 'Real-World Example' project structure add bulk that could be in referenced files. Some explanations (e.g., 'open-source CLI tool for testing and comparing LLM outputs') are unnecessary for Claude. However, most content is practical configuration and code rather than conceptual padding. | 2 / 3 |
| Actionability | The skill provides fully executable YAML configs, Python assertion code, bash commands, and JSON prompt templates that are copy-paste ready. Every section includes concrete, working examples with specific syntax rather than pseudocode or vague descriptions. | 3 / 3 |
| Workflow Clarity | The Quick Start provides a clear 3-step sequence, and the echo provider preview-before-run pattern is a good validation checkpoint. However, the overall evaluation workflow lacks explicit validation steps: there's no 'validate your config before running' step, no feedback loop for when eval fails, and the troubleshooting section is disconnected from the workflow rather than integrated as checkpoints. | 2 / 3 |
| Progressive Disclosure | The skill references `references/promptfoo_api.md` and a `./tiaogaoren/` example project, showing awareness of progressive disclosure. However, the bundle has no files, so these references are unverifiable. More importantly, the SKILL.md itself is quite long (~300+ lines) with sections like the full real-world example, long text handling, and advanced few-shot patterns that could be split into separate referenced files rather than inlined. | 2 / 3 |
| Total | | 9 / 12 Passed |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
bbf87f3