Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
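The core artifacts the description names can be illustrated with a minimal config sketch. Note this is a hedged example, not taken from the skill itself: the provider ID, prompt, and rubric wording are placeholders.

```yaml
# Illustrative promptfooconfig.yaml — provider and rubric text are placeholders
prompts:
  - "Summarize the following text in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Promptfoo is a CLI for testing and comparing LLM outputs."
    assert:
      - type: llm-rubric
        value: "The summary is a single, accurate sentence."
```

The `llm-rubric` assertion is promptfoo's built-in LLM-as-judge mechanism: a grader model scores the output against the rubric text.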
Overall score: 88

- Quality: 82% (Does it follow best practices?)
- Impact: 97%, 1.59x average score across 3 eval scenarios
- Advisory: suggest reviewing before use
## Discovery (100%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly identifies the tool (Promptfoo), lists specific concrete actions, provides explicit 'Use when' guidance with multiple trigger scenarios, and enumerates natural keywords. It uses proper third-person voice throughout and is both comprehensive and concise without unnecessary padding.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: configures and runs LLM evaluation, setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, managing few-shot examples in prompts. | 3 / 3 |
| Completeness | Clearly answers both 'what' (configures and runs LLM evaluation using Promptfoo framework, with specific sub-tasks listed) and 'when' (explicit 'Use when...' clause with trigger scenarios and a 'Triggers on keywords like...' clause listing specific terms). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms including 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison', 'promptfooconfig.yaml', 'llm-rubric', 'custom assertions', and 'few-shot examples'. These cover both tool-specific and general terms users would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive due to the specific tool name 'Promptfoo', specific file format 'promptfooconfig.yaml', and domain-specific concepts like 'llm-rubric' and 'LLM-as-judge'. Very unlikely to conflict with other skills. | 3 / 3 |
| **Total** | | **12 / 12 — Passed** |
## Implementation (64%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a highly actionable and practical Promptfoo skill with excellent concrete examples, real gotchas (like maxConcurrency placement and relay provider inheritance), and executable code throughout. Its main weaknesses are length — several advanced sections could be offloaded to reference files — and the lack of an integrated end-to-end workflow with explicit validation checkpoints guiding the user from setup through verified results.
### Suggestions

- Add an explicit end-to-end workflow section (e.g., '1. Init → 2. Configure → 3. Preview with echo provider → 4. Verify prompts render correctly → 5. Run eval → 6. Review results') with validation gates between steps.
- Move 'Long Text Handling', 'Advanced Few-Shot Implementation', and 'Real-World Example' sections into separate reference files and link to them from the main skill, keeping SKILL.md as a concise overview.
- Remove the introductory description of Promptfoo ('an open-source CLI tool for testing and comparing LLM outputs') — Claude doesn't need this context.
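The echo-provider preview mentioned in the suggestions can be sketched as a temporary override. This is an assumption-laden fragment, not the skill's own config: promptfoo's `echo` provider returns the rendered prompt as-is, so running an eval against it lets you check variable substitution before spending tokens on a real model.

```yaml
# Temporary preview config: swap real providers for `echo` so that
# `promptfoo eval` renders prompts without calling any model.
providers:
  - echo
tests:
  - vars:
      topic: "LLM evaluation"   # placeholder variable for illustration
```

Once the rendered prompts look right, restore the real provider list and run the eval for scored results.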
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is fairly comprehensive but includes some sections that could be trimmed or consolidated. The 'Long Text Handling' section with Chinese-specific content and the 'Real-World Example' project structure are somewhat niche and add bulk. Some explanatory text like 'an open-source CLI tool for testing and comparing LLM outputs' is unnecessary for Claude. However, most content earns its place with practical gotchas and concrete configs. | 2 / 3 |
| Actionability | Excellent actionability throughout — nearly every section includes copy-paste ready YAML configs, executable Python code, and specific CLI commands. The custom assertion examples are complete with proper function signatures and return types. The troubleshooting section provides concrete solutions to specific problems. | 3 / 3 |
| Workflow Clarity | The Quick Start provides a clear 3-step sequence, and individual sections are well-organized. However, there's no overarching workflow with validation checkpoints for the full evaluation setup process (e.g., 'validate config → preview with echo → run eval → check results'). The echo provider preview is mentioned but not integrated into a recommended workflow with explicit validation gates. | 2 / 3 |
| Progressive Disclosure | There's a reference to 'references/promptfoo_api.md' and a 'See ./tiaogaoren/' pointer, which is good. However, the skill itself is quite long (~300+ lines) with detailed sections on long text handling, advanced few-shot, and relay configuration that could be split into separate reference files. The inline content is heavy for a SKILL.md overview. | 2 / 3 |
| **Total** | | **9 / 12 — Passed** |
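The custom-assertion pattern praised in the Actionability row can be sketched as follows. This is a minimal illustration, assuming promptfoo's Python assertion convention (a `get_assert(output, context)` function referenced via `type: python` / `value: file://...` in the config); the length threshold and file name are hypothetical.

```python
# Hypothetical assert_length.py — a promptfoo Python custom assertion.
# promptfoo accepts a bool, a float score, or a GradingResult-style dict.
def get_assert(output: str, context: dict) -> dict:
    """Pass when the model output is non-empty and at most 500 words."""
    words = output.split()
    passed = 0 < len(words) <= 500
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Output has {len(words)} words",
    }
```

Returning a dict rather than a bare bool lets the eval report carry a per-test score and a human-readable reason.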
## Validation (100%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed (skill structure validation). No warnings or errors.
Commit: `80e94fd`