Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
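As a sketch of the kind of config this skill helps produce (the model ID, prompt, and rubric text below are illustrative assumptions, not values prescribed by the skill):

```yaml
# promptfooconfig.yaml — illustrative sketch; model ID and rubric wording are assumptions
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      text: "Promptfoo is a framework for testing and evaluating LLM prompts."
    assert:
      - type: contains
        value: "Promptfoo"
      - type: llm-rubric   # LLM-as-judge assertion graded by a model
        value: "The summary is a single sentence and faithful to the source text."
```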
Install with Tessl CLI
npx tessl i github:daymade/claude-code-skills --skill promptfoo-evaluation

Overall score: 86%
Does it follow best practices?
If you maintain this skill, you can automatically optimize it using the tessl CLI to improve its score:
npx tessl skill review --optimize ./path/to/skill
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that hits all the marks. It provides specific concrete actions, includes explicit trigger keywords, clearly states both what the skill does and when to use it, and has highly distinctive terminology that prevents conflicts with other skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'setting up prompt testing', 'creating evaluation configs (promptfooconfig.yaml)', 'writing Python custom assertions', 'implementing llm-rubric for LLM-as-judge', and 'managing few-shot examples in prompts'. | 3 / 3 |
| Completeness | Clearly answers both what ('Configures and runs LLM evaluation using Promptfoo framework') and when ('Use when setting up prompt testing...') with an explicit 'Triggers on keywords' clause providing clear activation guidance. | 3 / 3 |
| Trigger Term Quality | Explicitly lists natural trigger keywords users would say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison'. Also includes technical terms like 'promptfooconfig.yaml' and 'llm-rubric' that users familiar with the tool would use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with the specific tool name 'Promptfoo', unique file format 'promptfooconfig.yaml', and specialized concepts like 'llm-rubric' and 'LLM-as-judge' that clearly differentiate it from generic testing or evaluation skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation: 73%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, actionable skill with excellent code examples covering the full Promptfoo workflow. The main weaknesses are some unnecessary explanatory content (the overview paragraph, the specific local path example) and missing explicit validation checkpoints for catching configuration errors before expensive API calls.
Suggestions
- Remove the Overview paragraph - Claude knows what Promptfoo is; start directly with Quick Start
- Add a validation workflow section, e.g., '1. Run with echo provider to verify prompts, 2. Run with --dry-run if available, 3. Run a single test case first, 4. Run the full eval'
- Remove or generalize the 'Real-World Example' section - the specific local path (/Users/tiansheng/...) is not useful, and the pattern is already covered in earlier sections
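The first step of that suggested workflow can be sketched as a config fragment. Promptfoo's `echo` provider returns the rendered prompt as the "model output", so prompts and deterministic assertions can be checked without API cost (the test values below are illustrative assumptions):

```yaml
# promptfooconfig.yaml — validation sketch: temporarily swap real providers for `echo`
# so prompt rendering, variable substitution, and assertions all run with no API calls.
providers:
  - echo   # echoes the rendered prompt back as the output

tests:
  - vars:
      text: "sample input"
    assert:
      - type: contains   # deterministic asserts still execute against the echoed prompt
        value: "sample input"
```

Once the echoed prompts look right, switch the provider back and run a single test case before the full eval.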
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient with good code examples, but includes some unnecessary sections like the 'Overview' paragraph explaining what Promptfoo is (Claude knows this), and the 'Real-World Example' section with a specific local path that adds little value. | 2 / 3 |
| Actionability | Excellent executable code throughout - complete YAML configs, Python assertion functions with proper return types, bash commands, and JSON prompt formats. All examples are copy-paste ready with realistic values. | 3 / 3 |
| Workflow Clarity | The Quick Start provides a clear sequence, and the echo provider section shows a good preview-before-production pattern. However, there is no explicit validation workflow for catching config errors before running expensive API calls, and the troubleshooting section is reactive rather than preventive. | 2 / 3 |
| Progressive Disclosure | Well-structured with clear sections progressing from Quick Start to Core Configuration to Advanced patterns. References an external file (references/promptfoo_api.md) appropriately for detailed API docs, keeping the main skill focused on practical usage. | 3 / 3 |
| Total | | 10 / 12 Passed |
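The copy-paste-ready Python assertions praised above follow Promptfoo's `get_assert` convention, where the function returns a bool, a float, or a GradingResult-style dict. A minimal sketch (the word-count heuristic and file name are illustrative assumptions):

```python
# custom_assert.py — illustrative Promptfoo Python assertion.
# Promptfoo calls get_assert(output, context); the length check is an assumption.
def get_assert(output: str, context) -> dict:
    word_count = len(output.split())
    passed = 0 < word_count <= 50
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Output has {word_count} words (limit: 50)",
    }
```

It would be referenced from the config as `type: python` with `value: file://custom_assert.py`.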
Validation: 88%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 14 / 16 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| metadata_version | 'metadata' field is not a dictionary | Warning |
| license_field | 'license' field is missing | Warning |
| Total | 14 / 16 Passed | |