promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
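For orientation, a minimal promptfooconfig.yaml of the kind this skill helps produce might look like the sketch below; the prompt, provider, test variable, and rubric text are illustrative placeholders rather than content from the skill itself.

```yaml
# Minimal illustrative promptfooconfig.yaml; every value is a placeholder.
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      text: "Promptfoo runs test cases against one or more LLM providers."
    assert:
      # llm-rubric delegates grading to an LLM-as-judge.
      - type: llm-rubric
        value: "The summary is accurate and no longer than one sentence."
```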

Overall score: 88

Quality: 82%
Does it follow best practices?

Impact: 97% (1.59x)
Average score across 3 eval scenarios

Security (by Snyk): Advisory
Suggest reviewing before use


Quality

Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly identifies the tool (Promptfoo), lists specific concrete actions, provides explicit 'Use when' guidance with multiple trigger scenarios, and enumerates natural keywords. It uses proper third-person voice throughout and is both comprehensive and concise without unnecessary padding.

Dimension scores:

Specificity: 3 / 3
Lists multiple specific concrete actions: configures and runs LLM evaluation, setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, and managing few-shot examples in prompts.

Completeness: 3 / 3
Clearly answers both 'what' (configures and runs LLM evaluation using the Promptfoo framework, with specific sub-tasks listed) and 'when' (an explicit 'Use when...' clause with trigger scenarios and a 'Triggers on keywords like...' clause listing specific terms).

Trigger Term Quality: 3 / 3
Excellent coverage of natural trigger terms, including 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison', 'promptfooconfig.yaml', 'llm-rubric', 'custom assertions', and 'few-shot examples'. These cover both tool-specific and general terms users would naturally use.

Distinctiveness / Conflict Risk: 3 / 3
Highly distinctive due to the specific tool name 'Promptfoo', the specific file format 'promptfooconfig.yaml', and domain-specific concepts like 'llm-rubric' and 'LLM-as-judge'. Very unlikely to conflict with other skills.

Total: 12 / 12 (Passed)

Implementation: 64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a highly actionable and practical Promptfoo skill with excellent concrete examples, real gotchas (like maxConcurrency placement and relay provider inheritance), and executable code throughout. Its main weaknesses are length — several advanced sections could be offloaded to reference files — and the lack of an integrated end-to-end workflow with explicit validation checkpoints guiding the user from setup through verified results.
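As a point of reference for the maxConcurrency gotcha mentioned above (the exact pitfall the skill documents isn't quoted here): in promptfoo's config schema, maxConcurrency generally lives under top-level evaluateOptions rather than inside individual tests. A minimal sketch, assuming the current schema:

```yaml
# Sketch only: concurrency is an evaluate-time option, so it sits under
# evaluateOptions at the top level of promptfooconfig.yaml.
evaluateOptions:
  maxConcurrency: 4
```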

Suggestions

Add an explicit end-to-end workflow section (e.g., '1. Init → 2. Configure → 3. Preview with echo provider → 4. Verify prompts render correctly → 5. Run eval → 6. Review results') with validation gates between steps (see the sketch after this list).

Move 'Long Text Handling', 'Advanced Few-Shot Implementation', and 'Real-World Example' sections into separate reference files and link to them from the main skill, keeping SKILL.md as a concise overview.

Remove the introductory description of Promptfoo ('an open-source CLI tool for testing and comparing LLM outputs') — Claude doesn't need this context.
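A rough sketch of what that gated workflow could look like, using only promptfoo's standard CLI commands (promptfoo init, promptfoo eval, promptfoo view); the echo-provider preview assumes the config's providers list is temporarily swapped, as the comments describe.

```bash
# Hypothetical gated workflow; the commands are promptfoo's standard CLI.
promptfoo init                    # 1. scaffold promptfooconfig.yaml
"$EDITOR" promptfooconfig.yaml    # 2. configure prompts, providers, tests
# 3. Preview: temporarily set `providers: [echo]` in the config so the
#    eval returns rendered prompts instead of calling a real model.
promptfoo eval                    # 4. verify every prompt renders correctly
# 5. Restore the real providers, then run the actual evaluation.
promptfoo eval                    # 6. run the eval
promptfoo view                    # 7. review results in the local web UI
```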

Dimension scores:

Conciseness: 2 / 3
The skill is fairly comprehensive but includes some sections that could be trimmed or consolidated. The 'Long Text Handling' section with Chinese-specific content and the 'Real-World Example' project structure are somewhat niche and add bulk. Some explanatory text, like 'an open-source CLI tool for testing and comparing LLM outputs', is unnecessary for Claude. However, most content earns its place with practical gotchas and concrete configs.

Actionability: 3 / 3
Excellent actionability throughout — nearly every section includes copy-paste-ready YAML configs, executable Python code, and specific CLI commands. The custom assertion examples are complete, with proper function signatures and return types (a sketch of one follows this table). The troubleshooting section provides concrete solutions to specific problems.

Workflow Clarity: 2 / 3
The Quick Start provides a clear 3-step sequence, and individual sections are well organized. However, there is no overarching workflow with validation checkpoints for the full evaluation setup process (e.g., 'validate config → preview with echo → run eval → check results'). The echo provider preview is mentioned but not integrated into a recommended workflow with explicit validation gates.

Progressive Disclosure: 2 / 3
There is a reference to 'references/promptfoo_api.md' and a 'See ./tiaogaoren/' pointer, which is good. However, the skill itself is quite long (300+ lines), with detailed sections on long text handling, advanced few-shot prompting, and relay configuration that could be split into separate reference files. The inline content is heavy for a SKILL.md overview.

Total: 9 / 12 (Passed)
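To make the Actionability point above concrete, here is a minimal sketch of a promptfoo Python custom assertion following the get_assert(output, context) convention; the keyword check and the `keyword` test variable are hypothetical, not taken from the skill.

```python
# assert_keyword.py -- hypothetical custom assertion. Promptfoo looks for a
# get_assert(output, context) function, which may return a bool, a float
# score, or a GradingResult-style dict like the one below.
def get_assert(output: str, context) -> dict:
    # `context` is a dict carrying the test's vars; `keyword` is an
    # assumed variable name for this sketch.
    expected = (context.get("vars") or {}).get("keyword", "")
    found = bool(expected) and expected.lower() in output.lower()
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": f"keyword {expected!r} {'found' if found else 'missing'}",
    }
```

In promptfooconfig.yaml this would be wired up as an assertion of type python with value file://assert_keyword.py (the filename being hypothetical).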

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 checks passed

Validation for skill structure: no warnings or errors.

Repository: daymade/claude-code-skills (reviewed)
