
promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

88 (1.59x)
Quality: 82%. Does it follow best practices?
Impact: 97% (1.59x). Average score across 3 eval scenarios

Security by Snyk: Advisory. Suggest reviewing before use.


Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly identifies the tool (Promptfoo), lists specific concrete actions, provides explicit 'Use when' guidance with multiple trigger scenarios, and enumerates natural keywords. It uses proper third-person voice throughout and is concise without being vague. The description would allow Claude to confidently select this skill from a large pool without ambiguity.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: configuring and running LLM evaluation, setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, and managing few-shot examples in prompts. | 3 / 3 |
| Completeness | Clearly answers both 'what' (configures and runs LLM evaluation using Promptfoo framework, with specific sub-tasks listed) and 'when' (explicit 'Use when...' clause with trigger scenarios and a 'Triggers on keywords' clause listing specific terms). | 3 / 3 |
| Trigger Term Quality | Includes a strong set of natural trigger terms users would actually say: 'promptfoo', 'eval', 'LLM evaluation', 'prompt testing', 'model comparison', plus specific artifacts like 'promptfooconfig.yaml', 'llm-rubric', and 'custom assertions'. | 3 / 3 |
| Distinctiveness Conflict Risk | Highly distinctive with a clear niche around the Promptfoo framework specifically. Terms like 'promptfoo', 'promptfooconfig.yaml', 'llm-rubric', and 'LLM-as-judge' are very specific and unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a solid, highly actionable skill with excellent concrete examples covering Promptfoo configuration, custom assertions, and common gotchas. It has two main weaknesses. The first is length: several advanced sections (long text handling, real-world example, advanced few-shot) should be in referenced files rather than inline. The second is the lack of explicit validation checkpoints in the evaluation workflow. The troubleshooting section captures real pain points (maxConcurrency placement, relay 401 errors), which adds genuine value.

Suggestions

Move the 'Long Text Handling', 'Advanced Few-Shot Implementation', and 'Real-World Example' sections into separate referenced files to reduce the main SKILL.md to a concise overview with pointers

Add an explicit workflow section with validation checkpoints: e.g., '1. Write config → 2. Preview with echo provider → 3. Run single test case → 4. Check results → 5. Run full eval'

Remove the overview sentence explaining what Promptfoo is ('an open-source CLI tool for testing and comparing LLM outputs'); Claude doesn't need this context
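The checkpointed workflow proposed in the suggestions could be sketched roughly as follows. This is an illustration only, not code from the skill. It assumes the `promptfoo` CLI is on the PATH, that `promptfoo eval` accepts `-c` and `--filter-first-n`, and that a non-zero exit code means a checkpoint failed. The file name `promptfooconfig.echo.yaml` is hypothetical and stands for a copy of the config that uses the echo provider.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# A step is a human-readable name plus the CLI command that implements it.
Step = Tuple[str, List[str]]


def build_steps(
    config: str = "promptfooconfig.yaml",
    preview_config: str = "promptfooconfig.echo.yaml",  # hypothetical file name
) -> List[Step]:
    """Return the eval stages in the order they should be attempted."""
    return [
        # Cheap preview: the echo-provider config returns the rendered prompt.
        ("preview with echo provider", ["promptfoo", "eval", "-c", preview_config]),
        # Smoke test: run only the first test case against the real provider.
        ("run single test case",
         ["promptfoo", "eval", "-c", config, "--filter-first-n", "1"]),
        # Full run, reached only if both checkpoints above succeeded.
        ("run full eval", ["promptfoo", "eval", "-c", config]),
    ]


def run_cli(command: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_with_checkpoints(
    steps: Sequence[Step],
    runner: Callable[[Sequence[str]], int] = run_cli,
) -> List[str]:
    """Run steps in order, stopping at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, command in steps:
        if runner(command) != 0:  # assumption: non-zero means "stop and inspect"
            break
        completed.append(name)
    return completed
```

Passing the runner as a parameter keeps the sketch testable without calling the real CLI. In practice, `run_with_checkpoints(build_steps())` would stop before the full eval whenever the preview or the single-case run fails.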

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly comprehensive but includes some unnecessary verbosity: sections like 'Long Text Handling' with a Chinese content curation example and the 'Real-World Example' project structure add bulk that could be in referenced files. Some explanations (e.g., 'open-source CLI tool for testing and comparing LLM outputs') are unnecessary for Claude. However, most content is practical configuration and code rather than conceptual padding. | 2 / 3 |
| Actionability | The skill provides fully executable YAML configs, Python assertion code, bash commands, and JSON prompt templates that are copy-paste ready. Every section includes concrete, working examples with specific syntax rather than pseudocode or vague descriptions. | 3 / 3 |
| Workflow Clarity | The Quick Start provides a clear 3-step sequence, and the echo provider preview-before-run pattern is a good validation checkpoint. However, the overall evaluation workflow lacks explicit validation steps: there's no 'validate your config before running' step, no feedback loop for when eval fails, and the troubleshooting section is disconnected from the workflow rather than integrated as checkpoints. | 2 / 3 |
| Progressive Disclosure | The skill references `references/promptfoo_api.md` and a `./tiaogaoren/` example project, showing awareness of progressive disclosure. However, the bundle has no files, so these references are unverifiable. More importantly, the SKILL.md itself is quite long (~300+ lines) with sections like the full real-world example, long text handling, and advanced few-shot patterns that could be split into separate referenced files rather than inlined. | 2 / 3 |
| Total | | 9 / 12 |

Passed
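For readers who have not seen the kind of Python assertion code this review refers to, a minimal sketch follows. It is not taken from the skill. It assumes Promptfoo's documented convention that a Python assertion file exposes `get_assert(output, context)` and may return a dict with `pass`, `score`, and `reason`. The variable name `expected_keyword` is hypothetical.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the model output mentions a keyword taken from the test vars."""
    # The test case's variables are read from context["vars"].
    variables = context.get("vars", {})
    keyword = str(variables.get("expected_keyword", ""))  # hypothetical var name

    # Case-insensitive substring check; an empty keyword never passes.
    found = bool(keyword) and keyword.lower() in output.lower()

    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": f"keyword {keyword!r} {'found' if found else 'not found'} in output",
    }
```

Returning a dict rather than a bare boolean lets the `reason` field explain a failure in the results view, which is useful when the feedback loop the review asks for depends on understanding why a test case failed.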

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: daymade/claude-code-skills (Reviewed)
