promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

88

1.59x
Quality

Does it follow best practices?

Impact

97%

1.59x

Average score across 3 eval scenarios

Security by Snyk

Advisory

Suggest reviewing before use

Quality

Content

65%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable with complete, executable examples, but suffers from redundant repetition of a few gotchas and keeps most detail inline rather than splitting it into references. Workflow sequencing is present but lacks an explicit validation feedback loop.

Suggestions

Consolidate the maxConcurrency/commandLineOptions and llm-rubric relay/provider gotchas into a single canonical location to remove the repeated explanations.

Move the long-form sections (e.g., Long Text Handling, Advanced Few-Shot, real-world tiaogaoren example) into references/ files and link to them from SKILL.md to improve progressive disclosure.

Add an explicit validate->fix->retry checkpoint for the eval workflow, e.g., run the echo provider to validate prompt rendering, then run eval, then inspect results and re-run on failures.
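
The validate->fix->retry checkpoint suggested above could be sketched roughly as follows. This is a minimal illustration, not the skill's own workflow: the file names `promptfooconfig.echo.yaml` and `results.json`, the `fix` hook, and the helper names are hypothetical, and the sketch assumes that `promptfoo eval` exits non-zero when a test fails and that the installed version supports `--filter-failing`; both should be checked against the Promptfoo CLI documentation. It also assumes the echo config carries its own assertions on the rendered prompt.

```python
import subprocess
from typing import Callable, List

# A runner takes a command and returns its exit code, so the loop can be
# exercised without actually invoking the CLI.
Runner = Callable[[List[str]], int]

BASE = ["npx", "promptfoo@latest", "eval"]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def eval_with_checkpoints(
    runner: Runner = run_command,
    fix: Callable[[], None] = lambda: None,
    max_retries: int = 2,
) -> bool:
    """Validate prompt rendering, run the eval, then fix and retry failures."""
    # Checkpoint 1: validate rendering with a config whose provider is `echo`
    # (hypothetical file name; assumed to assert on the rendered prompt).
    if runner(BASE + ["-c", "promptfooconfig.echo.yaml"]) != 0:
        return False

    # Checkpoint 2: run the real eval and save results for inspection.
    if runner(BASE + ["-c", "promptfooconfig.yaml", "-o", "results.json"]) == 0:
        return True

    # Checkpoint 3: apply a fix, then re-run only the previously failing tests.
    # `--filter-failing` is assumed to be available in the installed version.
    retry = BASE + ["-c", "promptfooconfig.yaml", "--filter-failing", "results.json"]
    for _ in range(max_retries):
        fix()  # e.g. edit the prompt or the assertion before retrying
        if runner(retry) == 0:
            return True
    return False
```

Passing the runner in as a parameter keeps the loop itself easy to check in isolation; in real use the default `run_command` shells out to the CLI.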

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The body is mostly efficient and actionable, but repeats the same gotchas multiple times: the maxConcurrency/commandLineOptions rule appears in the config example, the relay section, the troubleshooting section, and the key rules; the llm-rubric relay/provider 401 issue is restated three times. | 2 / 3 |
| Actionability | Provides fully executable, copy-paste-ready YAML configs, Python assertion functions, and CLI commands (e.g., "npx promptfoo@latest eval", complete get_assert/custom_check implementations). | 3 / 3 |
| Workflow Clarity | Quick Start gives a clear init->eval->view sequence and a troubleshooting section aids recovery, but the core workflow lacks an explicit validate->fix->retry checkpoint loop. | 2 / 3 |
| Progressive Disclosure | There is one clearly signaled one-level-deep reference (references/promptfoo_api.md) at the end, but most detail lives inline in SKILL.md and the referenced example project ./tiaogaoren/ does not exist as a bundle path. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
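
For readers unfamiliar with the `get_assert` functions mentioned under Actionability, a Promptfoo Python assertion generally looks something like the sketch below. It is an illustrative example, not code taken from the skill: the `max_chars` test variable and the specific checks are hypothetical, and it assumes the assertion is loaded from a Python file whose `get_assert(output, context)` function returns a dict with `pass`, `score`, and `reason` keys.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output is non-empty and within a character limit."""
    # `max_chars` is a hypothetical test variable; default to 500 if unset.
    limit = int(context.get("vars", {}).get("max_chars", 500))
    text = output.strip()

    if not text:
        return {"pass": False, "score": 0.0, "reason": "Output is empty"}
    if len(text) > limit:
        return {
            "pass": False,
            "score": 0.0,
            "reason": f"Output has {len(text)} characters, limit is {limit}",
        }
    return {
        "pass": True,
        "score": 1.0,
        "reason": "Output is non-empty and within the length limit",
    }
```

Returning a `reason` alongside the pass/fail flag makes failures easier to diagnose when inspecting eval results.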

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is exemplary: it states concrete capabilities, gives explicit trigger guidance with natural keywords, and carves out a clear niche. It earns the top score on every dimension.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple concrete actions such as "Configures and runs LLM evaluation", "creating evaluation configs (promptfooconfig.yaml)", "writing Python custom assertions", "implementing llm-rubric", and "managing few-shot examples". | 3 / 3 |
| Completeness | Explicitly answers both what ("Configures and runs LLM evaluation using Promptfoo framework") and when ("Use when setting up prompt testing... Triggers on keywords like..."). | 3 / 3 |
| Trigger Term Quality | Includes natural user keywords "promptfoo", "eval", "LLM evaluation", "prompt testing", and "model comparison", covering common phrasings a user would actually say. | 3 / 3 |
| Distinctiveness Conflict Risk | Anchored to a specific framework (Promptfoo) with distinct trigger terms, making it unlikely to fire for unrelated skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: daymade/claude-code-skills (Reviewed)
