Configures and runs LLM evaluations using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing custom Python assertions, implementing llm-rubric for LLM-as-judge grading, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
- Overall score: 91
- Does it follow best practices? 88%
- Impact: 97% (1.59x average score across 3 eval scenarios)
- Advisory: suggest reviewing before use
### Relay API configuration

| Criterion | Baseline | With skill |
| --- | --- | --- |
| apiBaseUrl placement | 100% | 100% |
| maxConcurrency location | 0% | 100% |
| maxConcurrency value | 0% | 100% |
| llm-rubric provider config | 100% | 100% |
| llm-rubric apiBaseUrl repeated | 100% | 100% |
| Anthropic provider ID format | 0% | 100% |
| Schema comment present | 0% | 100% |
| outputPath defined | 0% | 0% |
| file:// path usage | 62% | 100% |
| llm-rubric threshold set | 0% | 100% |
| Provider label present | 0% | 100% |
| ANTHROPIC_API_KEY note | 100% | 100% |
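The criteria above all map onto a single promptfooconfig.yaml. A minimal sketch of a config that would satisfy them, assuming a hypothetical relay endpoint (`https://relay.example.com/v1`) and an illustrative Claude model ID:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Relay API evaluation

prompts:
  - file://prompts/summarize.txt   # file:// paths resolve relative to this config

providers:
  - id: anthropic:messages:claude-3-5-sonnet-20241022
    label: claude-via-relay        # human-readable label for reports
    config:
      apiBaseUrl: https://relay.example.com/v1   # hypothetical relay endpoint

commandLineOptions:
  maxConcurrency: 4                # the criterion checks it lives here, not top-level

outputPath: output/results.json

tests:
  - vars:
      article: file://data/article.txt
    assert:
      - type: llm-rubric
        value: Summary is accurate and under 100 words
        threshold: 0.8             # llm-rubric threshold set
        provider:
          id: anthropic:messages:claude-3-5-sonnet-20241022
          config:
            apiBaseUrl: https://relay.example.com/v1   # repeated for the grader

# Note: ANTHROPIC_API_KEY must be set in the environment.
```

The apiBaseUrl is deliberately repeated in the llm-rubric grader's provider block, since the grading call would otherwise go to the default Anthropic endpoint rather than the relay.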
### Python custom assertions

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Default function name | 100% | 100% |
| Named function reference | 0% | 100% |
| Return dict format | 100% | 100% |
| Reason field returned | 100% | 100% |
| named_scores included | 100% | 100% |
| Context vars access pattern | 100% | 100% |
| HTML stripping present | 80% | 100% |
| file:// path for assertions | 100% | 100% |
| file:// relative to config root | 100% | 100% |
| PROMPTFOO_PYTHON note | 0% | 66% |
| Schema comment present | 0% | 100% |
| Standard directory structure | 0% | 100% |
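Read as a checklist, these criteria describe a Python assertion file along the following lines. The variable name `question`, the 500-character limit, and the file path are illustrative, and the `named_scores` key follows the criterion name above:

```python
# assertions/check_answer.py
# Default entry point: promptfoo calls get_assert(output, context) unless a
# named function is referenced, e.g. file://assertions/check_answer.py:my_func.
import re

def get_assert(output, context):
    # Context vars access pattern: test variables live under context['vars'].
    question = context["vars"].get("question", "")

    # HTML stripping: remove tags before scoring (simple regex sketch).
    text = re.sub(r"<[^>]+>", "", output)

    mentions_question = question.lower() in text.lower() if question else True
    length_ok = len(text) <= 500
    score = (int(mentions_question) + int(length_ok)) / 2

    # Return dict format, with a reason field and per-criterion named scores.
    return {
        "pass": score >= 0.5,
        "score": score,
        "reason": f"mentions_question={mentions_question}, length_ok={length_ok}",
        "named_scores": {
            "mentions_question": float(mentions_question),
            "length_ok": float(length_ok),
        },
    }
```

The config would reference it as an assertion of `type: python` with `value: file://assertions/check_answer.py`; file:// paths resolve relative to the config file, and the `PROMPTFOO_PYTHON` environment variable can point promptfoo at a specific Python interpreter.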
### Few-shot setup and echo preview

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Chat JSON format | 100% | 100% |
| Assistant turn in prompt | 100% | 100% |
| 1-3 few-shot examples | 100% | 100% |
| Examples from files | 100% | 100% |
| Echo provider config | 100% | 100% |
| OpenAI provider ID format | 0% | 100% |
| max_tokens set high | 100% | 100% |
| outputPath defined | 0% | 100% |
| llm-rubric with threshold | 50% | 100% |
| Schema comment present | 0% | 100% |
| Standard directory layout | 50% | 100% |
| maxConcurrency under commandLineOptions | 100% | 100% |
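These criteria fit the common pattern of keeping few-shot chat prompts in JSON files and previewing the rendered prompt with the echo provider before spending tokens on a real model. A sketch, with illustrative file names, model ID, and rubric text — first the chat-format prompt file with one few-shot user/assistant exchange:

```json
[
  {"role": "system", "content": "You are a concise support assistant."},
  {"role": "user", "content": "How do I reset my password?"},
  {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
  {"role": "user", "content": "{{question}}"}
]
```

Then a config that runs it through both the echo provider (which returns the rendered prompt verbatim, for preview) and a real model:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - file://prompts/support_chat.json   # chat JSON format, examples kept in files

providers:
  - echo                               # echoes the rendered prompt for preview
  - id: openai:chat:gpt-4o-mini
    config:
      max_tokens: 4096                 # set high so answers are not truncated

commandLineOptions:
  maxConcurrency: 4

outputPath: output/results.json

tests:
  - vars:
      question: How do I enable 2FA?
    assert:
      - type: llm-rubric
        value: Gives actionable, step-by-step instructions
        threshold: 0.8
```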