Load test a Databricks App to find its maximum QPS. Use when: (1) User says 'load test', 'benchmark', 'QPS', 'throughput', or 'performance test', (2) User wants to find how many queries per second their app can handle, (3) User wants to set up load testing scripts for their agent, (4) User wants to view load test results/dashboard.
68
81%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Risky
Do not use without reviewing
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong skill description that clearly defines its purpose, lists concrete capabilities, and provides explicit trigger guidance with natural user terms. It follows the recommended pattern with a concise 'what' statement followed by a well-structured 'Use when' clause covering multiple scenarios. The description is specific to a clear niche (Databricks App load testing) making it highly distinguishable.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists specific concrete actions: load testing a Databricks App, finding maximum QPS, setting up load testing scripts, and viewing load test results/dashboard. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (load test a Databricks App to find its maximum QPS) and 'when' with an explicit 'Use when:' clause listing four specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms: 'load test', 'benchmark', 'QPS', 'throughput', 'performance test', 'queries per second', 'load testing scripts', 'load test results/dashboard'. These are terms users would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive: targets a specific niche (load testing Databricks Apps for QPS) with domain-specific triggers like 'QPS', 'throughput', and 'Databricks App'. Unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured, domain-specific skill with a clear 5-step workflow and good validation checkpoints. Its main weakness is that the three core Python scripts are described rather than provided as executable code. This undermines actionability, because Claude must generate substantial code from prose specifications. The skill is also longer than necessary: some sections (dashboard interpretation, mocking rationale) could be trimmed or split out.
Suggestions
- Provide the actual executable code for locustfile.py, run_load_test.py, and dashboard_template.py (or include them as bundle files and reference them), rather than describing what they should do in prose.
- Split mocking guidance, dashboard interpretation, and troubleshooting into separate referenced files to reduce the main skill's length and improve progressive disclosure.
- Trim explanatory prose that Claude can infer: for example, remove the bullet list explaining why mocking is useful and the 'Interpreting Results' definitions of obvious metrics like 'Failure Rate'.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is quite long (~250 lines) and includes some information Claude could infer (e.g., explaining what mocking is useful for, what peak QPS means). However, most content is domain-specific configuration details (DAB variables, Locust pinning, OAuth setup) that genuinely add value. Could be tightened by ~30% without losing substance. | 2 / 3 |
| Actionability | The skill provides concrete CLI commands, YAML configs, and directory structures, which is strong. However, the three core scripts (locustfile.py, run_load_test.py, dashboard_template.py) are described in prose rather than provided as executable code: Claude is told what they should do but must generate them from descriptions. The mock example is a 3-line snippet referencing a file that isn't in the bundle. | 2 / 3 |
| Workflow Clarity | The 5-step workflow is clearly sequenced with explicit validation checkpoints: verify apps are ACTIVE before testing, healthcheck + warmup before load test, and a troubleshooting table for common failures. The ramp-to-saturation process includes clear interpretation guidance for identifying the saturation point. | 3 / 3 |
| Progressive Disclosure | The skill references `examples/mock_openai_client.py` but no bundle files are provided, making this reference unverifiable. The content is monolithic: all 250+ lines are in a single file. The dashboard interpretation, troubleshooting, and mocking sections could be split into separate reference files to keep the main skill leaner. | 2 / 3 |
| Total | | 9 / 12 Passed |
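
The ramp-to-saturation interpretation mentioned under Workflow Clarity can be illustrated with a small sketch. This is not the skill's own code, which is not shown here; the `RampStep` record, the `find_saturation` helper, and both thresholds (`min_gain`, `max_failure_rate`) are hypothetical names and values chosen for illustration. The idea is to walk the ramp steps in order and stop at the first step where failures exceed a tolerance or throughput stops growing meaningfully.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RampStep:
    """One step of a ramp test (hypothetical record, not from the skill)."""
    users: int           # concurrent simulated users at this step
    qps: float           # measured requests per second
    failure_rate: float  # fraction of failed requests, 0.0 to 1.0


def find_saturation(
    steps: Sequence[RampStep],
    min_gain: float = 0.05,          # assumed: require at least 5% more QPS per step
    max_failure_rate: float = 0.01,  # assumed: tolerate up to 1% failures
) -> Optional[RampStep]:
    """Return the last healthy step before throughput plateaus or failures spike.

    Returns None if there are no steps or the first step is already unhealthy.
    """
    best: Optional[RampStep] = None
    for step in steps:
        # Too many failures: the app is past its limit at this load.
        if step.failure_rate > max_failure_rate:
            break
        # Adding users no longer buys meaningful throughput: plateau reached.
        if best is not None and step.qps < best.qps * (1 + min_gain):
            break
        best = step
    return best


# Made-up numbers, purely to show the shape of the input.
example_steps = [
    RampStep(users=10, qps=50.0, failure_rate=0.0),
    RampStep(users=20, qps=98.0, failure_rate=0.0),
    RampStep(users=40, qps=150.0, failure_rate=0.0),
    RampStep(users=80, qps=153.0, failure_rate=0.002),
    RampStep(users=160, qps=140.0, failure_rate=0.08),
]
saturation = find_saturation(example_steps)
```

With these illustrative numbers the helper stops at the 80-user step, because QPS grew by less than 5% over the 40-user step, and reports the 40-user step as the saturation point. A real run would take its steps from the load test output, and the thresholds would need tuning to the app under test.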
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation: 11 / 11 Passed
Validation for skill structure
No warnings or errors.
1c88215