Load test a Databricks App to find its maximum QPS. Use when: (1) User says 'load test', 'benchmark', 'QPS', 'throughput', or 'performance test', (2) User wants to find how many queries per second their app can handle, (3) User wants to set up load testing scripts for their agent, (4) User wants to view load test results/dashboard.
Score: 85 (81%)
Does it follow best practices?

Impact: Pending (no eval scenarios have been run).
Risk: Risky; do not use without reviewing.
Quality
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong skill description that clearly defines its purpose, provides comprehensive trigger terms, and explicitly states both what the skill does and when it should be used. It follows the recommended pattern with a concise capability statement followed by a structured 'Use when' clause with multiple trigger scenarios. The description is well-scoped to a specific domain (Databricks App load testing) making it highly distinctive.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: load testing a Databricks App, finding maximum QPS, setting up load testing scripts, and viewing load test results/dashboard. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (load test a Databricks App to find its maximum QPS) and 'when', with an explicit 'Use when:' clause listing four specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of the terms users would naturally say: 'load test', 'benchmark', 'QPS', 'throughput', 'performance test', 'queries per second', 'load testing scripts', 'load test results'. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with a clear niche: load testing specifically for Databricks Apps. The combination of 'Databricks App', 'load test', 'QPS', and 'benchmark' creates a very specific domain unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation: 62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a comprehensive load testing guide with excellent workflow structure and clear sequencing, but it suffers from being overly long for a single SKILL.md and critically lacks the actual script implementations (locustfile.py, run_load_test.py, dashboard_template.py) that are the core deliverables. The skill describes what to build rather than providing copy-paste-ready code for the most important artifacts, which significantly limits actionability.
Suggestions

- Provide complete, executable implementations of locustfile.py, run_load_test.py, and dashboard_template.py, either inline or as bundle files; these are the core deliverables and currently have only prose descriptions (see the sketch after this list).
- Split the content: keep Steps 1-4 in SKILL.md as a concise overview, and move the dashboard interpretation guide, troubleshooting table, and mocking details into separate referenced files.
- Remove explanatory content Claude can infer (e.g., why mocking is useful, what Peak QPS means, what SSE is) to cut token usage by roughly 30%.
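To make the first suggestion concrete, a minimal locustfile.py could look like the sketch below. It assumes the app exposes an HTTP chat endpoint that accepts JSON and that an OAuth token is available in the environment; the `/api/chat` path, payload shape, and `DATABRICKS_TOKEN` name are illustrative placeholders, not details taken from the skill:

```python
# Hypothetical minimal locustfile.py. The endpoint path, payload shape,
# and DATABRICKS_TOKEN variable are assumptions for illustration only.
import os

from locust import HttpUser, between, task


class AgentUser(HttpUser):
    # Pause 0.5-2s between requests to approximate real user pacing.
    wait_time = between(0.5, 2)

    def on_start(self):
        # Attach one bearer token per simulated user; Locust reuses the
        # session headers on every subsequent request.
        token = os.environ.get("DATABRICKS_TOKEN", "")
        self.client.headers.update({"Authorization": f"Bearer {token}"})

    @task
    def query_agent(self):
        # name= groups all calls under a single row in Locust's stats,
        # which keeps the QPS readout clean.
        self.client.post(
            "/api/chat",
            json={"messages": [{"role": "user", "content": "ping"}]},
            name="agent query",
        )
```

A wrapper such as run_load_test.py could then drive Locust headlessly (for example, `locust -f locustfile.py --host <app-url> --headless -u 50 -r 5 -t 2m`) and sweep the user count to locate peak QPS.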
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is quite long (~300 lines) and includes some information Claude could infer (e.g., what mocking is useful for, what Peak QPS means). However, most content is domain-specific configuration detail (DAB variables, Locust setup, OAuth) that Claude wouldn't inherently know, so it is not egregiously verbose; it could simply be tightened in several places. | 2 / 3 |
| Actionability | The skill provides concrete CLI commands, YAML configs, and directory structures, which is good. However, the three core scripts (locustfile.py, run_load_test.py, dashboard_template.py) are only described in prose rather than provided as executable code. Claude is told to 'create' these files but given only behavioral specifications, not actual implementations. The mock example is a 3-line snippet referencing a file that isn't in the bundle. | 2 / 3 |
| Workflow Clarity | The 5-step workflow is clearly sequenced with explicit validation checkpoints (verify apps are ACTIVE before proceeding, healthcheck before the load test, check logs on 0 QPS); a minimal healthcheck gate is sketched after this table. The 'What Happens During a Run' section explains the progression, and the troubleshooting table provides error-recovery guidance. The gather-parameters step upfront ensures prerequisites are met. | 3 / 3 |
| Progressive Disclosure | The skill is a monolithic document with all content inline; the dashboard interpretation, troubleshooting, mocking guide, deployment matrix, and parameter reference could be split into separate files. It references `examples/mock_openai_client.py`, but no bundle files are provided, making that reference unverifiable. The content would benefit from a concise overview with links to detailed sections. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
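As an illustration of the healthcheck checkpoint noted in the Workflow Clarity row, a pre-flight gate could look like the sketch below; the `/health` path and the timing constants are assumptions, since the skill describes this step only in prose:

```python
# Hypothetical pre-flight gate: block until the deployed app answers its
# health endpoint so the load test never starts against a cold or broken app.
import time

import requests


def wait_until_healthy(base_url: str, path: str = "/health",
                       timeout_s: int = 180, poll_s: int = 5) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(base_url.rstrip("/") + path,
                            timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass  # the app may still be deploying; retry until the deadline
        time.sleep(poll_s)
    raise RuntimeError(
        f"{base_url} not healthy after {timeout_s}s; confirm the app is "
        "ACTIVE before running the load test"
    )
```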
Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Skill structure validation: 11 / 11 checks passed. No warnings or errors.