Load test a Databricks App to find its maximum QPS. Use when: (1) User says 'load test', 'benchmark', 'QPS', 'throughput', or 'performance test', (2) User wants to find how many queries per second their app can handle, (3) User wants to set up load testing scripts for their agent, (4) User wants to view load test results/dashboard.
Goal: Find the maximum QPS (queries per second) your Databricks App can support.
Before beginning, use the AskUserQuestion tool to collect the following from the user:
- `DATABRICKS_HOST`? (workspace URL)

Create a `load-test-scripts/` directory in the project with the following files. These scripts are framework-agnostic and work with any Databricks App.
```
<project-root>/
  agent_server/            # Existing agent code
  load-test-scripts/       # Load testing scripts (create this)
    run_load_test.py       # Main CLI — orchestrates Locust tests
    locustfile.py          # Locust test definition (SSE streaming, TTFT tracking)
    dashboard_template.py  # Generates interactive HTML dashboard from results
    .env.example           # Template for env vars
  load-test-runs/          # Test results (auto-created per run)
    <run-name>/
      dashboard.html       # Interactive dashboard
      test_config.json     # Test parameters
      <label>/             # Per-config Locust CSV results
```

**locustfile.py** — Locust load test that:

- POSTs `/invocations` with `{"input": [...], "stream": true}` to the app
- Parses the SSE response (`data: {json}` lines) and counts chunks until `data: [DONE]`
- Records TTFT (time to first `data:` line) as a custom Locust metric
- Authenticates via M2M OAuth (`client_credentials` grant to `{host}/oidc/v1/token`) with auto-refresh
- Defines `StepRampShape` — ramps users from `step_size` to `max_users`, holding each level for `step_duration` seconds

(A minimal sketch of this logic follows the script descriptions below.)

**run_load_test.py** — CLI orchestrator that:

- Accepts `--app-url` (repeatable), `--client-id`, `--client-secret`, `--max-users`, `--step-size`, `--step-duration`, `--run-name`, `--dashboard`, `--compute-size`, and `--label` flags
- Writes per-app results to `load-test-runs/<run-name>/<label>/`
- Generates the dashboard when `--dashboard` is passed

**dashboard_template.py** — Generates a self-contained HTML dashboard with Chart.js:

```
uv run dashboard_template.py ../load-test-runs/<run-name>/
```
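To make the locustfile's responsibilities concrete, here is a minimal sketch of the SSE parsing, TTFT metric, and step-ramp shape. This is an illustration of the pattern, not the shipped script; the real locustfile also handles OAuth token acquisition and refresh:

```python
# Minimal sketch of locustfile.py's core logic (illustrative; OAuth handling omitted).
import time

from locust import HttpUser, LoadTestShape, events, task


class AgentUser(HttpUser):
    @task
    def invoke(self):
        start = time.perf_counter()
        ttft_fired = False
        # Stream the SSE response and read lines until the [DONE] sentinel.
        with self.client.post(
            "/invocations",
            json={"input": [{"role": "user", "content": "hello"}], "stream": True},
            stream=True,
            catch_response=True,
        ) as resp:
            for raw in resp.iter_lines():
                line = raw.decode() if isinstance(raw, bytes) else raw
                if not line.startswith("data:"):
                    continue
                if not ttft_fired:
                    # Report time-to-first-token as a custom Locust metric.
                    events.request.fire(
                        request_type="SSE",
                        name="ttft",
                        response_time=(time.perf_counter() - start) * 1000,
                        response_length=0,
                        exception=None,
                        context=None,
                    )
                    ttft_fired = True
                if line.strip() == "data: [DONE]":
                    break
            resp.success()


class StepRampShape(LoadTestShape):
    """Ramp from step_size to max_users, holding each level for step_duration seconds."""

    step_size, step_duration, max_users, spawn_rate = 20, 30, 300, 20

    def tick(self):
        run_time = self.get_run_time()
        if run_time > (self.max_users / self.step_size) * self.step_duration:
            return None  # all steps completed; stop the test
        users = min((int(run_time // self.step_duration) + 1) * self.step_size, self.max_users)
        return (users, self.spawn_rate)
```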
The load testing scripts use their own `pyproject.toml` inside `load-test-scripts/` to avoid polluting the agent's production dependencies:

```toml
# load-test-scripts/pyproject.toml
[project]
name = "load-test-scripts"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "locust>=2.32,<2.40",
    "urllib3<2.3",
    "requests",
]
```

Then install from within the `load-test-scripts/` directory:
```
cd load-test-scripts/
uv sync
```

Note: `locust>=2.43` has a known `RecursionError`. Pin to `<2.40` to avoid it.
Mocking is optional — you can skip this step to test your real agent end-to-end (including LLM latency). However, mocking is useful for isolating the app's serving capacity from LLM latency and cost, and for keeping results deterministic across runs.
The mock timing is controlled by two environment variables (set in `app.yaml` or `databricks.yml`):

- `MOCK_CHUNK_DELAY_MS` — delay between text chunks in milliseconds (default: 10)
- `MOCK_CHUNK_COUNT` — number of text chunks per response (default: 80)

**For OpenAI Agents SDK templates:** Create a `MockAsyncOpenAI` client that replaces `AsyncDatabricksOpenAI`. It simulates tool call streaming (instant) and text response streaming (delayed chunks). A reference implementation is available at `examples/mock_openai_client.py`:
```python
from agent_server.mock_openai_client import MockAsyncOpenAI

set_default_openai_client(MockAsyncOpenAI())
set_default_openai_api("chat_completions")
```

**For LangGraph templates:** Replace the `ChatDatabricks` model with a mock that returns pre-built `AIMessage` objects with tool calls and text content using configurable delays.
**For custom agents:** Wrap whatever external API calls you make (LLM, vector search, etc.) with mock implementations that return realistic response shapes.
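Whichever template you use, the core of a mock is the same: a generator that yields a fixed number of chunks with a configurable delay. A minimal sketch, assuming only the two environment variables above (the helper name and payload are illustrative, not from the reference implementation):

```python
# Hypothetical mock streaming helper honoring MOCK_CHUNK_DELAY_MS / MOCK_CHUNK_COUNT.
import asyncio
import os

MOCK_CHUNK_DELAY_MS = int(os.getenv("MOCK_CHUNK_DELAY_MS", "10"))
MOCK_CHUNK_COUNT = int(os.getenv("MOCK_CHUNK_COUNT", "80"))


async def mock_text_stream(text: str = "mock token"):
    """Yield MOCK_CHUNK_COUNT chunks, sleeping MOCK_CHUNK_DELAY_MS between them."""
    for i in range(MOCK_CHUNK_COUNT):
        yield f"{text} {i} "
        await asyncio.sleep(MOCK_CHUNK_DELAY_MS / 1000)
```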
Deploy multiple Databricks Apps with varying compute sizes and worker counts.
| Compute Size | Workers | App Name |
|---|---|---|
| Medium | 2 | <your-app>-medium-w2 |
| Medium | 4 | <your-app>-medium-w4 |
| Medium | 6 | <your-app>-medium-w6 |
| Medium | 8 | <your-app>-medium-w8 |
| Large | 6 | <your-app>-large-w6 |
| Large | 8 | <your-app>-large-w8 |
| Large | 10 | <your-app>-large-w10 |
| Large | 12 | <your-app>-large-w12 |
Databricks CLI:

```
databricks apps create <app-name> --compute-size MEDIUM
databricks apps update <app-name> --compute-size LARGE
```

Databricks UI: Go to Compute > Apps > your app > Edit > Configure > Compute.
`start-server` (via `AgentServer.run()`) accepts a `--workers` flag directly. Pass the worker count in the command array using a DAB variable — no wrapper script needed:
```yaml
variables:
  app_name:
    default: "my-agent-medium-w2"
  workers:
    default: "2"

resources:
  apps:
    load_test_app:
      name: ${var.app_name}
      source_code_path: .
      config:
        command: ["uv", "run", "start-server", "--workers", "${var.workers}"]
        env:
          - name: MOCK_CHUNK_DELAY_MS
            value: "10"
          - name: MOCK_CHUNK_COUNT
            value: "80"

targets:
  medium-w2:
    default: true
    variables:
      app_name: "my-agent-medium-w2"
      workers: "2"
  large-w8:
    variables:
      app_name: "my-agent-large-w8"
      workers: "8"
```

Deploy and run each target:

```
databricks bundle deploy --target medium-w2
databricks bundle run load_test_app --target medium-w2
```

Verify apps are ACTIVE before proceeding:
```
databricks apps get <app-name> --output json | jq '{app_status, compute_status, url}'
```

Load tests can run for hours. U2M OAuth tokens expire and break your test mid-run. Use M2M (machine-to-machine) OAuth with a service principal instead.
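The load scripts perform the token exchange for you; for reference, the `client_credentials` flow against `{host}/oidc/v1/token` looks roughly like this (a sketch using `requests`; caching and auto-refresh are omitted, and the `all-apis` scope is the standard Databricks M2M scope):

```python
# Sketch of the M2M client_credentials token exchange (no caching/refresh).
import requests


def fetch_m2m_token(host: str, client_id: str, client_secret: str) -> str:
    """Exchange service principal credentials for a short-lived OAuth access token."""
    resp = requests.post(
        f"{host}/oidc/v1/token",
        auth=(client_id, client_secret),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```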
Export the service principal credentials before running tests:

```
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_CLIENT_ID=<your-client-id>
export DATABRICKS_CLIENT_SECRET=<your-client-secret>
```

| Parameter | Required | Default | Description |
|---|---|---|---|
| `--app-url` | Yes | — | App URL(s) to test (repeatable) |
| `--client-id` | Recommended | `DATABRICKS_CLIENT_ID` env | Service principal client ID |
| `--client-secret` | Recommended | `DATABRICKS_CLIENT_SECRET` env | Service principal client secret |
| `--label` | No | Auto-derived from URL | Human-readable label per app (repeatable) |
| `--compute-size` | No | Auto-detected or `medium` | Compute size tag per app: `medium`, `large` (repeatable) |
| `--max-users` | No | 300 | Maximum concurrent simulated users |
| `--step-size` | No | 20 | Users added per ramp step |
| `--step-duration` | No | 30 | Seconds per ramp step |
| `--spawn-rate` | No | 20 | User spawn rate (users/sec) |
| `--run-name` | No | `<timestamp>` | Name for this run — results saved to `load-test-runs/<run-name>/` |
| `--dashboard` | No | Off | Generate interactive HTML dashboard after tests |
```
cd load-test-scripts/
# Quick single-app test:
uv run run_load_test.py \
--app-url https://my-app.aws.databricksapps.com \
--client-id <ID> --client-secret <SECRET> \
--dashboard --run-name quick-test
# Full matrix — 8 apps, overnight:
uv run run_load_test.py \
--app-url https://my-app-medium-w2.aws.databricksapps.com \
--app-url https://my-app-medium-w4.aws.databricksapps.com \
--app-url https://my-app-large-w8.aws.databricksapps.com \
--app-url https://my-app-large-w10.aws.databricksapps.com \
--compute-size medium --compute-size medium \
--compute-size large --compute-size large \
--max-users 1000 --step-size 20 --step-duration 10 \
--dashboard --run-name overnight-sweep
# Multiple runs for statistical consistency:
for RUN in r1 r2 r3 r4 r5; do
uv run run_load_test.py \
--app-url ... \
--client-id <ID> --client-secret <SECRET> \
--max-users 1000 --step-size 20 --step-duration 10 \
--run-name my_test_${RUN} --dashboard || break
done
```

Each simulated user streams its response until `data: [DONE]`. The ramp holds each user level for `step_duration` seconds, so a full sweep takes `(max_users / step_size) * step_duration` seconds per app, e.g. `(300 / 20) * 30 = 15 steps * 30s = ~7.5 min per app`.

When a run completes, open the dashboard:

```
open load-test-runs/<run-name>/dashboard.html
```

Or regenerate it later:

```
cd load-test-scripts/
uv run dashboard_template.py ../load-test-runs/<run-name>/
```

| Issue | Solution |
|---|---|
| Auth token expired mid-test | Use M2M OAuth (`--client-id`/`--client-secret`) instead of static tokens |
| Healthcheck fails | Verify app is ACTIVE: `databricks apps get <name> --output json` |
| 0 QPS / no results | Check `load-test-runs/<run-name>/<label>/locust_output.log` for errors |
| Low QPS despite high user count | App is saturated — try more workers or larger compute |
| High failure rate | App is overloaded — reduce `--max-users` or increase workers/compute |
| Dashboard shows no ramp data | Ensure `results_stats_history.csv` exists in each result subdir |
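On that last point: the ramp chart is built from Locust's standard `--csv` stats-history output. A minimal sketch of extracting QPS-versus-users from it, assuming Locust's usual `User Count`, `Requests/s`, and `Name` columns:

```python
# Sketch: pull (user_count, QPS) points from a Locust stats-history CSV.
import csv


def ramp_points(path: str):
    """Yield (user_count, requests_per_sec) from the aggregated rows."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Name") == "Aggregated":
                yield int(row["User Count"]), float(row["Requests/s"])
```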