Use when a dbt Cloud/platform job fails and you need to diagnose the root cause, especially when error messages are unclear or when intermittent failures occur. Do not use for local dbt development errors.
Systematically diagnose and resolve dbt Cloud job failures using available MCP tools, CLI commands, and data investigation.
Not for: Local dbt development errors - use the skill using-dbt-for-analytics-engineering instead
Never modify a test to make it pass without understanding why it's failing.
A failing test is evidence of a problem. Changing the test to pass hides the problem. Investigate the root cause first.
| You're Thinking... | Reality |
|---|---|
| "Just make the test pass" | The test is telling you something is wrong. Investigate first. |
| "There's a board meeting in 2 hours" | Rushing to a fix without diagnosis creates bigger problems. |
| "We've already spent 2 days on this" | Sunk cost doesn't justify skipping proper diagnosis. |
| "I'll just update the accepted values" | Are the new values valid business data or bugs? Verify first. |
| "It's probably just a flaky test" | "Flaky" means there's an underlying issue. Find it. We don't allow flaky tests to stay. |
```mermaid
flowchart TD
    A[Job failure reported] --> B{MCP Admin API available?}
    B -->|yes| C[Use list_jobs_runs to get history]
    B -->|no| D[Ask user for logs and run_results.json]
    C --> E[Use get_job_run_error for details]
    D --> F[Classify error type]
    E --> F
    F --> G{Error type?}
    G -->|Infrastructure| H[Check warehouse, connections, timeouts]
    G -->|Code/Compilation| I[Check git history for recent changes]
    G -->|Data/Test Failure| J[Use discovering-data skill to investigate]
    H --> K{Root cause found?}
    I --> K
    J --> K
    K -->|yes| L[Create branch, implement fix]
    K -->|no| M[Create findings document]
    L --> N[Add test - prefer unit test]
    N --> O[Create PR with explanation]
    M --> P[Document what was checked and next steps]
```

Use these tools first - they provide the most comprehensive data:
| Tool | Purpose |
|---|---|
| list_jobs_runs | Get recent run history, identify patterns |
| get_job_run_error | Get detailed error message and context |
```python
# Example: Get recent runs for job 12345
list_jobs_runs(job_id=12345, limit=10)

# Example: Get error details for specific run
get_job_run_error(run_id=67890)
```

Ask the user to provide these artifacts:
`run_results.json` - contains execution status for each node

To get the run_results.json, generate the artifact URL for the user:

```
https://<DBT_ENDPOINT>/api/v2/accounts/<ACCOUNT_ID>/runs/<RUN_ID>/artifacts/run_results.json?step=<STEP_NUMBER>
```

Where:
- `<DBT_ENDPOINT>` - The dbt Cloud endpoint, e.g. cloud.getdbt.com for the US multi-tenant platform (there are other endpoints for other regions), or ACCOUNT_PREFIX.us1.dbt.com for the cell-based platforms (there are different cell endpoints for different regions and cloud providers)
- `<ACCOUNT_ID>` - The dbt Cloud account ID
- `<RUN_ID>` - The failed job run ID
- `<STEP_NUMBER>` - The step that failed (e.g., if step 4 failed, use ?step=4)

Example request:
"I don't have access to the dbt MCP server. Could you provide:
- The debug logs from dbt Cloud (Job Run → Logs → Download)
- The run_results.json - open this URL and copy/paste or upload the contents:
  https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/run_results.json?step=4"
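Once the user supplies run_results.json, the failed nodes can be pulled out programmatically. A minimal sketch, assuming the standard dbt artifact schema (a top-level `results` list whose entries carry a `unique_id`, a `status` of success / error / fail / skipped, and an optional `message`):

```python
import json

# Sketch: filter a run_results.json artifact down to the failed nodes.
# Assumes the standard dbt artifact schema described above.
def failed_nodes(run_results: dict) -> list:
    return [
        (r["unique_id"], r.get("message"))
        for r in run_results["results"]
        if r["status"] in ("error", "fail")
    ]

# Inline example; in practice load the downloaded artifact instead:
# run_results = json.load(open("run_results.json"))
run_results = {
    "results": [
        {"unique_id": "model.proj.orders", "status": "success"},
        {"unique_id": "test.proj.unique_orders_id", "status": "fail",
         "message": "Got 3 results, configured to fail if != 0"},
    ]
}
for unique_id, message in failed_nodes(run_results):
    print(f"{unique_id}: {message}")
```

This gives you the same starting point as `get_job_run_error` when the MCP tools aren't available.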
| Error Type | Indicators | Primary Investigation |
|---|---|---|
| Infrastructure | Connection timeout, warehouse error, permissions | Check warehouse status, connection settings |
| Code/Compilation | Undefined macro, syntax error, parsing error | Check git history for recent changes, use LSP tools |
| Data/Test Failure | Test failed with N results, schema mismatch | Use discovering-data skill to query actual data |
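A first pass at classifying the error text can mirror the table above. The keyword patterns below are illustrative assumptions, not an official taxonomy; real messages vary by warehouse and adapter, so treat this as triage only:

```python
import re

# Hypothetical keyword heuristics mirroring the classification table above.
PATTERNS = {
    "infrastructure": r"connection timeout|warehouse|permission",
    "code/compilation": r"undefined macro|syntax error|parsing|compilation",
    "data/test failure": r"got \d+ results|test failed|schema mismatch",
}

def classify_error(message: str) -> str:
    for error_type, pattern in PATTERNS.items():
        if re.search(pattern, message, re.IGNORECASE):
            return error_type
    return "unknown"

print(classify_error("Database Error: connection timeout after 300s"))
print(classify_error("Compilation Error: undefined macro 'my_macro'"))
print(classify_error("Got 5 results, configured to fail if != 0"))
```

An "unknown" result means you fall back to reading the full debug logs rather than guessing a category.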
Check git history for recent changes:
If you're not in the dbt project directory, use the dbt MCP server to find the repository:
```
# Get project details including repository URL and project subdirectory
get_project_details(project_id=<project_id>)
```

The response includes:
- `repository` - The git repository URL
- `dbt_project_subdirectory` - Optional subfolder where the dbt project lives (e.g., dbt/, transform/analytics/)

Then either use the `gh` CLI if it's on GitHub, or:

```shell
git clone <repo_url> /tmp/dbt-investigation
```

Important: If the project is in a subfolder, navigate to it after cloning:

```shell
cd /tmp/dbt-investigation/<project_subdirectory>
```

Once in the project directory:

```shell
git log --oneline -20
git diff HEAD~5..HEAD -- models/ macros/
```

Use the CLI and LSP tools from the dbt MCP server or use the dbt CLI to check for errors:
If the dbt MCP server is available, use its tools:
```
# CLI tools
mcp__dbt_parse()                          # Check for parsing errors
mcp__dbt_list_models()                    # With selectors and `+` for finding model dependencies
mcp__dbt_compile(models="failing_model")  # Check compilation

# LSP tools
mcp__dbt_get_column_lineage()             # Check column lineage
```

Otherwise, use the dbt CLI directly:

```shell
dbt parse                           # Check for parsing errors
dbt list --select +failing_model    # Check for models upstream of the failing model
dbt compile --select failing_model  # Check compilation
```

Search for the error pattern:
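Searching for the error pattern can be done with `grep -rn` across `models/` and `macros/`, or with a small script like this sketch (the needle and directory layout in the example comment are hypothetical):

```python
from pathlib import Path

# Sketch: scan SQL files under a root directory for the identifier
# named in the error message, returning (file, line number, line) hits.
def find_pattern(root: str, needle: str) -> list:
    hits = []
    for path in sorted(Path(root).rglob("*.sql")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits

# Example (hypothetical layout):
# for path, lineno, line in find_pattern("models", "undefined_macro_name"):
#     print(f"{path}:{lineno}: {line}")
```

Knowing every call site of an undefined macro or renamed column tells you whether a fix is a one-line change or a wider refactor.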
Use the discovering-data skill to investigate the actual data.
Get the test SQL:

```shell
dbt compile --select project_name.folder1.folder2.test_unique_name --output json
```

The full path for the test can be found with a `dbt ls --resource-type test` command.

Query the failing test's underlying data:

```shell
dbt show --inline "<query_from_the_test_SQL>" --output json
```

Compare to recent git changes:
Create a new branch:

```shell
git checkout -b fix/job-failure-<description>
```

Implement the fix addressing the actual root cause.
Add a test to prevent recurrence:

```yaml
unit_tests:
  - name: test_status_mapping
    model: orders
    given:
      - input: ref('stg_orders')
        rows:
          - {status_code: 1}
          - {status_code: 2}
    expect:
      rows:
        - {status: 'pending'}
        - {status: 'shipped'}
```

Create a PR with an explanation of the root cause and the fix.
Do not guess. Create a findings document.
Create docs/investigations/job-failure-<date>.md:
```markdown
# Job Failure Investigation: <Job Name>

**Date:** YYYY-MM-DD
**Job ID:** <id>
**Status:** Unresolved

## Summary
Brief description of the failure and symptoms.

## What Was Checked

### Tools Used
- [ ] list_jobs_runs - findings
- [ ] get_job_run_error - findings
- [ ] git history - findings
- [ ] Data investigation - findings

### Hypotheses Tested
| Hypothesis | Evidence | Result |
|------------|----------|--------|
| Recent code change | No changes to affected models in 7 days | Ruled out |

## Patterns Observed
- Failures occur between 2-4 AM (peak load time?)
- Always fails on model X

## Suggested Next Steps
1. [ ] Check the data ingestion process to see if new data was added
2. [ ] Check if a new version of dbt or of the dbt adapter was released

## Related Resources
- Link to job run logs
- Link to relevant documentation
```

Commit this document to the repository so findings aren't lost.
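So the template isn't skipped under time pressure, it can be scaffolded with a small helper. `create_findings_doc` is a hypothetical convenience, not part of any dbt tooling:

```python
from datetime import date
from pathlib import Path

# Hypothetical helper: write a skeleton of the findings document described
# above into docs/investigations/, named by today's date.
def create_findings_doc(job_name: str, job_id: int,
                        base_dir: str = "docs/investigations") -> Path:
    today = date.today().isoformat()
    path = Path(base_dir) / f"job-failure-{today}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"# Job Failure Investigation: {job_name}\n\n"
        f"**Date:** {today}\n"
        f"**Job ID:** {job_id}\n"
        "**Status:** Unresolved\n\n"
        "## Summary\n\n"
        "## What Was Checked\n\n"
        "## Patterns Observed\n\n"
        "## Suggested Next Steps\n\n"
        "## Related Resources\n"
    )
    return path

# Usage (writes under the current working directory):
# create_findings_doc("Nightly build", 12345)
```

The investigator then fills in the sections rather than reconstructing the format from memory.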
| Task | Tool/Command |
|---|---|
| Get job run history | list_jobs_runs (MCP) |
| Get detailed error | get_job_run_error (MCP) |
| Check recent git changes | git log --oneline -20 |
| Parse project | dbt parse |
| Compile specific model | dbt compile --select model_name |
| Query data | dbt show --inline "SELECT ..." --output json |
| Run specific test | dbt test --select test_name |
- Modifying tests to pass without investigation
- Skipping git history review
- Not documenting when unresolved
- Making best-guess fixes under pressure
- Ignoring data investigation for test failures