Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Configure, troubleshoot, or investigate incidents in Datadog.
Deploy and configure the Datadog Agent on Kubernetes.
Steps:
- Confirm the Datadog site (EU: datadoghq.eu / US: datadoghq.com) and the features needed (APM, logs, process monitoring).
- Install or upgrade the Agent: helm upgrade --install datadog datadog/datadog -f values.yaml -n datadog
- Verify the Agent: kubectl exec -n datadog ds/datadog -- agent status
- Add the unified service tags (DD_ENV, DD_SERVICE, DD_VERSION) to the app Deployment.
Reference: references/datadog.md → Agent Setup
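The last step is easy to miss on one container of a multi-container pod. The following is a minimal sketch, not part of the handbook itself, of a check that reports which of the three unified service tagging variables are absent from each container. It assumes the Deployment manifest has already been parsed into a dictionary; `missing_unified_tags` and `example_deployment` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

# The three unified service tagging variables named in the steps above.
REQUIRED_TAGS = ("DD_ENV", "DD_SERVICE", "DD_VERSION")


def missing_unified_tags(deployment: dict) -> dict[str, list[str]]:
    """Map each container name to the unified service tag env vars it lacks."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing: dict[str, list[str]] = {}
    for container in containers:
        # Only explicit `env` entries are inspected; `envFrom` is not resolved.
        present = {entry["name"] for entry in container.get("env", [])}
        absent = [tag for tag in REQUIRED_TAGS if tag not in present]
        if absent:
            missing[container["name"]] = absent
    return missing


# Hypothetical Deployment, shaped as it would be after YAML/JSON parsing.
example_deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "orders-service", "env": [
            {"name": "DD_ENV", "value": "production"},
            {"name": "DD_SERVICE", "value": "orders-service"},
        ]},
    ]}}}
}

print(missing_unified_tags(example_deployment))
```

For the example manifest the check reports that the orders-service container lacks DD_VERSION. Variables injected through `envFrom` are outside the scope of this sketch and would need a separate lookup.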
Add APM tracing to a service.
Steps:
- dd-trace init must be the first import in Node.js; in Python, use ddtrace-run or patch_all().
Reference: references/datadog.md → APM Instrumentation, Unified Service Tagging
Create a Datadog monitor for a service.
Steps:
- Use the Terraform datadog_monitor resource (preferred over UI / API for IaC).
- Set notify_no_data: true and no_data_timeframe so that silent services alert.
- Tag with service:, env:, team: for routing.
Output: the monitor query, thresholds, and a notification message with @pagerduty-* and @slack-* handles.
Reference: references/datadog.md → Monitors, Terraform Monitor
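To make the expected output concrete, here is a hedged sketch that assembles the pieces listed above (query, thresholds, no-data settings, routing tags, notification handles) into a dictionary shaped like a Datadog monitor definition. It illustrates the fields only and does not replace the Terraform resource. The function name, the 5-minute window, the threshold values, the 10-minute no-data timeframe, and the `@pagerduty-<team>` and `@slack-incidents` handles are all assumptions chosen for the example.

```python
def error_rate_monitor(service: str, env: str, team: str,
                       critical: float = 0.05, warning: float = 0.02) -> dict:
    """Build an illustrative error-rate monitor definition as a plain dict."""
    scope = "service:%s,env:%s" % (service, env)
    # Error ratio over the last 5 minutes (window length is an assumption).
    query = (
        "sum(last_5m):sum:trace.web.request.errors{%s}.as_count() / "
        "sum:trace.web.request.hits{%s}.as_count() > %s"
        % (scope, scope, critical)
    )
    message = "\n".join([
        "{{#is_alert}}Error rate on %s (%s) is above the critical threshold."
        "{{/is_alert}}" % (service, env),
        "{{#is_no_data}}No trace data received from %s.{{/is_no_data}}" % service,
        # Hypothetical handles; substitute the ones configured for the team.
        "@pagerduty-%s @slack-incidents" % team,
    ])
    return {
        "name": "%s high error rate" % service,
        "type": "query alert",
        "query": query,
        "message": message,
        "tags": ["service:%s" % service, "env:%s" % env, "team:%s" % team],
        "options": {
            "thresholds": {"critical": critical, "warning": warning},
            "notify_no_data": True,   # alert when the service goes silent
            "no_data_timeframe": 10,  # minutes; assumed value
        },
    }


monitor = error_rate_monitor("orders-service", "production", "payments")
print(monitor["query"])
```

The metric names follow the trace.web.request.* names used elsewhere in this document; the actual span name depends on the framework being traced, so test the query in Metrics Explorer before relying on it.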
Create a Datadog dashboard for a service.
Steps:
- Use the datadog_dashboard resource with timeseries_definition widgets.
- Chart trace.web.request.hits, trace.web.request.errors, and trace.web.request percentiles.
- Add template variables env and service for reuse across environments.
Reference: references/datadog.md → Dashboards
Define a Datadog SLO.
Steps:
- Use the datadog_service_level_objective resource.
Reference: references/datadog.md → SLOs
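When choosing a target for the SLO, it can help to translate the percentage into an error budget. The sketch below is a worked calculation for a time-based SLO, not something taken from the reference file; the function names and the 99.9% / 30-day figures are assumptions used for illustration.

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a time-based SLO."""
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    return (100 - target_percent) / 100 * window_minutes


def budget_remaining(target_percent: float, window_days: int,
                     bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(target_percent, window_days)
    return 1 - bad_minutes / budget


# Assumed example: a 99.9% target over a 30-day window.
print(round(error_budget_minutes(99.9, 30), 1))
print(round(budget_remaining(99.9, 30, bad_minutes=21.6), 2))
```

A 99.9% target over 30 days leaves about 43.2 minutes of budget, so 21.6 bad minutes consume roughly half of it.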
Live incident investigation using the Datadog MCP server.
Requires the Datadog MCP server connected to Claude Code. See setup in references/datadog.md → MCP Server Setup.
Ask Claude to run these via the MCP server:
service:<name> over the last 2 hours
What monitors are currently firing for service:orders-service env:production?
Show me the event stream for the orders-service in the last 30 minutes.
Were there any deployments to orders-service in the last 2 hours?
Correlate the three pillars through the MCP:
Show me error logs for service:orders-service between <start> and <end>.
What is the error rate and p99 latency for orders-service over the last hour?
Find traces with errors for orders-service and show me the top error messages.
Narrow down with MCP queries:
Compare the error rate for orders-service before and after <incident-start-time>.
Which endpoints have the highest error rate on orders-service right now?
Show me CPU and memory metrics for the hosts running orders-service.
Resolve the monitor "orders-service high error rate": the fix has been deployed.
Post to #incidents: "orders-service error rate returning to baseline, fix deployed at <time>."
Create a Datadog notebook summarising the orders-service incident timeline.
Reference: references/datadog.md → MCP Server Setup, Incident Investigation Workflow
Diagnose Datadog data gaps or agent issues without the MCP server.
Classify the failure:
Evidence to collect:
# Agent health
kubectl exec -n datadog ds/datadog -- agent status
# Check APM port
kubectl exec -n datadog ds/datadog -- agent check apm
# Verify pod env vars
kubectl describe pod <app-pod> | grep -E "DD_ENV|DD_SERVICE|DD_VERSION|DD_TRACE"
# Test metric query in Metrics Explorer before using in monitor
Provide: symptom → root cause hypothesis → evidence command → fix → validation
Perform Datadog operations via the pup CLI — log search, metric queries, monitor management, and post-deploy gates.
Requires pup installed and DD_API_KEY, DD_APP_KEY, DD_SITE set in shell. See references/datadog.md → pup CLI for install and config.
Steps:
- pup command with flags
- references/datadog.md → pup CLI
- --format json output
Reference: references/datadog.md → pup CLI
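A post-deploy gate of the kind mentioned above reduces to a pass/fail decision over a set of metric samples. The sketch below shows only that decision logic. It assumes the numeric samples have already been extracted from the CLI's JSON output, whose exact shape is not given here and is deliberately not guessed; `evaluate_gate`, the 0.05 limit, and the minimum of three points are hypothetical choices.

```python
def evaluate_gate(values, max_allowed, min_points=1):
    """Return (passed, reason) for a list of numeric metric samples."""
    # Treat missing data as a failure so a silent service cannot pass the gate.
    if len(values) < min_points:
        return False, "only %d data point(s); need at least %d" % (
            len(values), min_points)
    worst = max(values)
    if worst > max_allowed:
        return False, "worst value %s exceeds limit %s" % (worst, max_allowed)
    return True, "all %d point(s) within limit %s" % (len(values), max_allowed)


# Hypothetical samples standing in for values pulled from the JSON output.
passed, reason = evaluate_gate([0.01, 0.02, 0.03], max_allowed=0.05, min_points=3)
print("PASS" if passed else "FAIL", reason)
```

In a CI job the step would exit non-zero when `passed` is false, so the pipeline stops before promoting the release. Failing on too few points mirrors the no-data alerting recommended for monitors.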
Instrument an AI application with Datadog LLM Observability, bootstrap evaluators, analyze experiments, or root-cause LLM failures.
Steps:
- dd-llmo-eval-trace-rca
- dd-llmo-experiment-analyzer
- LLMObs.enable() init and @llm / llmobs.trace() call for the main LLM call; add DD_LLMOBS_ML_APP and DD_LLMOBS_AGENTLESS_ENABLED to the Deployment env vars
- LLMObs.submit_evaluation() calls (kwarg span=LLMObs.export_span()) after each LLM span; generate a pup-based CI gate querying the ml_obs.evaluations.<label> metric
- /dd-llmo-eval-trace-rca with the failing trace ID; provide the manual pup apm traces get fallback
- experiment.name, experiment.variant, experiment.cohort tag annotations; instruct the user to invoke /dd-llmo-experiment-analyzer; provide a manual pup comparison fallback
- DD_LLMOBS_ML_APP: spans without this tag are silently discarded
Reference: references/llm-observability.md