Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Covers APM, Infrastructure Monitoring, Log Management, Synthetic Monitoring, Dashboards, Monitors, and SLOs.
The official Datadog MCP server lets Claude Code query logs, metrics, traces, monitors, and incidents directly — no manual API calls needed during incident investigation.
# EU site
claude mcp add --transport http datadog https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp
# US1 site
claude mcp add --transport http datadog https://mcp.datadoghq.com/api/unstable/mcp-server/mcp

Or add to .mcp.json in your project root:

{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp"
    }
  }
}

Authentication uses your Datadog session: log in via the browser prompt on first use.

Note: The MCP endpoint path contains /api/unstable/. This is the URL Datadog currently publishes. The path may change when the API is promoted to stable; check docs.datadoghq.com/bits_ai/mcp_server/ for the latest endpoint.
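The two endpoints above differ only in the site host, so the .mcp.json content can be generated per site. The following is a minimal sketch, not an official tool: the `MCP_HOSTS` mapping and the `mcp_config` function are hypothetical names introduced here, and the path is the currently published one, which may change.

```python
import json

# Site-to-host mapping, taken from the two endpoints shown above.
MCP_HOSTS = {
    "datadoghq.eu": "mcp.datadoghq.eu",
    "datadoghq.com": "mcp.datadoghq.com",
}

# Path as currently published; it may change once the API leaves "unstable".
MCP_PATH = "/api/unstable/mcp-server/mcp"


def mcp_config(site: str) -> dict:
    """Build the .mcp.json structure for one Datadog site (hypothetical helper)."""
    if site not in MCP_HOSTS:
        raise ValueError(f"unsupported site: {site}")
    return {
        "mcpServers": {
            "datadog": {
                "type": "http",
                "url": f"https://{MCP_HOSTS[site]}{MCP_PATH}",
            }
        }
    }


if __name__ == "__main__":
    # Print the config; redirect to .mcp.json in the project root if desired.
    print(json.dumps(mcp_config("datadoghq.eu"), indent=2))
```

Keeping the path in one constant means a future endpoint change is a one-line edit.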
| Category | What you can do |
|---|---|
| Monitors | List firing monitors, get monitor details, resolve monitors |
| Logs | Search logs by service/env/time, filter by status |
| Metrics | Query time series, compare before/after incident |
| APM Traces | Fetch traces with errors, inspect spans and stack traces |
| Events | List deployment events, audit events by tag |
| Incidents | List active incidents, post updates |
| Notebooks | Create notebooks for post-mortem documentation |
Use /platform-skills:datadog investigate for a guided 4-phase workflow:
# datadog-values.yaml
datadog:
  apiKeyExistingSecret: "datadog-secret" # kubectl create secret generic datadog-secret --from-literal=api-key="${DD_API_KEY}"
  site: "datadoghq.eu" # or datadoghq.com
  apm:
    portEnabled: true
  logs:
    enabled: true
    containerCollectAll: true
  processAgent:
    enabled: true
  clusterName: "prod-eks"
agents:
  tolerations:
    - operator: Exists
clusterAgent:
  enabled: true
  replicas: 2

# Create the API key secret first; never pass the key on the command line
kubectl create secret generic datadog-secret \
--from-literal=api-key="${DD_API_KEY}" \
-n datadog --dry-run=client -o yaml | kubectl apply -f -
helm repo add datadog https://helm.datadoghq.com
helm upgrade --install datadog datadog/datadog \
-f datadog-values.yaml \
-n datadog --create-namespace

kubectl exec -n datadog ds/datadog -- agent status
kubectl exec -n datadog ds/datadog -- agent check disk

// Must be the very first import
import tracer from "dd-trace";

tracer.init({
  service: "orders-service",
  env: process.env.DD_ENV ?? "production",
  version: process.env.DD_VERSION, // inject from CI: git sha or semver
  logInjection: true, // correlate logs and traces
});

// Custom span
const span = tracer.startSpan("payment.process");
span.setTag("payment.method", "card");
try {
  await processPayment(order);
} catch (err) {
  span.setTag("error", err); // marks the span as errored
  throw err;
} finally {
  span.finish(); // always close the span, on success or failure
}

# run with: ddtrace-run python app.py
# or instrument programmatically:
from ddtrace import tracer, patch_all

patch_all()  # auto-instrument Django, Flask, SQLAlchemy, Redis, etc.

with tracer.trace("payment.process", service="orders-service") as span:
    span.set_tag("payment.method", "card")
    result = process_payment(order)

Set these three tags consistently across all telemetry:
# Pod labels / env vars
DD_ENV: production
DD_SERVICE: orders-service
DD_VERSION: "1.2.3" # matches git tag or image tag

import pino from "pino";
const logger = pino({
  level: "info",
  // dd-trace injects trace_id and span_id when logInjection: true
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: ["req.headers.authorization", "*.password"],
});

logger.info({ orderId, userId }, "order.created");

With containerCollectAll: true, the Agent tails stdout/stderr from all containers. Add processing rules to drop noisy lines and mask secrets:
# agent config
logs_config:
  processing_rules:
    - type: exclude_at_match
      name: exclude_healthcheck
      pattern: "GET /healthz"
    - type: mask_sequences
      name: mask_tokens
      replace_placeholder: "[REDACTED]"
      # single quotes: \s is not a valid escape in a double-quoted YAML string
      pattern: '(Authorization: Bearer )[^\s]+'

Convert high-cardinality logs into cost-effective metrics in the Datadog UI:

service:orders-service status:error → metric custom.orders.errors

# Create via API
curl -X POST "https://api.datadoghq.eu/api/v1/monitor" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-service high error rate",
    "type": "metric alert",
    "query": "sum(last_5m):sum:trace.web.request.errors{service:orders-service,env:production}.as_count() / sum:trace.web.request.hits{service:orders-service,env:production}.as_count() > 0.05",
    "message": "Error rate above 5% on orders-service @pagerduty-platform @slack-platform-alerts",
    "tags": ["service:orders-service", "env:production", "team:platform"],
    "options": {
      "thresholds": {"critical": 0.05, "warning": 0.02},
      "notify_no_data": true,
      "no_data_timeframe": 10,
      "renotify_interval": 30
    }
  }'

resource "datadog_monitor" "orders_error_rate" {
  name    = "orders-service high error rate"
  type    = "metric alert"
  message = "Error rate above 5% on orders-service @pagerduty-platform"

  query = <<-EOQ
    sum(last_5m):
    sum:trace.web.request.errors{service:orders-service,env:production}.as_count()
    / sum:trace.web.request.hits{service:orders-service,env:production}.as_count()
    > 0.05
  EOQ

  monitor_thresholds {
    critical = 0.05
    warning  = 0.02
  }

  tags              = ["service:orders-service", "env:production", "team:platform"]
  notify_no_data    = true
  no_data_timeframe = 10
  renotify_interval = 30
}

resource "datadog_dashboard" "orders_service" {
  title       = "Orders Service — RED"
  description = "Request rate, error rate, and latency for the orders service"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "Request Rate"
      request {
        q            = "sum:trace.web.request.hits{service:orders-service,env:production}.as_rate()"
        display_type = "line"
      }
    }
  }

  widget {
    timeseries_definition {
      title = "Error Rate %"
      request {
        q            = "100 * sum:trace.web.request.errors{service:orders-service,env:production}.as_count() / sum:trace.web.request.hits{service:orders-service,env:production}.as_count()"
        display_type = "line"
      }
      # HCL allows only one argument in a single-line block, so yaxis spans several lines
      yaxis {
        min = "0"
        max = "100"
      }
    }
  }

  widget {
    timeseries_definition {
      title = "p50/p95/p99 Latency"
      request {
        q            = "p50:trace.web.request{service:orders-service,env:production}"
        display_type = "line"
        style {
          palette = "cool"
        }
      }
      request {
        q            = "p95:trace.web.request{service:orders-service,env:production}"
        display_type = "line"
      }
      request {
        q            = "p99:trace.web.request{service:orders-service,env:production}"
        display_type = "line"
      }
    }
  }
}

resource "datadog_service_level_objective" "orders_availability" {
  name        = "Orders Service Availability"
  type        = "metric"
  description = "99.9% of requests succeed over a rolling 30-day window"

  query {
    numerator   = "sum:trace.web.request.hits{service:orders-service,env:production,!status:error}.as_count()"
    denominator = "sum:trace.web.request.hits{service:orders-service,env:production}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.9
    warning   = 99.95
  }

  tags = ["service:orders-service", "env:production"]
}

resource "datadog_synthetics_test" "orders_api" {
  name      = "Orders API — create order"
  type      = "api"
  subtype   = "http"
  status    = "live"
  locations = ["aws:eu-west-1", "aws:eu-central-1"]
  message   = "Orders API endpoint is failing @slack-platform-alerts"

  request_definition {
    method = "POST"
    url    = "https://api.example.com/orders"
    body   = jsonencode({ productId = "prod-smoke-test", quantity = 1 })
  }

  request_headers = {
    "Content-Type"  = "application/json"
    "Authorization" = "Bearer {{orders-api-smoke-token}}"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "201"
  }

  assertion {
    type     = "responseTime"
    operator = "lessThan"
    target   = "1000"
  }

  options_list {
    tick_every = 300 # every 5 minutes
  }
}

| Symptom | Evidence | Fix |
|---|---|---|
| Traces not appearing | agent status → APM section | Enable apm.portEnabled: true; check DD_TRACE_AGENT_HOSTNAME |
| Logs missing trace_id | Log entries lack dd.trace_id | Set logInjection: true in tracer init |
| No data in monitor | Monitor shows "No Data" | Check metric query resolves in Metrics Explorer first |
| High dd-trace overhead | Latency increase > 5ms | Reduce sampling rate: DD_TRACE_SAMPLE_RATE=0.1 |
| Pod not sending metrics | Agent shows 0 checks | Verify pod labels match datadog.checks annotations |
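The availability SLO defined earlier (99.9% over 30 days, computed from request counts) implies an error budget that can be worked out by hand. The sketch below shows that arithmetic only. The function names are hypothetical and it does not call any Datadog API.

```python
def allowed_bad_requests(target_pct: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for a given request volume."""
    return total_requests * (100.0 - target_pct) / 100.0


def budget_consumed(bad_requests: int, total_requests: int, target_pct: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed = allowed_bad_requests(target_pct, total_requests)
    if allowed == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return float("inf") if bad_requests > 0 else 0.0
    return bad_requests / allowed


if __name__ == "__main__":
    # Example figures are illustrative, not measurements.
    total, bad = 1_000_000, 250
    print(f"allowed failures: {allowed_bad_requests(99.9, total):.0f}")
    print(f"budget consumed:  {budget_consumed(bad, total, 99.9):.1%}")
```

At a 99.9% target, one million requests leave room for roughly one thousand failures, so 250 failures would use about a quarter of the budget.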
- DD_API_KEY_SECRET_NAME to reference secret by name in the Helm chart
- datadog.logs.containerCollectAll: true only if log volume is manageable; filter noisy sources

pup is a Rust-based CLI from Datadog Labs for scripting Datadog API interactions (monitors, logs, metrics, and more) without the GUI or Terraform.
# macOS (verify tap name at https://github.com/datadog-labs/pup)
brew install datadog/datadog-labs/pup
# Linux — check https://github.com/datadog-labs/pup/releases for the latest binary
# Example (verify URL before running):
# curl -sSL https://github.com/datadog-labs/pup/releases/latest/download/pup-linux-amd64 \
#   -o /usr/local/bin/pup && chmod +x /usr/local/bin/pup

export DD_API_KEY="<your-api-key>"
export DD_APP_KEY="<your-app-key>"
export DD_SITE="datadoghq.eu" # or datadoghq.com

Or use a profile file at ~/.config/pup/profiles.yaml:

default:
  api_key: "${DD_API_KEY}"
  app_key: "${DD_APP_KEY}"
  site: datadoghq.eu

# List all monitors currently in ALERT state
pup monitors list --status alert
# Get details of a specific monitor
pup monitors get --id 12345678
# Search logs for errors in the last 30 minutes
pup logs search \
--query "service:orders-service status:error" \
--from "now-30m" --to "now"
# Query a time series metric
pup metrics query \
--query "avg:trace.web.request{service:orders-service}" \
--from "now-1h" --to "now"
# Mute a monitor (during a deploy window)
pup monitors mute --id 12345678 --end "$(( $(date +%s) + 3600 ))" # epoch +1h, macOS and Linux
# Unmute after deploy
pup monitors unmute --id 12345678

#!/usr/bin/env bash
# post-deploy: fail CI if error rate exceeds 5% within 5 minutes of deploy
set -euo pipefail

THRESHOLD=5
SERVICE="orders-service"
ENV="production"

sleep 300 # wait 5 minutes

ERROR_RATE=$(pup metrics query \
  --query "100 * sum:trace.web.request.errors{service:${SERVICE},env:${ENV}}.as_count() / sum:trace.web.request.hits{service:${SERVICE},env:${ENV}}.as_count()" \
  --from "now-5m" --to "now" \
  --format json | jq '.series[0].pointlist[-1][1] // 0')

# If no data points are returned (service dark), ERROR_RATE=0 and the gate passes
# awk handles float comparison without requiring bc (works in Alpine/minimal CI images)
if awk "BEGIN { exit !($ERROR_RATE > $THRESHOLD) }"; then
  echo "❌ Post-deploy error rate ${ERROR_RATE}% exceeds threshold ${THRESHOLD}%"
  exit 1
fi
echo "✅ Post-deploy error rate ${ERROR_RATE}% is within threshold"

| Symptom | Fix |
|---|---|
| 401 Unauthorized | Check DD_API_KEY and DD_APP_KEY are set and valid for your site |
| 403 Forbidden | APP key lacks required scope; add Monitors Read or Logs Read |
| pup: command not found | Check brew link pup or verify /usr/local/bin is in PATH |
| Empty results from logs search | Adjust --from/--to range; verify service tag matches log pipeline |
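The post-deploy gate script above treats a query with no data points as a 0% error rate, so a service that went completely dark would pass. The sketch below shows the same decision with the no-data case made explicit. It assumes the JSON shape implied by the script's jq expression (`series[0].pointlist[-1][1]`), and the `fail_on_no_data` parameter and both function names are hypothetical additions, not part of pup.

```python
from typing import Optional


def latest_error_rate(payload: dict) -> Optional[float]:
    """Return the newest point's value, or None when the query returned no data."""
    series = payload.get("series") or []
    if not series:
        return None
    points = series[0].get("pointlist") or []
    if not points:
        return None
    value = points[-1][1]  # each point is [timestamp, value]
    return None if value is None else float(value)


def gate_passes(payload: dict, threshold_pct: float, fail_on_no_data: bool = True) -> bool:
    """Pass when the latest error rate is at or below the threshold."""
    rate = latest_error_rate(payload)
    if rate is None:
        # No data is ambiguous: the service may be idle or may be down.
        return not fail_on_no_data
    return rate <= threshold_pct


if __name__ == "__main__":
    sample = {"series": [{"pointlist": [[1700000000000, 1.2], [1700000060000, 6.5]]}]}
    print("gate passes:", gate_passes(sample, threshold_pct=5))
```

Setting `fail_on_no_data=False` reproduces the shell script's behaviour, where missing data lets the gate pass.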
Datadog Labs publishes Claude skills that complement the official MCP server. Install them with:
claude plugin marketplace add https://github.com/datadog-labs/dd-pup
claude plugin marketplace add https://github.com/datadog-labs/dd-apm
claude plugin marketplace add https://github.com/datadog-labs/dd-logs
claude plugin marketplace add https://github.com/datadog-labs/dd-monitors
claude plugin marketplace add https://github.com/datadog-labs/dd-docs
claude plugin install dd-pup dd-apm dd-logs dd-monitors dd-docs

| Skill | Invocation | What it does |
|---|---|---|
| dd-pup | /dd-pup | Installs and configures the pup CLI; wraps common pup operations |
| dd-apm | /dd-apm | Query APM data: traces, service maps, error rates, latency percentiles |
| dd-logs | /dd-logs | Search, filter, archive, and analyze Datadog logs through pup |
| dd-monitors | /dd-monitors | Create, update, mute, and resolve Datadog monitors through pup |
| dd-docs | /dd-docs | Look up Datadog documentation via the LLM-optimized docs index |
| Use case | Recommended |
|---|---|
| Interactive incident investigation (live session) | Official MCP server |
| Scripting, CI/CD gates, automation | pup CLI + dd-pup skill |
| Querying APM data from editor without browser | dd-apm skill |
| Log search and analysis in editor | dd-logs skill |
| Monitor management without Terraform | dd-monitors skill |
| Looking up Datadog documentation | dd-docs skill |
Note: Labs skills use the pup CLI under the hood. Ensure DD_API_KEY, DD_APP_KEY, and DD_SITE are set in your shell before invoking them.
Datadog has a built-in FluxCD integration (bundled in Agent 7.51.0+, minimum Agent 7.49.1). It collects reconciliation metrics from all Flux controllers via OpenMetrics.
- source-controller
- kustomize-controller
- helm-controller
- notification-controller

Annotate Flux controller pods to enable metric scraping:

# Applied via FluxInstance spec.kustomize.patches
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  kustomize:
    patches:
      - target:
          kind: Deployment
          labelSelector: "app.kubernetes.io/part-of=flux"
        patch: |
          - op: add
            path: /spec/template/metadata/annotations/ad.datadoghq.com~1fluxcd.checks
            value: |
              {
                "fluxcd": {
                  "instances": [
                    {
                      "openmetrics_endpoint": "http://%%host%%:8080/metrics"
                    }
                  ]
                }
              }

Log collection is disabled by default. Enable it in the Datadog Agent DaemonSet:
logs:
  enabled: true
  config:
    - source: fluxcd
      service: "<controller-name>" # e.g. kustomize-controller

| Metric | Description |
|---|---|
| fluxcd.gotk.reconcile.duration.seconds | Reconciliation latency per controller and resource kind |
| fluxcd.gotk.reconcile.condition | Reconciliation success/failure (Ready=True/False) |
| fluxcd.workqueue.depth | Pending reconciliation queue depth |
| fluxcd.workqueue.retries.total | Retry count; elevated values indicate failing resources |
| fluxcd.controller.runtime.active_workers | Active reconciliation workers vs max concurrency |
| fluxcd.process.cpu_seconds.total | Controller CPU usage |
| fluxcd.process.resident_memory_bytes | Controller memory footprint |
| fluxcd.leader_election.master_status | 1 = leader, 0 = standby |
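The fluxcd.gotk.reconcile.condition metric can drive one monitor per Flux resource kind, and the reconciliation monitors in this section share a single query shape. The sketch below generates such a payload for a given kind. The `reconcile_monitor` function is a hypothetical helper, and the tag names (ready, kind, name, namespace) are assumed to match the ones used in this document's monitor definitions.

```python
import json


def reconcile_monitor(kind: str) -> dict:
    """Build a monitor payload that fires when any resource of `kind` is not Ready."""
    query = (
        "sum(last_5m):sum:fluxcd.gotk.reconcile.condition"
        "{ready:false,kind:" + kind.lower() + "} by {name,namespace} > 0"
    )
    return {
        "name": "FluxCD " + kind + " reconciliation failing",
        "type": "metric alert",
        "query": query,
        # {{name.name}} and {{namespace.name}} are Datadog template variables,
        # filled in per group when the monitor triggers.
        "message": kind + " {{name.name}} in {{namespace.name}} is not Ready.",
    }


if __name__ == "__main__":
    for kind in ("HelmRelease", "Kustomization"):
        print(json.dumps(reconcile_monitor(kind), indent=2))
```

Plain string concatenation is used instead of an f-string so the literal braces in the query and message need no escaping.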
HelmRelease reconciliation failure:
{
  "name": "FluxCD HelmRelease reconciliation failing",
  "type": "metric alert",
  "query": "sum(last_5m):sum:fluxcd.gotk.reconcile.condition{ready:false,kind:helmrelease} by {name,namespace} > 0",
  "message": "HelmRelease {{name.name}} in {{namespace.name}} is not Ready. Check: flux get helmrelease {{name.name}} -n {{namespace.name}}"
}

Kustomization reconciliation failure:

{
  "name": "FluxCD Kustomization reconciliation failing",
  "type": "metric alert",
  "query": "sum(last_5m):sum:fluxcd.gotk.reconcile.condition{ready:false,kind:kustomization} by {name,namespace} > 0",
  "message": "Kustomization {{name.name}} in {{namespace.name}} is not Ready."
}

Workqueue saturation:

{
  "name": "FluxCD workqueue depth elevated",
  "type": "metric alert",
  "query": "avg(last_10m):avg:fluxcd.workqueue.depth{*} by {name} > 50",
  "message": "FluxCD {{name.name}} workqueue is backing up — reconciliation may be slow."
}

fluxcd.openmetrics.health: returns CRITICAL if the OpenMetrics endpoint is unreachable. Alert on this to detect controller pod restarts or port changes.
Import the community FluxCD dashboard from the Datadog integrations page, or build one with: