langfuse-incident-runbook

Troubleshoot and respond to Langfuse-related incidents and outages. Use when experiencing Langfuse outages, debugging production issues, or responding to LLM observability incidents. Trigger with phrases like "langfuse incident", "langfuse outage", "langfuse down", "langfuse production issue", "langfuse troubleshoot".

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Securityby

Passed

No known issues

Langfuse Incident Runbook

Overview

Step-by-step procedures for Langfuse-related incidents, from initial triage (2 min) through resolution and post-incident review. Your application should work without Langfuse -- these procedures focus on restoring observability.

Severity Classification

Severity	Description	Response Time	Example
P1	Application impacted by tracing	15 min	SDK throwing unhandled errors, blocking requests
P2	Traces not appearing, no app impact	1 hour	Missing observability data
P3	Degraded performance from tracing	4 hours	High latency from flush backlog
P4	Minor issues	24 hours	Occasional missing traces

Instructions

Step 1: Initial Assessment (2 Minutes)

set -euo pipefail
echo "=== Langfuse Incident Triage ==="
echo "Time: $(date -u)"

# 1. Check Langfuse cloud status
echo -n "Status page: "
curl -s -o /dev/null -w "%{http_code}" https://status.langfuse.com || echo "UNREACHABLE"
echo ""

# 2. Test API connectivity
HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}"
echo -n "API health: "
curl -s -o /dev/null -w "%{http_code} (%{time_total}s)" "$HOST/api/public/health" || echo "FAILED"
echo ""

# 3. Test auth
if [ -n "${LANGFUSE_PUBLIC_KEY:-}" ] && [ -n "${LANGFUSE_SECRET_KEY:-}" ]; then
  AUTH=$(echo -n "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64)
  echo -n "Auth test: "
  curl -s -o /dev/null -w "%{http_code}" \
    -H "Authorization: Basic $AUTH" "$HOST/api/public/traces?limit=1" || echo "FAILED"
  echo ""
fi

# 4. Check app error logs
echo ""
echo "--- Recent errors ---"
grep -i "langfuse\|trace.*error\|flush.*fail" /var/log/app/*.log 2>/dev/null | tail -10 || echo "No log files found"

Step 2: Determine Incident Type and Response

Symptom	Likely Cause	Immediate Action
No traces appearing	SDK not flushing	Check shutdown handlers; set `flushAt: 1` temporarily
`401 Unauthorized`	Key rotation or mismatch	Verify keys match the correct project
`429 Too Many Requests`	Rate limited	Increase batch size, reduce flush frequency
SDK throwing errors	Unhandled exception	Wrap in try/catch; check SDK version
High request latency	Sync flush in hot path	Switch to async; increase `requestTimeout`
Complete Langfuse outage	Service-side issue	Enable fallback mode

Step 3: Fallback Mode (P1 -- App Impacted)

If Langfuse is causing application issues, disable tracing immediately:

// Emergency disable via environment variable
// Set LANGFUSE_ENABLED=false in your deployment

// In your tracing initialization:
if (process.env.LANGFUSE_ENABLED === "false") {
  console.warn("Langfuse tracing DISABLED (emergency fallback)");
  // Don't initialize SDK -- all observe/startActiveObservation calls
  // will still work but produce no-op spans
}

For v3, use the enabled flag:

const langfuse = new Langfuse({
  enabled: process.env.LANGFUSE_ENABLED !== "false",
});

Step 4: Common Resolution Procedures

Procedure A: Missing Traces

// 1. Verify SDK is initialized
console.log("Langfuse configured:", !!process.env.LANGFUSE_PUBLIC_KEY);

// 2. Check flush is happening
// v4+: Verify NodeSDK is started and shutdown is registered
// v3: Verify flushAsync() or shutdownAsync() is called

// 3. Temporarily set aggressive flush for debugging
const processor = new LangfuseSpanProcessor({
  exportIntervalMillis: 1000,
  maxExportBatchSize: 1,
});

Procedure B: Rate Limit (429) Recovery

// Increase batching to reduce API calls
const processor = new LangfuseSpanProcessor({
  exportIntervalMillis: 30000, // 30s flush
  maxExportBatchSize: 200,     // Large batches
});

// Or temporarily enable sampling
const EMERGENCY_SAMPLE_RATE = 0.1; // Only trace 10%

Procedure C: Self-Hosted Instance Down

set -euo pipefail
# Check container status
docker ps -a | grep langfuse

# Check logs
docker logs langfuse-langfuse-1 --tail 50

# Check database
docker exec langfuse-postgres-1 pg_isready -U langfuse

# Restart if needed
docker compose restart langfuse

Step 5: Post-Incident Verification

set -euo pipefail
# Verify traces are flowing again
echo "=== Post-Incident Check ==="

HOST="${LANGFUSE_BASE_URL:-https://cloud.langfuse.com}"
AUTH=$(echo -n "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64)

# Check recent trace count
TRACE_COUNT=$(curl -s \
  -H "Authorization: Basic $AUTH" \
  "$HOST/api/public/traces?limit=5" | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('data',[])))" 2>/dev/null || echo "ERROR")

echo "Recent traces: $TRACE_COUNT"

if [ "$TRACE_COUNT" = "0" ] || [ "$TRACE_COUNT" = "ERROR" ]; then
  echo "WARNING: Traces may not be flowing yet"
else
  echo "OK: Traces are appearing"
fi

Step 6: Post-Incident Review (P1/P2)

Document for post-mortem:

Timeline: When detected, when resolved, total duration
Impact: Traces lost, application impact, user impact
Root cause: Why did the incident occur?
Resolution: What fixed it?
Prevention: What changes prevent recurrence?
Action items: Improvements to implement

Escalation Path

Level	Who	When
L1	On-call engineer	All incidents -- run triage
L2	Platform team lead	P1/P2 unresolved after 30 min
L3	Langfuse support	Confirmed service-side issue

Langfuse support channels:

Status Page -- check first
Discord -- community support
GitHub Issues -- bug reports
Email support (enterprise customers)

Error Handling

Issue	Immediate Fix	Permanent Fix
SDK crashes app	Set `LANGFUSE_ENABLED=false`	Wrap all tracing in try/catch
Lost traces	Increase batch size	Add shutdown handlers
High latency	Disable sync flush	Use async-only patterns
Auth failures	Rotate and redeploy keys	Add key validation at startup

Resources

Repository: jeremylongshore/claude-code-plugins-plus-skills
Commit: 3022dd3

Last updated: 1 day ago
Created: 1 day ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.