langfuse-incident-runbook

Troubleshoot and respond to Langfuse-related incidents and outages. Use when experiencing Langfuse outages, debugging production issues, or responding to LLM observability incidents. Trigger with phrases like "langfuse incident", "langfuse outage", "langfuse down", "langfuse production issue", "langfuse troubleshoot".

Langfuse Incident Runbook

Overview

Step-by-step procedures for Langfuse-related incidents, from initial triage (2 min) through resolution and post-incident review. Your application should work without Langfuse -- these procedures focus on restoring observability.

Severity Classification

| Severity | Description | Response Time | Example |
|---|---|---|---|
| P1 | Application impacted by tracing | 15 min | SDK throwing unhandled errors, blocking requests |
| P2 | Traces not appearing, no app impact | 1 hour | Missing observability data |
| P3 | Degraded performance from tracing | 4 hours | High latency from flush backlog |
| P4 | Minor issues | 24 hours | Occasional missing traces |
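
The severity table can be encoded so that alerting or triage tooling applies it consistently. The following is a minimal sketch, not part of any Langfuse SDK; the `TriageSignals` field names and both helper functions are hypothetical and only illustrate the mapping.

```typescript
// Severity levels and response targets, taken from the table above.
type Severity = "P1" | "P2" | "P3" | "P4";

const RESPONSE_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 4 * 60,
  P4: 24 * 60,
};

// Hypothetical triage signals; the field names are illustrative only.
interface TriageSignals {
  appImpacted: boolean;        // SDK errors or blocked requests
  tracesMissing: boolean;      // no traces arriving, app unaffected
  tracingLatencyHigh: boolean; // e.g. flush backlog slowing requests
}

// The most severe matching row wins.
function classifySeverity(s: TriageSignals): Severity {
  if (s.appImpacted) return "P1";
  if (s.tracesMissing) return "P2";
  if (s.tracingLatencyHigh) return "P3";
  return "P4";
}

// Latest time by which a response is due for an incident detected at `detectedAt`.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[severity] * 60000);
}
```

For example, `responseDeadline("P1", new Date("2025-01-01T12:00:00Z"))` yields 12:15 UTC on the same day.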

Instructions

Step 1: Initial Assessment (2 Minutes)

set -euo pipefail
echo "=== Langfuse Incident Triage ==="
echo "Time: $(date -u)"

# 1. Check Langfuse cloud status
echo -n "Status page: "
curl -s -o /dev/null -w "%{http_code}" https://status.langfuse.com || echo "UNREACHABLE"
echo ""

# 2. Test API connectivity
HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}"
echo -n "API health: "
curl -s -o /dev/null -w "%{http_code} (%{time_total}s)" "$HOST/api/public/health" || echo "FAILED"
echo ""

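# 3. Test auth (Basic auth: public key as username, secret key as password)
if [ -n "${LANGFUSE_PUBLIC_KEY:-}" ] && [ -n "${LANGFUSE_SECRET_KEY:-}" ]; then
  echo -n "Auth test: "
  # curl -u builds the header itself; piping long keys through base64 can insert line wraps
  curl -s -o /dev/null -w "%{http_code}" \
    -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" "$HOST/api/public/traces?limit=1" || echo "FAILED"
  echo ""
else
  echo "Auth test: SKIPPED (LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY not set)"
fi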

# 4. Check app error logs (adjust the path to wherever your app writes logs)
echo ""
echo "--- Recent errors ---"
grep -i "langfuse\|trace.*error\|flush.*fail" /var/log/app/*.log 2>/dev/null | tail -10 || echo "No matching log lines (or no log files) found"

Step 2: Determine Incident Type and Response

| Symptom | Likely Cause | Immediate Action |
|---|---|---|
| No traces appearing | SDK not flushing | Check shutdown handlers; set flushAt: 1 temporarily |
| 401 Unauthorized | Key rotation or mismatch | Verify keys match the correct project |
| 429 Too Many Requests | Rate limited | Increase batch size, reduce flush frequency |
| SDK throwing errors | Unhandled exception | Wrap in try/catch; check SDK version |
| High request latency | Sync flush in hot path | Switch to async; increase requestTimeout |
| Complete Langfuse outage | Service-side issue | Enable fallback mode |
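
The HTTP codes printed by the triage script can be mapped onto this table mechanically. The sketch below is one possible mapping; `diagnoseProbe` is a hypothetical helper, and treating any 5xx or failed request as a service-side issue is an assumption (a failed request can also mean a local network problem).

```typescript
interface Diagnosis {
  likelyCause: string;
  immediateAction: string;
}

// `status` is the HTTP code from the API probe, or null if the request failed entirely.
function diagnoseProbe(status: number | null): Diagnosis {
  if (status === null || status >= 500) {
    // Assumption: no response or a 5xx is treated as a service-side issue.
    return { likelyCause: "Service-side issue", immediateAction: "Enable fallback mode" };
  }
  if (status === 401) {
    return { likelyCause: "Key rotation or mismatch", immediateAction: "Verify keys match the correct project" };
  }
  if (status === 429) {
    return { likelyCause: "Rate limited", immediateAction: "Increase batch size, reduce flush frequency" };
  }
  if (status >= 200 && status < 300) {
    // API is healthy, so missing traces point at the client side.
    return { likelyCause: "SDK not flushing", immediateAction: "Check shutdown handlers; set flushAt: 1 temporarily" };
  }
  // Not covered by the table: fall back to manual investigation.
  return { likelyCause: "Unknown", immediateAction: "Inspect the response body and SDK logs" };
}
```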

Step 3: Fallback Mode (P1 -- App Impacted)

If Langfuse is causing application issues, disable tracing immediately:

// Emergency disable via environment variable
// Set LANGFUSE_ENABLED=false in your deployment

// In your tracing initialization:
if (process.env.LANGFUSE_ENABLED === "false") {
  console.warn("Langfuse tracing DISABLED (emergency fallback)");
  // Don't initialize SDK -- all observe/startActiveObservation calls
  // will still work but produce no-op spans
}

For v3, use the enabled flag:

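import { Langfuse } from "langfuse";

// With enabled: false the client is constructed but sends no events
const langfuse = new Langfuse({
  enabled: process.env.LANGFUSE_ENABLED !== "false",
});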
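
When tracing cannot be switched off entirely, individual tracing calls can be isolated so that a failure never reaches the request path. The following is a minimal sketch of that try/catch pattern; `safeTrace` is a hypothetical helper, not an SDK function.

```typescript
// Runs a tracing side effect without letting it throw into the caller.
// Handles both synchronous throws and rejected promises.
function safeTrace(
  record: () => unknown,
  onError: (err: unknown) => void = () => {},
): void {
  try {
    const result = record();
    if (result instanceof Promise) {
      // Attach a handler so a rejection never becomes an unhandled rejection.
      result.catch(onError);
    }
  } catch (err) {
    onError(err);
  }
}

// Usage: the request continues even though the tracing call throws.
safeTrace(
  () => {
    throw new Error("tracing backend unavailable");
  },
  (err) => console.warn("Tracing skipped:", err),
);
```

Passing an `onError` callback keeps the failure visible in logs, which matters later when reconstructing the incident timeline.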

Step 4: Common Resolution Procedures

Procedure A: Missing Traces

import { LangfuseSpanProcessor } from "@langfuse/otel";

// 1. Verify SDK is initialized
console.log("Langfuse configured:", !!process.env.LANGFUSE_PUBLIC_KEY);

// 2. Check flush is happening
// v4+: Verify NodeSDK is started and shutdown is registered
// v3: Verify flushAsync() or shutdownAsync() is called

// 3. Temporarily set aggressive flush for debugging
// (option names assume the @langfuse/otel span processor; verify against your installed version)
const processor = new LangfuseSpanProcessor({
  flushInterval: 1, // seconds between flushes
  flushAt: 1,       // export after every span
});
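
A frequent cause of missing traces is a process that exits before buffered spans are exported. The sketch below shows one way to register a flush on shutdown signals. `registerFlushOnShutdown` and the `SignalSource` interface are hypothetical; the `flush` callback stands in for whatever shutdown call your SDK version provides.

```typescript
// Minimal shape of a signal emitter; Node's global `process` fits this.
interface SignalSource {
  once(signal: string, handler: () => void): unknown;
}

// Flushes buffered spans once, on the first shutdown signal, then exits.
function registerFlushOnShutdown(
  source: SignalSource,
  flush: () => Promise<void>,
  exit: (code: number) => void,
  signals: string[] = ["SIGTERM", "SIGINT"],
): void {
  let started = false;
  for (const signal of signals) {
    source.once(signal, () => {
      if (started) return; // ignore repeated signals while flushing
      started = true;
      flush()
        .catch((err: unknown) => console.error("Langfuse flush failed:", err))
        .then(() => exit(0));
    });
  }
}

// Usage in a Node service (assuming `sdk` is your started NodeSDK instance):
// registerFlushOnShutdown(process, () => sdk.shutdown(), (code) => process.exit(code));
```

The exit runs even if the flush fails, so a broken tracing backend cannot block shutdown.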

Procedure B: Rate Limit (429) Recovery

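import { LangfuseSpanProcessor } from "@langfuse/otel";

// Increase batching to reduce API calls
// (option names assume the @langfuse/otel span processor; verify against your installed version)
const processor = new LangfuseSpanProcessor({
  flushInterval: 30, // seconds between flushes
  flushAt: 200,      // spans per batch
});

// Or temporarily enable sampling
const EMERGENCY_SAMPLE_RATE = 0.1; // Only trace 10%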
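
A sample rate only helps if something applies it. One option is deterministic sampling keyed on a request or trace identifier, so that every span of a request gets the same decision. The sketch below uses an FNV-1a hash; `shouldTrace` and `fnv1a` are hypothetical helpers, and your SDK version may offer a built-in sampling setting that is preferable.

```typescript
// 32-bit FNV-1a hash over UTF-16 code units (adequate for ASCII identifiers).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Returns the same decision for the same key, so a request is traced fully or not at all.
function shouldTrace(traceKey: string, sampleRate: number): boolean {
  if (sampleRate >= 1) return true;
  if (sampleRate <= 0) return false;
  // Map the hash to [0, 1) and compare against the rate.
  return fnv1a(traceKey) / 0x100000000 < sampleRate;
}

// Usage: skip tracing for requests that fall outside the emergency sample.
const emergencyRate = 0.1;
if (shouldTrace("request-1234", emergencyRate)) {
  console.log("tracing this request");
}
```

Because the decision depends only on the key, raising the rate later keeps every previously sampled key in the sample.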

Procedure C: Self-Hosted Instance Down

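set -euo pipefail
# Container and service names below assume a default Docker Compose setup; adjust to yours.

# Check container status (do not abort the script if nothing matches)
docker ps -a | grep langfuse || echo "No langfuse containers found"

# Check logs
docker logs langfuse-langfuse-1 --tail 50

# Check database (use your configured Postgres user)
docker exec langfuse-postgres-1 pg_isready -U langfuse

# Restart if needed
docker compose restart langfuse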

Step 5: Post-Incident Verification

set -euo pipefail
# Verify traces are flowing again
echo "=== Post-Incident Check ==="

HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}"

# Fetch the most recent traces (curl -u sends the Basic auth header and avoids base64 line wraps)
# Note: this returns the latest traces regardless of age; compare their timestamps with the recovery time
TRACE_COUNT=$(curl -s \
  -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  "$HOST/api/public/traces?limit=5" | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('data',[])))" 2>/dev/null || echo "ERROR")

echo "Recent traces: $TRACE_COUNT"

if [ "$TRACE_COUNT" = "0" ] || [ "$TRACE_COUNT" = "ERROR" ]; then
  echo "WARNING: Traces may not be flowing yet"
else
  echo "OK: Traces are appearing"
fi
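
A non-zero trace count only shows that traces exist, not that new ones arrived after recovery. The sketch below counts traces newer than a given time. It assumes each item in the API response's `data` array carries an ISO 8601 `timestamp` field; `TraceSummary` and `countTracesSince` are hypothetical names.

```typescript
// Assumed minimal shape of one item in the trace list response.
interface TraceSummary {
  id: string;
  timestamp: string; // ISO 8601
}

// Counts traces recorded at or after `since`; unparseable timestamps are ignored.
function countTracesSince(traces: TraceSummary[], since: Date): number {
  const cutoff = since.getTime();
  return traces.filter((t) => Date.parse(t.timestamp) >= cutoff).length;
}

// Usage: the incident was resolved at 12:00 UTC; two of three traces are newer.
const resolvedAt = new Date("2025-01-01T12:00:00Z");
const sampleTraces: TraceSummary[] = [
  { id: "t1", timestamp: "2025-01-01T11:59:00Z" },
  { id: "t2", timestamp: "2025-01-01T12:05:00Z" },
  { id: "t3", timestamp: "2025-01-01T12:10:00Z" },
];
console.log(countTracesSince(sampleTraces, resolvedAt)); // 2
```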

Step 6: Post-Incident Review (P1/P2)

Document for post-mortem:

  1. Timeline: When detected, when resolved, total duration
  2. Impact: Traces lost, application impact, user impact
  3. Root cause: Why did the incident occur?
  4. Resolution: What fixed it?
  5. Prevention: What changes prevent recurrence?
  6. Action items: Improvements to implement

Escalation Path

| Level | Who | When |
|---|---|---|
| L1 | On-call engineer | All incidents -- run triage |
| L2 | Platform team lead | P1/P2 unresolved after 30 min |
| L3 | Langfuse support | Confirmed service-side issue |
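
The escalation rules can be expressed as a small decision function, for example to drive paging automation. This is a sketch under stated assumptions: `IncidentState` and `escalationLevel` are hypothetical, and "after 30 min" is interpreted as 30 minutes or more.

```typescript
type EscalationLevel = "L1" | "L2" | "L3";

// Hypothetical incident snapshot; field names are illustrative only.
interface IncidentState {
  severity: "P1" | "P2" | "P3" | "P4";
  minutesOpen: number;
  serviceSideConfirmed: boolean;
}

// Returns the highest level that should currently be involved.
function escalationLevel(state: IncidentState): EscalationLevel {
  if (state.serviceSideConfirmed) return "L3";
  const highSeverity = state.severity === "P1" || state.severity === "P2";
  if (highSeverity && state.minutesOpen >= 30) return "L2";
  return "L1";
}
```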

Langfuse support channels:

  • Status Page -- check first
  • Discord -- community support
  • GitHub Issues -- bug reports
  • Email support (enterprise customers)

Error Handling

| Issue | Immediate Fix | Permanent Fix |
|---|---|---|
| SDK crashes app | Set LANGFUSE_ENABLED=false | Wrap all tracing in try/catch |
| Lost traces | Flush more aggressively (small batches, short interval) | Add shutdown handlers |
| High latency | Disable sync flush | Use async-only patterns |
| Auth failures | Rotate and redeploy keys | Add key validation at startup |
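
Key validation at startup can be as simple as checking that the variables are set and look plausible before the SDK is initialized. The sketch below relies on Langfuse keys using the `pk-lf-` and `sk-lf-` prefixes; `validateLangfuseEnv` is a hypothetical helper, and a prefix check cannot detect keys that belong to the wrong project.

```typescript
// Returns a list of human-readable problems; an empty list means the config looks plausible.
function validateLangfuseEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  const publicKey = env["LANGFUSE_PUBLIC_KEY"] ?? "";
  const secretKey = env["LANGFUSE_SECRET_KEY"] ?? "";

  if (publicKey === "") {
    problems.push("LANGFUSE_PUBLIC_KEY is not set");
  } else if (publicKey.startsWith("sk-lf-")) {
    problems.push("LANGFUSE_PUBLIC_KEY holds a secret key (keys swapped?)");
  } else if (!publicKey.startsWith("pk-lf-")) {
    problems.push("LANGFUSE_PUBLIC_KEY does not start with pk-lf-");
  }

  if (secretKey === "") {
    problems.push("LANGFUSE_SECRET_KEY is not set");
  } else if (secretKey.startsWith("pk-lf-")) {
    problems.push("LANGFUSE_SECRET_KEY holds a public key (keys swapped?)");
  } else if (!secretKey.startsWith("sk-lf-")) {
    problems.push("LANGFUSE_SECRET_KEY does not start with sk-lf-");
  }

  // Same variable precedence as the triage script: LANGFUSE_BASE_URL, then LANGFUSE_HOST.
  const baseUrl = env["LANGFUSE_BASE_URL"] ?? env["LANGFUSE_HOST"];
  if (baseUrl !== undefined && !/^https?:\/\//.test(baseUrl)) {
    problems.push("Base URL must start with http:// or https://");
  }
  return problems;
}

// Usage at startup: log problems and fall back to disabled tracing instead of crashing.
// const problems = validateLangfuseEnv(process.env);
// if (problems.length > 0) console.warn("Langfuse config issues:", problems);
```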

Resources

Repository
jeremylongshore/claude-code-plugins-plus-skills