tessl install github:jeremylongshore/claude-code-plugins-plus-skills --skill langfuse-incident-runbook
github.com/jeremylongshore/claude-code-plugins-plus-skills
Troubleshoot and respond to Langfuse-related incidents and outages. Use when experiencing Langfuse outages, debugging production issues, or responding to LLM observability incidents. Trigger with phrases like "langfuse incident", "langfuse outage", "langfuse down", "langfuse production issue", "langfuse troubleshoot".
Step-by-step procedures for responding to Langfuse-related incidents.
| Severity | Description | Response Time | Escalation |
|---|---|---|---|
| P1 | Complete outage, no traces | 15 min | Immediate |
| P2 | Degraded, partial data loss | 1 hour | 4 hours |
| P3 | Slow/delayed traces | 4 hours | Next business day |
| P4 | Minor issues, workaround exists | 24 hours | Best effort |
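The severity table can also be encoded as data so that tooling can check response deadlines. The following is a minimal sketch, assuming the response times above are converted to minutes; `SEVERITY_POLICY` and `isResponseOverdue` are hypothetical names and not part of Langfuse or this skill.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Response targets taken from the severity table, expressed in minutes.
const SEVERITY_POLICY: Record<Severity, { description: string; responseMinutes: number }> = {
  P1: { description: "Complete outage, no traces", responseMinutes: 15 },
  P2: { description: "Degraded, partial data loss", responseMinutes: 60 },
  P3: { description: "Slow/delayed traces", responseMinutes: 240 },
  P4: { description: "Minor issues, workaround exists", responseMinutes: 1440 },
};

// Hypothetical helper: true when an incident has waited longer than its response target.
function isResponseOverdue(severity: Severity, minutesSinceReport: number): boolean {
  return minutesSinceReport > SEVERITY_POLICY[severity].responseMinutes;
}
```

For example, `isResponseOverdue("P1", 20)` reports an overdue P1, while the same 20 minutes is well within the P2 target.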
```bash
#!/bin/bash
# quick-diagnosis.sh
echo "=== Langfuse Quick Diagnosis ==="
echo "Time: $(date)"
echo ""

# 1. Check Langfuse status
echo "1. Langfuse Status:"
curl -s https://status.langfuse.com/api/v2/status.json | jq '.status.description'

# 2. Check API connectivity
echo ""
echo "2. API Connectivity:"
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://cloud.langfuse.com/api/public/health

# 3. Check authentication
# curl -u builds the Basic auth header itself, which avoids the line wrapping
# that `base64` applies to long key pairs on some systems.
echo ""
echo "3. Auth Test:"
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  "https://cloud.langfuse.com/api/public/traces?limit=1"

# 4. Check application health
echo ""
echo "4. Application Metrics:"
curl -s http://localhost:3000/api/metrics | grep langfuse | head -5
```

| Symptom | Likely Cause | Go To |
|---|---|---|
| No traces appearing | SDK not flushing | Section A |
| 401/403 errors | Authentication issue | Section B |
| High latency | Network/rate limits | Section C |
| Missing data | Partial failures | Section D |
| Complete outage | Langfuse service issue | Section E |
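The symptom table can be read as a simple triage function. The sketch below is illustrative only: the `Observation` fields, the order in which symptoms are checked, and the 2000 ms latency threshold are all assumptions rather than values from Langfuse or this runbook.

```typescript
type Section = "A" | "B" | "C" | "D" | "E";

// Hypothetical summary of what the quick-diagnosis checks returned.
interface Observation {
  healthStatus: number; // HTTP status from the health endpoint (0 if unreachable)
  authStatus: number; // HTTP status from the authenticated traces request
  latencyMs: number; // total request time for the health check
  tracesAppearing: boolean; // are any traces showing up at all?
  dataComplete: boolean; // are all expected spans and traces present?
}

// Maps an observation to the section to follow, or null if nothing looks wrong.
function triage(obs: Observation, slowThresholdMs = 2000): Section | null {
  if (obs.healthStatus === 0 || obs.healthStatus >= 500) return "E"; // service issue
  if (obs.authStatus === 401 || obs.authStatus === 403) return "B"; // authentication
  if (!obs.tracesAppearing) return "A"; // SDK not flushing
  if (obs.latencyMs > slowThresholdMs) return "C"; // network or rate limits
  if (!obs.dataComplete) return "D"; // partial failures
  return null;
}
```

Checking the service-level symptoms first is a design choice: an outage or a credential problem would also cause missing traces, so ruling those out early avoids debugging the SDK unnecessarily.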
Section A: No traces appearing

```typescript
// 1. Verify SDK is enabled
console.log("Langfuse enabled:", process.env.LANGFUSE_ENABLED !== "false");
console.log("Environment:", process.env.NODE_ENV);

// 2. Check for pending events (add this to your code temporarily).
// getLangfuse() is assumed to be your application's accessor for the shared client.
// pendingItems may not exist in your SDK version, hence the optional chaining.
const langfuse = getLangfuse();
console.log("Pending events:", langfuse.pendingItems?.length ?? "unknown");

// 3. Force flush and check for errors
try {
  await langfuse.flushAsync();
  console.log("Flush successful");
} catch (error) {
  console.error("Flush failed:", error);
}
```

Check shutdown handlers
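```typescript
// Ensure shutdown is registered so queued events are flushed before exit
process.on("beforeExit", async () => {
  await langfuse.shutdownAsync();
});
// "beforeExit" does not fire on signals such as SIGTERM, so handle those too
process.on("SIGTERM", async () => {
  await langfuse.shutdownAsync().finally(() => process.exit(0));
});
```

Reduce batch size temporarily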
```typescript
const langfuse = new Langfuse({
  flushAt: 1, // Send each event immediately
  flushInterval: 1000, // Flush at least once per second (ms)
});
```

Enable debug logging
```bash
DEBUG=langfuse* npm start
```

Section B: 401/403 errors

```bash
# 1. Verify environment variables
echo "Public key starts with: ${LANGFUSE_PUBLIC_KEY:0:10}"
echo "Secret key is set: ${LANGFUSE_SECRET_KEY:+yes}"
echo "Host: ${LANGFUSE_HOST:-https://cloud.langfuse.com}"

# 2. Test credentials directly (-u sends HTTP Basic auth and avoids
#    base64 line wrapping with long key pairs)
curl -v -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  "${LANGFUSE_HOST:-https://cloud.langfuse.com}/api/public/traces?limit=1"
```

Verify keys match project
Check for key rotation

Verify host URL

https://cloud.langfuse.com

Section C: High latency

```typescript
// Check flush timing
const start = Date.now();
await langfuse.flushAsync();
console.log(`Flush took ${Date.now() - start}ms`);

// Check batch sizes
console.log("Current batch size:", langfuse.pendingItems?.length);
```

```bash
# Network latency test
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://cloud.langfuse.com/api/public/health
```

For rate limits
```typescript
// Increase batching
const langfuse = new Langfuse({
  flushAt: 50, // Send events in batches of 50
  flushInterval: 10000, // Flush every 10 seconds (ms)
});
```

For network issues
Implement circuit breaker

```typescript
// Skips Langfuse calls after repeated failures so an outage cannot slow the app.
// Usage: await breaker.execute(() => langfuse.flushAsync());
class CircuitBreaker {
  private failures = 0;
  private lastFailure?: Date;
  private readonly threshold = 5; // consecutive failures before the circuit opens
  private readonly resetMs = 60000; // how long the circuit stays open (ms)

  async execute<T>(operation: () => Promise<T>): Promise<T | null> {
    if (this.isOpen()) {
      console.warn("Circuit breaker open, skipping Langfuse");
      return null;
    }
    try {
      const result = await operation();
      this.reset();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  // Open only while the failure threshold is reached and resetMs has not yet elapsed
  private isOpen(): boolean {
    if (this.failures < this.threshold) return false;
    if (!this.lastFailure) return false;
    return Date.now() - this.lastFailure.getTime() < this.resetMs;
  }

  private recordFailure() {
    this.failures++;
    this.lastFailure = new Date();
  }

  private reset() {
    this.failures = 0;
    this.lastFailure = undefined;
  }
}
```

Section D: Missing data

```typescript
// Check for errors in trace operations
const trace = langfuse.trace({ name: "test" });
console.log("Trace ID:", trace.id);

const span = trace.span({ name: "test-span" });
console.log("Span ID:", span.id);

// Verify end() is called
span.end({ output: { test: true } });
console.log("Span ended");

await langfuse.flushAsync();
console.log("Flushed");
```

Ensure all spans are ended
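```typescript
// doWork() stands in for your own async operation
async function tracedOperation() {
  const span = trace.span({ name: "operation" });
  try {
    return await doWork();
  } finally {
    span.end(); // Always end in finally
  }
}
```

Check for swallowed exceptions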
```typescript
try {
  await langfuse.flushAsync();
} catch (error) {
  // Don't swallow the error: log it for debugging
  console.error("Langfuse flush error:", error);
}
```

Section E: Complete outage

Check status page: https://status.langfuse.com
Enable fallback mode

```typescript
// Graceful degradation
const langfuse = new Langfuse({
  enabled: false, // Disable during outage
});
```

Queue events locally
```typescript
import * as fs from "node:fs";

// Store events to file during outage
const pendingEvents: any[] = [];

function queueEvent(event: any) {
  pendingEvents.push({
    ...event,
    timestamp: new Date().toISOString(),
  });
  if (pendingEvents.length > 1000) {
    // Write the buffered events to a timestamped file, then clear the buffer
    fs.writeFileSync(
      `langfuse-backup-${Date.now()}.json`,
      JSON.stringify(pendingEvents)
    );
    pendingEvents.length = 0;
  }
}
```

Monitor for recovery
```bash
# Watch status
watch -n 30 'curl -s https://status.langfuse.com/api/v2/status.json | jq .status'
```

| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | All incidents |
| L2 | Platform team lead | P1/P2 unresolved after 30 min |
| L3 | Langfuse support | Service-side issues |
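The escalation matrix can be sketched as a small lookup. This is an illustration under stated assumptions: `escalationLevels`, the `IncidentState` shape, and treating "service-side" as a boolean flag are hypothetical and not defined by this runbook or by Langfuse.

```typescript
type EscalationLevel = "L1" | "L2" | "L3";

// Hypothetical description of an ongoing incident.
interface IncidentState {
  severity: "P1" | "P2" | "P3" | "P4";
  minutesUnresolved: number;
  serviceSide: boolean; // true when the fault appears to be on the Langfuse side
}

// Returns every level that should be engaged according to the escalation table.
function escalationLevels(incident: IncidentState): EscalationLevel[] {
  const levels: EscalationLevel[] = ["L1"]; // on-call engineer handles all incidents
  const isHighSeverity = incident.severity === "P1" || incident.severity === "P2";
  if (isHighSeverity && incident.minutesUnresolved >= 30) levels.push("L2");
  if (incident.serviceSide) levels.push("L3");
  return levels;
}
```

For instance, a P1 that has been open for 45 minutes and traced to the Langfuse service would engage all three levels.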
For data export and retention, see langfuse-data-handling.