langchain-incident-runbook

Incident response procedures for LangChain production issues: provider outages, high error rates, latency spikes, and cost overruns. Trigger: "langchain incident", "langchain outage", "langchain production issue", "langchain emergency", "langchain down", "LLM provider outage".

LangChain Incident Runbook

Overview

Standard operating procedures for LangChain production incidents: provider outages, error rate spikes, latency degradation, memory issues, and cost overruns.

Severity Classification

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| SEV1 | Complete outage | 15 min | All LLM calls failing |
| SEV2 | Major degradation | 30 min | >50% error rate, >10s latency |
| SEV3 | Minor degradation | 2 hours | <10% errors, slow responses |
| SEV4 | Low impact | 24 hours | Intermittent issues, warnings |
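
The table can be turned into a small triage helper so that alerts arrive pre-classified. The following is a minimal sketch, not part of the original skill: `classifySeverity`, `HealthSnapshot`, and `RESPONSE_MINUTES` are hypothetical names, and the SEV3/SEV4 cutoffs (1% errors, 5s p95 latency) are assumptions, since the table only gives examples for those levels.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface HealthSnapshot {
  errorRate: number;    // fraction of failed LLM calls, 0..1
  p95LatencyMs: number; // 95th-percentile latency in milliseconds
}

// Response-time targets from the severity table, in minutes
const RESPONSE_MINUTES: Record<Severity, number> = {
  SEV1: 15,
  SEV2: 30,
  SEV3: 120,
  SEV4: 1440,
};

function classifySeverity(s: HealthSnapshot): Severity {
  // SEV1: every LLM call is failing
  if (s.errorRate >= 1) return "SEV1";
  // SEV2: >50% error rate or >10s latency (thresholds from the table)
  if (s.errorRate > 0.5 || s.p95LatencyMs > 10_000) return "SEV2";
  // SEV3: ASSUMED cutoffs; the table only says "<10% errors, slow responses"
  if (s.errorRate > 0.01 || s.p95LatencyMs > 5_000) return "SEV3";
  // SEV4: intermittent issues and warnings
  return "SEV4";
}
```

For example, `classifySeverity({ errorRate: 0.6, p95LatencyMs: 800 })` yields `"SEV2"`, and `RESPONSE_MINUTES.SEV2` gives the 30-minute target. Error rates between 10% and 50% fall into SEV3 here, which is a choice the table leaves open.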

Runbook 1: LLM Provider Outage

Detect

# Check provider status pages
curl -s https://status.openai.com/api/v2/status.json | jq '.status'
curl -s https://status.anthropic.com/api/v2/status.json | jq '.status'
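
The `jq '.status'` output can also be interpreted programmatically. This sketch assumes both endpoints return a Statuspage-style body of the form `{ status: { indicator, description } }`, where `indicator` is one of `none`, `minor`, `major`, or `critical`; verify that against the live responses. `interpretStatus` and `ProviderHealth` are hypothetical names, and failing over on `major` or `critical` is an assumed policy.

```typescript
// ASSUMED payload shape: Statuspage-style /api/v2/status.json
interface StatusPayload {
  status: { indicator: string; description: string };
}

interface ProviderHealth {
  provider: string;
  degraded: boolean;       // anything other than "none"
  shouldFailover: boolean; // assumed policy: fail over on "major" or "critical"
  description: string;
}

function interpretStatus(provider: string, payload: StatusPayload): ProviderHealth {
  const indicator = payload.status.indicator;
  return {
    provider,
    degraded: indicator !== "none",
    shouldFailover: indicator === "major" || indicator === "critical",
    description: payload.status.description,
  };
}
```

Feed it the parsed JSON from either curl command, for example `interpretStatus("openai", JSON.parse(body))`.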

Diagnose

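import { ChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";

// Send a minimal request to each provider and record whether it succeeds
async function diagnoseProviders() {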
  const results: Record<string, string> = {};

  try {
    const openai = new ChatOpenAI({ model: "gpt-4o-mini", timeout: 10000 });
    await openai.invoke("ping");
    results.openai = "OK";
  } catch (e: any) {
    results.openai = `FAIL: ${e.message.slice(0, 100)}`;
  }

  try {
    const anthropic = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });
    // Per-call timeout so a hung provider cannot stall the check
    await anthropic.invoke("ping", { timeout: 10000 });
    results.anthropic = "OK";
  } catch (e: any) {
    results.anthropic = `FAIL: ${e.message.slice(0, 100)}`;
  }

  console.table(results);
  return results;
}

Mitigate

// Enable fallback — switch to healthy provider
const primary = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 1,
  timeout: 5000,
});

const fallback = new ChatAnthropic({
  model: "claude-sonnet-4-20250514",
  maxRetries: 1,
});

const resilientModel = primary.withFallbacks({
  fallbacks: [fallback],
});

// All chains using resilientModel auto-failover
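
To make the failover behaviour concrete, here is a library-free sketch of the pattern: try the primary, then each fallback in order, and rethrow the last error only if every candidate fails. It illustrates the idea and is not LangChain's implementation of `withFallbacks`; `withFallbackChain` and `Invoke` are hypothetical names.

```typescript
type Invoke<I, O> = (input: I) => Promise<O>;

// Wrap a primary call and its fallbacks into one function with the same signature
function withFallbackChain<I, O>(
  primary: Invoke<I, O>,
  fallbacks: Invoke<I, O>[],
): Invoke<I, O> {
  return async (input: I): Promise<O> => {
    let lastError: unknown;
    for (const candidate of [primary, ...fallbacks]) {
      try {
        return await candidate(input);
      } catch (err) {
        // Remember the failure and move on to the next candidate
        lastError = err;
      }
    }
    // Every candidate failed: surface the most recent error
    throw lastError;
  };
}
```

Keeping `maxRetries` low on the primary, as in the snippet above, matters for the same reason here: the fallback is only reached after the primary has given up.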

Recover

  1. Monitor provider status page for resolution
  2. Verify primary provider works: await diagnoseProviders()
  3. Remove fallback config (or keep it for resilience)
  4. Document incident timeline for post-mortem

Runbook 2: High Error Rate

Detect

# Check LangSmith for error spike
# https://smith.langchain.com/o/YOUR_ORG/projects/YOUR_PROJECT/runs?filter=error:true

# Check application logs
grep -ci "error" /var/log/app/langchain.log          # count of matching lines
grep -i "error" /var/log/app/langchain.log | tail -5  # five most recent matches

Diagnose

// Common error patterns
const ERROR_CAUSES: Record<string, string> = {
  "RateLimitError":     "API quota exceeded -> reduce concurrency",
  "AuthenticationError": "API key invalid -> check secrets",
  "Timeout":            "Provider slow -> increase timeout",
  "OutputParserException": "LLM output format changed -> check prompts",
  "ValidationError":    "Schema mismatch -> update Zod schemas",
  "ContextLengthExceeded": "Input too long -> truncate or chunk",
};
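
A lookup table like this is most useful when it is applied automatically to caught errors. The sketch below repeats the table so that it runs on its own and adds a hypothetical `classifyError` helper; matching on the error's `name` and `message` by substring is an assumption about how these errors surface in your stack.

```typescript
const ERROR_CAUSES: Record<string, string> = {
  RateLimitError: "API quota exceeded -> reduce concurrency",
  AuthenticationError: "API key invalid -> check secrets",
  Timeout: "Provider slow -> increase timeout",
  OutputParserException: "LLM output format changed -> check prompts",
  ValidationError: "Schema mismatch -> update Zod schemas",
  ContextLengthExceeded: "Input too long -> truncate or chunk",
};

// Map a caught error to a likely cause by substring match on name and message
function classifyError(err: unknown): string {
  const name = err instanceof Error ? err.name : "";
  const message = err instanceof Error ? err.message : String(err);
  for (const [pattern, cause] of Object.entries(ERROR_CAUSES)) {
    if (name.includes(pattern) || message.includes(pattern)) return cause;
  }
  return "Unknown -> inspect the full trace";
}
```

Counting the results of `classifyError` across recent failures shows quickly whether a spike has one dominant cause.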

Mitigate

// 1. Reduce load
// Lower maxConcurrency on batch operations

// 2. Enable caching for repeated queries
// Note: this Map is unbounded; clear or cap it if memory grows (see Runbook 5)
const cache = new Map<string, unknown>();
async function withCache(chain: any, input: any) {
  const key = JSON.stringify(input);
  if (cache.has(key)) return cache.get(key);
  const result = await chain.invoke(input);
  cache.set(key, result);
  return result;
}

// 3. Enable fallback model
const model = primary.withFallbacks({ fallbacks: [fallback] });
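
Step 1 (reducing load) comes down to capping how many requests are in flight at once. The following is a library-free sketch of that idea for code paths that do not go through a batch API with its own concurrency setting; `mapWithConcurrency` is a hypothetical helper, not a LangChain function.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight; results keep input order
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

During an incident, lowering `limit` is a quick way to back off a rate-limited provider without changing the calling code. A rejected `worker` call rejects the whole batch in this sketch, so wrap the worker if partial results are acceptable.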

Runbook 3: Latency Spike

Detect

# Prometheus query
histogram_quantile(0.95, rate(langchain_llm_latency_seconds_bucket[5m])) > 5

Diagnose

// Measure per-component latency
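// MetricsCallback is a custom callback handler (not a LangChain built-in) that your codebase must define
const tracer = new MetricsCallback();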
await chain.invoke({ input: "test" }, { callbacks: [tracer] });
console.table(tracer.getReport());
// Check: is it the LLM, retriever, or tool that's slow?
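
`MetricsCallback` is not defined in this runbook. If no such handler exists yet, a rough alternative is to time each stage by hand. The sketch below assumes you can wrap the retriever, LLM, and tool calls individually; `StageTimer` is a hypothetical helper and does not hook into LangChain callbacks.

```typescript
interface StageTiming {
  stage: string;
  ms: number;
}

class StageTimer {
  private readonly rows: StageTiming[] = [];

  // Run `fn`, record how long it took under `stage`, and pass its result through
  async time<T>(stage: string, fn: () => Promise<T>): Promise<T> {
    const start = Date.now();
    try {
      return await fn();
    } finally {
      // Recorded even when `fn` throws, so failed stages still show up
      this.rows.push({ stage, ms: Date.now() - start });
    }
  }

  // Slowest stage first
  getReport(): StageTiming[] {
    return [...this.rows].sort((a, b) => b.ms - a.ms);
  }
}
```

Usage would look like `const docs = await timer.time("retriever", () => retriever.invoke(query))`, followed by `console.table(timer.getReport())`.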

Mitigate

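  1. Switch to a faster model, e.g. gpt-4o-mini instead of gpt-4o (roughly 200ms vs 400ms time to first token; these figures are approximate and vary with load, so measure your own)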
  2. Enable streaming to reduce perceived latency
  3. Enable caching for repeated queries
  4. Reduce context length (shorter prompts)

Runbook 4: Cost Overrun

Detect

# Check OpenAI usage dashboard
# https://platform.openai.com/usage

Mitigate

// 1. Emergency model downgrade
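// gpt-4o ($2.50/1M input tokens) -> gpt-4o-mini ($0.15/1M input tokens) = ~17x cheaper on input
// Verify against current pricing before relying on these numbers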

// 2. Enable budget enforcement
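// BudgetEnforcer is a custom callback handler (not a LangChain built-in) that your codebase must define
const budget = new BudgetEnforcer(50.0); // $50 daily limit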
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  callbacks: [budget],
});

// 3. Enable aggressive caching
// (see langchain-cost-tuning skill)
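
The core of budget enforcement is simple accounting: convert token counts to dollars, accumulate, and refuse further calls past the limit. Below is a minimal, library-free sketch of that logic. `DailyBudget` and `Pricing` are hypothetical names, prices are passed in rather than hard-coded, and wiring the class into a LangChain callback is left out.

```typescript
interface Pricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

class DailyBudget {
  private spentUsd = 0;
  private readonly limitUsd: number;
  private readonly pricing: Pricing;

  constructor(limitUsd: number, pricing: Pricing) {
    this.limitUsd = limitUsd;
    this.pricing = pricing;
  }

  // Add one call's usage; throws once cumulative spend exceeds the limit
  record(inputTokens: number, outputTokens: number): number {
    const cost =
      (inputTokens * this.pricing.inputPerMillion +
        outputTokens * this.pricing.outputPerMillion) / 1_000_000;
    this.spentUsd += cost;
    if (this.spentUsd > this.limitUsd) {
      throw new Error(
        `Budget exceeded: $${this.spentUsd.toFixed(2)} of $${this.limitUsd.toFixed(2)}`,
      );
    }
    return this.spentUsd;
  }

  // Call at the start of each day (scheduling is up to the caller)
  reset(): void {
    this.spentUsd = 0;
  }
}
```

Because `record` runs after a call completes, the call that crosses the limit has already been paid for; checking remaining budget before each call is a stricter variant.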

Runbook 5: Memory/OOM Issues

Detect

# Check process memory
ps aux --sort=-%mem | head -5

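# Node.js heap stats
# Note: this starts a fresh process, so it shows a baseline rather than the running app;
# for the live service, log process.memoryUsage() from inside the application
node -e "console.log(process.memoryUsage())"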

Mitigate

  1. Clear caches: reset in-memory caches
  2. Reduce batch sizes: lower maxConcurrency
  3. Use streaming instead of accumulating full responses
  4. Restart pods: kubectl rollout restart deployment/langchain-api
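
Clearing caches only helps until they fill up again, so an unbounded in-memory `Map` (like the one in Runbook 2) is worth capping. The following is a sketch of a small least-recently-used cache with an explicit `clear()`; `BoundedCache` is a hypothetical name, and the right `maxEntries` depends on your response sizes.

```typescript
class BoundedCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark as most recently used (Map preserves insertion order)
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  // Mitigation step 1: drop everything at once
  clear(): void {
    this.entries.clear();
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With a cap in place, step 1 becomes a safety valve rather than a routine fix.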

Incident Response Checklist

During Incident

  • Acknowledge in incident channel
  • Classify severity (SEV1-4)
  • Check provider status pages
  • Run diagnostic script
  • Apply mitigation (fallback/cache/throttle)
  • Communicate status to stakeholders
  • Document timeline

Post-Incident

  • Verify full recovery
  • Schedule post-mortem (within 48h)
  • Write incident report
  • Create follow-up tickets
  • Update monitoring/alerting rules
  • Update this runbook if needed

Resources

  • OpenAI Status: https://status.openai.com
  • Anthropic Status: https://status.anthropic.com
  • LangSmith: https://smith.langchain.com
  • PagerDuty Best Practices

Next Steps

Use langchain-debug-bundle for detailed evidence collection during incidents.

Repository
jeremylongshore/claude-code-plugins-plus-skills