CtrlK
BlogDocsLog inGet started
Tessl Logo

firecrawl-incident-runbook

Execute FireCrawl incident response procedures with triage, mitigation, and postmortem. Use when responding to FireCrawl-related outages, investigating errors, or running post-incident reviews for FireCrawl integration failures. Trigger with phrases like "firecrawl incident", "firecrawl outage", "firecrawl down", "firecrawl on-call", "firecrawl emergency", "firecrawl broken".

Install with Tessl CLI

npx tessl i github:jeremylongshore/claude-code-plugins-plus-skills --skill firecrawl-incident-runbook
What are skills?

94

Does it follow best practices?

Validation for skill structure

SKILL.md
Review
Evals

FireCrawl Incident Runbook

Overview

Rapid incident response procedures for FireCrawl-related outages.

Prerequisites

  • Access to FireCrawl dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

LevelDefinitionResponse TimeExamples
P1Complete outage< 15 minFireCrawl API unreachable
P2Degraded service< 1 hourHigh latency, partial failures
P3Minor impact< 4 hoursWebhook delays, non-critical errors
P4No user impactNext business dayMonitoring gaps

Quick Triage

# 1. Check FireCrawl status
curl -s https://status.firecrawl.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.firecrawl'

# 3. Check error rate (last 5 min)
curl -s localhost:9090/api/v1/query?query=rate(firecrawl_errors_total[5m])

# 4. Recent error logs
kubectl logs -l app=firecrawl-integration --since=5m | grep -i error | tail -20

Decision Tree

FireCrawl API returning errors?
├─ YES: Is status.firecrawl.com showing incident?
│   ├─ YES → Wait for FireCrawl to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.

Immediate Actions by Error Type

401/403 - Authentication

# Verify API key is set
kubectl get secret firecrawl-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in FireCrawl dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic firecrawl-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/firecrawl-integration

429 - Rate Limited

# Check rate limit headers
curl -v https://api.firecrawl.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/firecrawl-integration RATE_LIMIT_MODE=queue

# Long-term: Contact FireCrawl for limit increase

500/503 - FireCrawl Errors

# Enable graceful degradation
kubectl set env deployment/firecrawl-integration FIRECRAWL_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor FireCrawl status for resolution

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: FireCrawl Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]

External (Status Page)

FireCrawl Integration Issue

We're experiencing issues with our FireCrawl integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

# Generate debug bundle
./scripts/firecrawl-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=firecrawl-integration --since=1h > incident-logs.txt

# Capture metrics
curl "localhost:9090/api/v1/query_range?query=firecrawl_errors_total&start=2h" > metrics.json

Postmortem Template

## Incident: FireCrawl [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is FireCrawl-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

IssueCauseSolution
Can't reach status pageNetwork issueUse mobile or VPN
kubectl failsAuth expiredRe-authenticate
Metrics unavailablePrometheus downCheck backup metrics
Secret rotation failsPermission deniedEscalate to admin

Examples

One-Line Health Check

curl -sf https://api.yourapp.com/health | jq '.services.firecrawl.status' || echo "UNHEALTHY"

Resources

  • FireCrawl Status Page
  • FireCrawl Support

Next Steps

For data handling, see firecrawl-data-handling.

Repository
jeremylongshore/claude-code-plugins-plus-skills
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.