Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, or working on incident management, chaos engineering, toil reduction, or capacity planning.
98
100%
Does it follow best practices?
Impact
96%
1.07x
Average score across 6 eval scenarios
Passed
No known issues
SLO/SLI definition and error budget policy
Quantitative SLO targets
100%
100%
User impact justification
100%
100%
Error budget calculation
100%
100%
4xx counted as good events
100%
100%
SLO review checklist coverage
100%
100%
Tiered budget policy
100%
100%
Feature freeze on exhaustion
100%
100%
Velocity balance stated
100%
100%
Multiple SLO types
100%
100%
30-day measurement window
100%
100%
Prometheus/PromQL queries
100%
100%
Reliability impact explanation
100%
100%
Golden signals monitoring and burn rate alerting
All four golden signals
100%
100%
Latency percentiles
100%
100%
Fast burn rate threshold
80%
100%
Medium burn rate threshold
100%
100%
Multi-window burn rate detection
100%
100%
Runbook links in alerts
50%
50%
Runbook content present
100%
100%
Prometheus recording rules
100%
100%
Alert severity labels
100%
100%
Page-worthy criteria respected
87%
100%
Reliability impact explanation
100%
100%
Saturation threshold defined
100%
100%
Blameless postmortem and toil reduction planning
Blameless language
87%
100%
Postmortem impact section
75%
87%
UTC timeline present
100%
100%
Root cause section
100%
100%
Action items with owners
62%
100%
What went well / could improve
100%
100%
Toil measurement present
100%
100%
ROI-based prioritization
70%
100%
50% toil ceiling acknowledged
37%
37%
Automation plan produced
100%
100%
Runbook or remediation steps
40%
50%
Reliability impact explanation
60%
70%
Chaos experiment design and safety constraints
Hypothesis defined
100%
100%
Blast radius scoped
100%
100%
Rollback plan present
87%
100%
Success criteria defined
62%
100%
Error rate abort threshold
50%
100%
Latency abort threshold
50%
100%
RTO/RPO verification
100%
91%
Recovery validation end-to-end
100%
100%
Rollback always executed
75%
100%
Reliability impact explanation
100%
87%
Multiple experiment scenarios
100%
100%
Capacity planning and graceful degradation
Capacity forecast produced
100%
100%
Forecast horizon specified
100%
100%
Scaling recommendation included
100%
100%
Deployment decision tied to error budget
100%
100%
Budget-based deployment gates
100%
100%
Graceful degradation design
100%
100%
Exception process defined
100%
100%
Threshold for scale-up action
100%
100%
Reliability impact explanation
100%
100%
Monitoring config produced
100%
100%
Automation script produced
100%
100%
Incident response roles and post-resolution workflow
SEV1-SEV4 classification
100%
100%
Incident commander role
100%
100%
Communication lead role
100%
100%
On-call engineer role
37%
100%
Post-resolution monitoring window
50%
100%
Postmortem scheduled within 48 hours
80%
100%
Retry logic in self-healing
100%
100%
Escalation on remediation failure
100%
100%
Severity-differentiated response
100%
100%
Reliability impact explanation
75%
50%
Runbook or remediation steps
100%
100%
5b76101