sre-engineer

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.

Score: 98 (1.07x)

Quality: 100%. Does it follow best practices?

Impact: 96% (1.07x). Average score across 6 eval scenarios.

Security by Snyk: Passed. No known issues.

Evaluation results

Define Reliability Targets for an E-Commerce Checkout API

SLO/SLI definition and error budget policy

Scenario score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Quantitative SLO targets | 100% | 100% |
| User impact justification | 100% | 100% |
| Error budget calculation | 100% | 100% |
| 4xx counted as good events | 100% | 100% |
| SLO review checklist coverage | 100% | 100% |
| Tiered budget policy | 100% | 100% |
| Feature freeze on exhaustion | 100% | 100% |
| Velocity balance stated | 100% | 100% |
| Multiple SLO types | 100% | 100% |
| 30-day measurement window | 100% | 100% |
| Prometheus/PromQL queries | 100% | 100% |
| Reliability impact explanation | 100% | 100% |

Set Up Production Monitoring for a User-Facing API Service

Golden signals monitoring and burn rate alerting

Scenario score: 96% (3% improvement with context)

| Criteria | Without context | With context |
| --- | --- | --- |
| All four golden signals | 100% | 100% |
| Latency percentiles | 100% | 100% |
| Fast burn rate threshold | 80% | 100% |
| Medium burn rate threshold | 100% | 100% |
| Multi-window burn rate detection | 100% | 100% |
| Runbook links in alerts | 50% | 50% |
| Runbook content present | 100% | 100% |
| Prometheus recording rules | 100% | 100% |
| Alert severity labels | 100% | 100% |
| Page-worthy criteria respected | 87% | 100% |
| Reliability impact explanation | 100% | 100% |
| Saturation threshold defined | 100% | 100% |

Post-Incident Review and Operations Improvement Plan

Blameless postmortem and toil reduction planning

Scenario score: 86% (10% improvement with context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Blameless language | 87% | 100% |
| Postmortem impact section | 75% | 87% |
| UTC timeline present | 100% | 100% |
| Root cause section | 100% | 100% |
| Action items with owners | 62% | 100% |
| What went well / could improve | 100% | 100% |
| Toil measurement present | 100% | 100% |
| ROI-based prioritization | 70% | 100% |
| 50% toil ceiling acknowledged | 37% | 37% |
| Automation plan produced | 100% | 100% |
| Runbook or remediation steps | 40% | 50% |
| Reliability impact explanation | 60% | 70% |

Game Day: Resilience Validation for Order Processing Service

Chaos experiment design and safety constraints

Scenario score: 98% (14% improvement with context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Hypothesis defined | 100% | 100% |
| Blast radius scoped | 100% | 100% |
| Rollback plan present | 87% | 100% |
| Success criteria defined | 62% | 100% |
| Error rate abort threshold | 50% | 100% |
| Latency abort threshold | 50% | 100% |
| RTO/RPO verification | 100% | 91% |
| Recovery validation end-to-end | 100% | 100% |
| Rollback always executed | 75% | 100% |
| Reliability impact explanation | 100% | 87% |
| Multiple experiment scenarios | 100% | 100% |

Pre-Launch Reliability Review: Consumer Analytics Platform

Capacity planning and graceful degradation

Scenario score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Capacity forecast produced | 100% | 100% |
| Forecast horizon specified | 100% | 100% |
| Scaling recommendation included | 100% | 100% |
| Deployment decision tied to error budget | 100% | 100% |
| Budget-based deployment gates | 100% | 100% |
| Graceful degradation design | 100% | 100% |
| Exception process defined | 100% | 100% |
| Threshold for scale-up action | 100% | 100% |
| Reliability impact explanation | 100% | 100% |
| Monitoring config produced | 100% | 100% |
| Automation script produced | 100% | 100% |

On-Call Readiness: Incident Response Playbook and Self-Healing Automation

Incident response roles and post-resolution workflow

Scenario score: 96% (10% improvement with context)

| Criteria | Without context | With context |
| --- | --- | --- |
| SEV1-SEV4 classification | 100% | 100% |
| Incident commander role | 100% | 100% |
| Communication lead role | 100% | 100% |
| On-call engineer role | 37% | 100% |
| Post-resolution monitoring window | 50% | 100% |
| Postmortem scheduled within 48 hours | 80% | 100% |
| Retry logic in self-healing | 100% | 100% |
| Escalation on remediation failure | 100% | 100% |
| Severity-differentiated response | 100% | 100% |
| Reliability impact explanation | 75% | 50% |
| Runbook or remediation steps | 100% | 100% |
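
The headline impact figure (96%, "Average score across 6 eval scenarios") can be cross-checked against the six per-scenario scores shown with each scenario title above (100%, 96%, 86%, 98%, 100%, 96%). The sketch below assumes the headline is a simple unweighted mean of those scores; the page does not state how it is computed, so treat this as a sanity check rather than the registry's actual method. The variable names are illustrative.

```python
# Per-scenario scores as listed on this page, in page order.
# Assumption: the headline "Impact" figure is their unweighted mean.
scenario_scores = [100, 96, 86, 98, 100, 96]

average = sum(scenario_scores) / len(scenario_scores)
print(f"Average across {len(scenario_scores)} scenarios: {average:.0f}%")
```

Under that assumption the mean comes out to 96, which matches the reported impact score.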

Repository: jeffallan/claude-skills

Evaluated
Agent: Claude Code
Model: Claude Sonnet 4.6
