Configures and maintains cloud infrastructure, monitoring systems, and automated operations. Sets up Prometheus/Grafana alerting, writes Terraform IaC for AWS/GCP/Azure resources, manages auto-scaling groups, load balancers, VPCs, and databases, and implements encrypted backup pipelines with S3 offload. Troubleshoots deployment failures, optimizes resource right-sizing, enforces security hardening (SOC2/ISO27001), and produces capacity planning and cost analysis reports. Use when the user asks about server setup, infrastructure provisioning, monitoring and alerting configuration, Terraform or CloudFormation, CI/CD pipeline issues, Kubernetes or container orchestration, cloud cost optimization, backup and disaster recovery, or database performance and scaling.
Expert infrastructure specialist for system reliability, performance optimization, cost management, and security compliance. Focuses on cloud architecture, monitoring, IaC automation, and backup/recovery.
Detailed, ready-to-use configs live in separate reference files — consult them for full resource definitions:
references/prometheus-config.yml: Prometheus scrape jobs (node exporter, app :8080, PostgreSQL exporter), evaluation intervals, Alertmanager endpoint, and alert rules for CPU >80%, memory >90%, disk >85%, and service-down conditions.
references/terraform-aws.tf: S3-backed remote state, VPC with public/private subnets across AZs, launch template, auto-scaling group with ELB health checks, and RDS PostgreSQL instance with encryption, Performance Insights, and automated backups.
references/backup.sh: Encrypted (gpg --cipher-algo AES256, SHA512 key derivation) database (pg_dump) and filesystem (tar) backup script with S3 upload (STANDARD_IA), integrity verification, 30-day local retention, and Slack webhook notifications. Secrets via environment variables, never hard-coded.
references/infra-report.md: Health report template covering uptime, MTTR, P95 latency, cost breakdown by category, security/compliance status, and a prioritized action-item table.

Follow this sequence for any infrastructure change (provisioning, scaling, security hardening, config update).
# Capture the pre-change state and baseline metrics
terraform show -json > pre_change_state.json
aws cloudwatch get-metric-statistics ... # baseline CPU/memory/error rates
kubectl get pods --all-namespaces # if applicable

# Validate and plan
terraform fmt && terraform validate
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary | jq '.resource_changes[].change.actions'
# Review all resource changes; confirm no unintended deletions

# Staging first
terraform apply tfplan.binary
./scripts/health_check.sh staging
curl -f https://staging.example.com/health || exit 1
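# Production only after staging passes.
# A saved plan records the state it was computed against, so create a fresh
# plan for production instead of reusing the staging one. Applying a saved
# plan never prompts, so -auto-approve is not needed.
TF_WORKSPACE=production terraform plan -out=tfplan.production.binary
TF_WORKSPACE=production terraform apply tfplan.production.binary

# Confirm all targets healthy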
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
| jq '.TargetHealthDescriptions[].TargetHealth.State'
# Expected: all "healthy"
# Verify no services are down in Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=up' \
| jq '.data.result[] | select(.value[1]=="0") | .metric.job'
# Expected: empty
# Check error rate baseline unchanged
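# (-G with --data-urlencode keeps curl from treating the {} and [] in the query as URL globs)
curl -s -G 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])'

# Revert Terraform state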
terraform workspace select <previous>
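# Planning options such as -target cannot be combined with a saved plan file,
# and an old plan goes stale once state changes. With the previous
# configuration checked out, create and apply a fresh targeted plan instead.
terraform plan -target=<affected_resource> -out=rollback_tfplan.binary
terraform apply rollback_tfplan.binary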
# ASG rollback: revert launch template version
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name app-asg \
--launch-template "LaunchTemplateName=app-template,Version=<previous_version>"
# Force instance refresh to previous config
aws autoscaling start-instance-refresh --auto-scaling-group-name app-asg \
--preferences '{"MinHealthyPercentage":90}'
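
The review and verification checks in the sequence above are described as things to read by eye ("confirm no unintended deletions", "Expected: all healthy"). A minimal sketch of how they might be turned into pass/fail gates follows. The function names `plan_has_deletes`, `all_targets_healthy`, and `retry` are hypothetical helpers, not part of the skill's reference files, and the sketch assumes `jq -r` is used upstream to emit one value per line.

```bash
#!/usr/bin/env bash
# Hypothetical gate helpers for the change sequence above (illustrative only).

# Succeeds if any planned action is "delete".
# Expects one action per line on stdin, for example from:
#   terraform show -json tfplan.binary \
#     | jq -r '.resource_changes[].change.actions[]'
plan_has_deletes() {
  # Read all input (no -q) so upstream commands are not cut off mid-write.
  grep -x 'delete' >/dev/null
}

# Succeeds only if at least one state is read and every state is "healthy".
# Expects one state per line on stdin, for example from:
#   aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
#     | jq -r '.TargetHealthDescriptions[].TargetHealth.State'
all_targets_healthy() {
  local state count=0
  while IFS= read -r state || [ -n "$state" ]; do
    [ "$state" = "healthy" ] || return 1
    count=$((count + 1))
  done
  # An empty target list is treated as a failure, not as "all healthy".
  [ "$count" -gt 0 ]
}

# Runs a command up to <attempts> times, sleeping <delay> seconds between tries.
# Usage: retry <attempts> <delay> <command> [args...]
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example wiring (commented out; the attempt count and delay are assumptions):
#   retry 5 10 curl -fsS https://staging.example.com/health || exit 1
```

Treating an empty target list as a failure is a deliberate choice in this sketch: a target group with no registered targets would otherwise pass the "all healthy" check vacuously.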