Configures and maintains cloud infrastructure, monitoring systems, and automated operations. Sets up Prometheus/Grafana alerting, writes Terraform IaC for AWS/GCP/Azure resources, manages auto-scaling groups, load balancers, VPCs, and databases, and implements encrypted backup pipelines with S3 offload. Troubleshoots deployment failures, optimizes resource right-sizing, enforces security hardening (SOC2/ISO27001), and produces capacity planning and cost analysis reports. Use when the user asks about server setup, infrastructure provisioning, monitoring and alerting configuration, Terraform or CloudFormation, CI/CD pipeline issues, Kubernetes or container orchestration, cloud cost optimization, backup and disaster recovery, or database performance and scaling.
99
100%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Passed
No known issues
Expert infrastructure specialist for system reliability, performance optimization, cost management, and security compliance. Focuses on cloud architecture, monitoring, IaC automation, and backup/recovery.
Detailed, ready-to-use configs live in separate reference files — consult them for full resource definitions:
references/prometheus-config.yml — Prometheus scrape jobs (node exporter, app :8080, PostgreSQL exporter), evaluation intervals, Alertmanager endpoint, and alert rules for CPU >80%, memory >90%, disk >85%, and service-down conditions.references/terraform-aws.tf — S3-backed remote state, VPC with public/private subnets across AZs, launch template, auto-scaling group with ELB health checks, and RDS PostgreSQL instance with encryption, Performance Insights, and automated backups.references/backup.sh — Encrypted (gpg --cipher-algo AES256, SHA512 key derivation) database (pg_dump) and filesystem (tar) backup script with S3 upload (STANDARD_IA), integrity verification, 30-day local retention, and Slack webhook notifications. Secrets via environment variables — never hard-coded.references/infra-report.md — Health report template covering uptime, MTTR, P95 latency, cost breakdown by category, security/compliance status, and a prioritized action-item table.Follow this sequence for any infrastructure change (provisioning, scaling, security hardening, config update).
terraform show -json > pre_change_state.json
aws cloudwatch get-metric-statistics ... # baseline CPU/memory/error rates
kubectl get pods --all-namespaces # if applicableterraform fmt && terraform validate
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary | jq '.resource_changes[].change.actions'
# Review all resource changes; confirm no unintended deletions# Staging first
terraform apply tfplan.binary
./scripts/health_check.sh staging
curl -f https://staging.example.com/health || exit 1
# Production only after staging passes
TF_WORKSPACE=production terraform apply -auto-approve tfplan.binary# Confirm all targets healthy
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
| jq '.TargetHealthDescriptions[].TargetHealth.State'
# Expected: all "healthy"
# Verify no services are down in Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=up' \
| jq '.data.result[] | select(.value[1]=="0") | .metric.job'
# Expected: empty
# Check error rate baseline unchanged
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])"# Revert Terraform state
terraform workspace select <previous>
terraform apply -target=<affected_resource> previous_tfplan.binary
# ASG rollback: revert launch template version
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name app-asg \
--launch-template "LaunchTemplateName=app-template,Version=<previous_version>"
# Force instance refresh to previous config
aws autoscaling start-instance-refresh --auto-scaling-group-name app-asg \
--preferences '{"MinHealthyPercentage":90}'010799b
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.