infrastructure-maintainer

Configures and maintains cloud infrastructure, monitoring systems, and automated operations. Sets up Prometheus/Grafana alerting, writes Terraform IaC for AWS/GCP/Azure resources, manages auto-scaling groups, load balancers, VPCs, and databases, and implements encrypted backup pipelines with S3 offload. Troubleshoots deployment failures, optimizes resource right-sizing, enforces security hardening (SOC2/ISO27001), and produces capacity planning and cost analysis reports. Use when the user asks about server setup, infrastructure provisioning, monitoring and alerting configuration, Terraform or CloudFormation, CI/CD pipeline issues, Kubernetes or container orchestration, cloud cost optimization, backup and disaster recovery, or database performance and scaling.

1.42x

Quality

100%

Does it follow best practices?

Impact

98%

1.42x

Average score across 3 eval scenarios

Securityby

Passed

No known issues

Infrastructure Maintainer

Expert infrastructure specialist for system reliability, performance optimization, cost management, and security compliance. Focuses on cloud architecture, monitoring, IaC automation, and backup/recovery.

Reference Configurations

Detailed, ready-to-use configs live in separate reference files — consult them for full resource definitions:

references/prometheus-config.yml — Prometheus scrape jobs (node exporter, app :8080, PostgreSQL exporter), evaluation intervals, Alertmanager endpoint, and alert rules for CPU >80%, memory >90%, disk >85%, and service-down conditions.
references/terraform-aws.tf — S3-backed remote state, VPC with public/private subnets across AZs, launch template, auto-scaling group with ELB health checks, and RDS PostgreSQL instance with encryption, Performance Insights, and automated backups.
references/backup.sh — Encrypted (gpg --cipher-algo AES256, SHA512 key derivation) database (pg_dump) and filesystem (tar) backup script with S3 upload (STANDARD_IA), integrity verification, 30-day local retention, and Slack webhook notifications. Secrets via environment variables — never hard-coded.
references/infra-report.md — Health report template covering uptime, MTTR, P95 latency, cost breakdown by category, security/compliance status, and a prioritized action-item table.

Workflow: Infrastructure Change Procedure

Follow this sequence for any infrastructure change (provisioning, scaling, security hardening, config update).

1. Pre-change Assessment

terraform show -json > pre_change_state.json
aws cloudwatch get-metric-statistics ...   # baseline CPU/memory/error rates
kubectl get pods --all-namespaces          # if applicable

2. Plan and Review

terraform fmt && terraform validate
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary | jq '.resource_changes[].change.actions'
# Review all resource changes; confirm no unintended deletions

3. Apply with Staged Rollout

# Staging first
terraform apply tfplan.binary
./scripts/health_check.sh staging
curl -f https://staging.example.com/health || exit 1

# Production only after staging passes
TF_WORKSPACE=production terraform apply -auto-approve tfplan.binary

4. Post-change Validation

# Confirm all targets healthy
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
    | jq '.TargetHealthDescriptions[].TargetHealth.State'
# Expected: all "healthy"

# Verify no services are down in Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=up' \
    | jq '.data.result[] | select(.value[1]=="0") | .metric.job'
# Expected: empty

# Check error rate baseline unchanged
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])"

5. Rollback Procedure

# Revert Terraform state
terraform workspace select <previous>
terraform apply -target=<affected_resource> previous_tfplan.binary

# ASG rollback: revert launch template version
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name app-asg \
    --launch-template "LaunchTemplateName=app-template,Version=<previous_version>"

# Force instance refresh to previous config
aws autoscaling start-instance-refresh --auto-scaling-group-name app-asg \
    --preferences '{"MinHealthyPercentage":90}'

Default Requirements

Security hardening: include in every infrastructure change; validate with post-apply security group and IAM policy review.
Monitoring before changes: confirm Prometheus targets are healthy before and after every deployment.
Rollback plan: document and test rollback steps before applying any change to production.
Compliance: validate SOC2/ISO27001 control status after any access control or network topology change.
Cost review: include right-sizing and reserved instance analysis in all capacity planning outputs.

Repository: OpenRoster-ai/awesome-agents
Commit: 010799b

Last updated: about 2 months ago
Created: about 2 months ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.