Use when you need to manage infrastructure across multiple servers interactively via wsh — deploying applications, configuring services, managing packages, performing rolling updates, and handling the prompts and judgment calls that declarative tools cannot. Examples: "deploy this application across 10 servers with health checks between each", "upgrade packages across the fleet and handle diverse prompts", "inspect and modify configuration across servers", "roll back a failed deployment".
IMPORTANT: EXECUTION CONTEXT. This skill describes what to do -- domain patterns and decision-making. It does NOT describe how to call the API.

- If you have `wsh_*` tools (check your toolkit for `wsh_send_input`, `wsh_get_screen`, etc.): use them directly. Operation names in this skill generally map to tool names (e.g., "send input" → `wsh_send_input`). When in doubt, list your available `wsh_*` tools.
- If you do NOT have `wsh_*` tools: you are in HTTP/curl fallback mode. DO NOT GUESS endpoints or CLI subcommands. Load the full API reference first: search your workspace for `skills/core/` and read `SKILL.md`. It contains every endpoint with working curl examples and a bootstrap sequence.
- Quick bootstrap: `curl -sf --unix-socket ${XDG_RUNTIME_DIR:-/tmp}/wsh/default.http.sock http://localhost/health` -- if that fails, run `wsh server -L agent-$$ --ephemeral &` and retry.
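In scripted form, the quick bootstrap is a probe-then-start loop. A minimal Python sketch -- the curl and wsh invocations are taken verbatim from the note above; the retry count and sleep interval are assumptions:

```python
import os
import subprocess
import time

# Probe the default wsh socket; start an ephemeral server if it is down.
sock = os.path.join(os.environ.get("XDG_RUNTIME_DIR", "/tmp"), "wsh", "default.http.sock")
probe = ["curl", "-sf", "--unix-socket", sock, "http://localhost/health"]

if subprocess.run(probe, capture_output=True).returncode != 0:
    # agent-$$ in the shell note becomes agent-<pid> here
    subprocess.Popen(["wsh", "server", "-L", f"agent-{os.getpid()}", "--ephemeral"])
    for _ in range(5):  # retry budget is an assumption
        time.sleep(1)
        if subprocess.run(probe, capture_output=True).returncode == 0:
            break
```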
Ansible, Puppet, and Chef exist because you can't interactively operate 50 machines at once. They solve this with declarative configs and idempotent operations -- you describe the desired state, the tool converges toward it. This works brilliantly for the 80% of infrastructure work that's predictable and repeatable.
But 20% of real infrastructure work isn't predictable. Interactive installers that ask questions nobody anticipated. Package upgrades that present merge conflicts in config files. Services that fail in ways that require investigation, not just a restart. Approval prompts that need human judgment. Diagnostics that require poking around, reading logs, trying things.
wsh changes the equation. An AI agent can sit at every terminal simultaneously -- reading screens, handling prompts, making decisions. This enables an imperative, interactive model for the work that declarative tools can't handle. Not a replacement for Ansible. A complement for the cases where Ansible isn't enough.
Use infrastructure-ops when:

- The operation is interactive or judgment-heavy: deployments with health gates between batches, upgrades that present prompts, diagnostics, rollbacks (see the examples at the top).

Don't use infrastructure-ops when:

- A simple fire-and-forget command with no prompts to handle would do -- use `parallel-ssh` or `ansible -m shell` instead.

You need a federated wsh cluster: a hub server with backends registered for each target machine. Each backend is a wsh server running on the target host.
list servers
# Expect: hub (local), plus one backend per target host

If backends aren't registered yet, add them:
add server at address http://10.0.1.10:8080
add server at address http://10.0.1.11:8080
add server at address http://10.0.1.12:8080
# Wait for all to become healthy
loop:
list servers
if all target servers are healthy: break
wait briefly
  retry

See the wsh:cluster-orchestration skill for details on server registration, health monitoring, and authentication.
You also need the patterns from wsh:drive-process (the send/wait/read loop) and wsh:multi-session (parallel session management, tagging, fan-out). This skill composes on top of both.
The fundamental infrastructure pattern: run the same operation on every host, but handle each host's response individually. This is different from parallel-ssh because you react to what each host does -- you don't just fire and forget.
The safe default. Operate on one host at a time, verify success before moving to the next:
hosts = ["web-1", "web-2", "web-3", "web-4", "web-5"]
results = {}
for host in hosts:
create session "op-{host}" on server "{host}"
send to "op-{host}": sudo systemctl restart myapp\n
wait for idle on "op-{host}"
read screen from "op-{host}"
if password prompt detected:
send to "op-{host}": {sudo_password}\n
wait for idle on "op-{host}"
read screen from "op-{host}"
if success:
results[host] = "ok"
else:
results[host] = "failed"
# Decide: continue to next host, or stop?
kill session "op-{host}"
report results

Sequential is slower but safer. If the operation breaks the first host, you haven't touched the other four.
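If you script this loop rather than driving it step by step, the control flow is a guarded iteration. A hedged Python sketch -- `create_session`, `send_input`, `wait_idle`, `read_screen`, and `kill_session` are hypothetical wrappers you would bind to your actual `wsh_*` tools or HTTP fallback, and the success check is a placeholder:

```python
# Hypothetical bindings -- wire these to your wsh_* tools or curl fallback.
def create_session(name: str, server: str) -> None: ...
def send_input(name: str, text: str) -> None: ...
def wait_idle(name: str, timeout_ms: int = 10_000) -> None: ...
def read_screen(name: str) -> str: return ""
def kill_session(name: str) -> None: ...

def run_sequential(hosts: list[str], command: str,
                   sudo_password: str | None = None) -> dict[str, str]:
    """One host at a time; stop the sweep on the first failure."""
    results: dict[str, str] = {}
    for host in hosts:
        session = f"op-{host}"
        create_session(session, server=host)
        send_input(session, command + "\n")
        wait_idle(session)
        screen = read_screen(session)
        if sudo_password and "password" in screen.lower():
            send_input(session, sudo_password + "\n")
            wait_idle(session)
            screen = read_screen(session)
        # Success detection is application-specific; substitute a real check.
        results[host] = "failed" if "failed" in screen.lower() else "ok"
        kill_session(session)
        if results[host] == "failed":
            break  # safe default: don't touch the remaining hosts
    return results
```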
When the operation is safe to run everywhere simultaneously -- read-only commands, status checks, log collection:
hosts = ["web-1", "web-2", "web-3", "web-4", "web-5"]
# Launch all at once
for host in hosts:
create session "check-{host}" on server "{host}", tagged: fleet-check
send to "check-{host}": systemctl status myapp\n
# Poll for completion
completed = {}
while not all hosts completed:
wait for idle on sessions tagged "fleet-check" (timeout 2000ms)
# Returns whichever session settled first
read screen from that session
record result
mark as completed
# Use last_session + last_generation to advance
# Clean up
for host in hosts:
kill session "check-{host}"
report results

The real reason this pattern exists. When you run apt upgrade across 10 servers, each one may present different prompts: config-file conflicts ("keep or replace"), service restart questions, reboot-required notices. A declarative tool must handle all these cases upfront with configuration flags. An agent reads each screen and responds with judgment:
read screen from "upgrade-{host}"
if "keep or replace" in screen:
# Inspect the diff, decide based on whether we've
# customized this config
send: d\n # show diff first
wait, read
if config has local customizations:
send: N\n # keep ours
else:
send: Y\n # take the new version
if "restart now" in screen:
# Check if this is a load-balanced service that can restart
send: y\n
if "reboot required" in screen:
# Note for later -- don't reboot mid-upgrade-sweep
  results[host].needs_reboot = true

This per-host judgment is the core value proposition. The agent adapts to what each host actually presents.
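When the same screen-matching logic repeats across hosts, a small rule table keeps it auditable. A Python sketch -- the patterns and responses are illustrative defaults drawn from the handling above, not an exhaustive or authoritative list:

```python
# (pattern, response) rules checked against the current screen, in order.
PROMPT_RULES = [
    ("keep or replace", "d\n"),                  # show the diff before deciding
    ("restart now", "y\n"),                      # assumes a load-balanced service
    ("Daemons using outdated libraries", "\n"),  # accept defaults
]

def respond_to_prompt(screen: str) -> str | None:
    """Return the keystrokes to send, or None if the prompt is unrecognized."""
    for pattern, response in PROMPT_RULES:
        if pattern in screen:
            return response
    return None  # unknown prompt: stop and apply judgment instead of guessing
```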
Deploy to a fleet incrementally, verifying health between each batch. If something goes wrong, stop before it affects the whole fleet.
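The pattern below batches hosts with a `chunk` helper; it is a one-liner. A minimal Python sketch:

```python
def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a host list into fixed-size batches; the last batch may be short."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# chunk(["web-1", "web-2", "web-3", "web-4", "web-5", "web-6"], 2)
# -> [["web-1", "web-2"], ["web-3", "web-4"], ["web-5", "web-6"]]
```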
hosts = ["web-1", "web-2", "web-3", "web-4", "web-5", "web-6"]
batch_size = 2
failed = []
for batch in chunk(hosts, batch_size):
# Deploy this batch in parallel
for host in batch:
create session "deploy-{host}" on server "{host}", tagged: deploy
send to "deploy-{host}": cd /opt/myapp && ./deploy.sh v2.1.0\n
# Wait for all in this batch to finish
for host in batch:
wait for idle on "deploy-{host}" (timeout 30000ms)
read screen from "deploy-{host}"
if deployment prompt detected:
handle prompt (see per-host prompt handling)
wait for idle on "deploy-{host}"
read screen from "deploy-{host}"
if deployment failed:
failed.append(host)
# Health check this batch before proceeding
for host in batch:
create session "health-{host}" on server "{host}", tagged: health
send to "health-{host}": curl -sf http://localhost:8080/health\n
wait for idle on "health-{host}"
read screen from "health-{host}"
if health check failed:
failed.append(host)
kill session "health-{host}"
# Clean up deploy sessions for this batch
for host in batch:
kill session "deploy-{host}"
# Gate: stop if this batch had failures
if failed is not empty:
report: "Stopping rolling deploy. Failed hosts: {failed}"
report: "Remaining hosts not deployed: {remaining}"
break
# Optional: wait for the new version to soak
wait 30 seconds
# Re-check health after soak period
for host in batch:
create session "soak-{host}" on server "{host}"
send to "soak-{host}": curl -sf http://localhost:8080/health\n
wait for idle, read screen
if unhealthy:
failed.append(host)
kill session "soak-{host}"
if failed is not empty:
report: "Soak check failed. Stopping."
break
if failed is empty:
report: "Rolling deploy complete. All hosts healthy."A special case of rolling deployment: deploy to one host first, soak for longer, then proceed if healthy:
canary = hosts[0]
rest = hosts[1:]
# Deploy to canary
create session "deploy-canary" on server "{canary}"
send to "deploy-canary": ./deploy.sh v2.1.0\n
wait for idle, read screen, handle prompts
kill session "deploy-canary"
# Extended health check on canary
for i in range(5):
wait 60 seconds
create session "canary-check" on server "{canary}"
send to "canary-check": curl -sf http://localhost:8080/health && echo "HEALTHY" || echo "UNHEALTHY"\n
wait for idle, read screen
if "UNHEALTHY" in screen:
report: "Canary failed health check. Aborting."
initiate rollback on canary
stop
kill session "canary-check"
# Canary is healthy after 5 minutes. Deploy the rest.
proceed with rolling deploy on rest

Inspect current configuration, apply changes, and verify the result -- interactively, with the ability to inspect and judge at each step.
The fundamental configuration pattern:
for host in hosts:
create session "config-{host}" on server "{host}"
# 1. Inspect current state
send to "config-{host}": cat /etc/myapp/config.yaml\n
wait for idle, read screen
record current_config[host]
# 2. Decide whether to modify
if config needs updating:
# Back up first
send: cp /etc/myapp/config.yaml /etc/myapp/config.yaml.bak\n
wait for idle
# Apply the change (using sed, patch, or writing a new file)
send: sed -i 's/max_connections: 100/max_connections: 200/' /etc/myapp/config.yaml\n
wait for idle
# 3. Verify the change
send: cat /etc/myapp/config.yaml\n
wait for idle, read screen
if change is correct:
results[host] = "updated"
else:
# Restore backup
send: cp /etc/myapp/config.yaml.bak /etc/myapp/config.yaml\n
wait for idle
results[host] = "reverted -- change didn't apply correctly"
else:
results[host] = "no change needed"
kill session "config-{host}"Compare configurations across the fleet to find hosts that have drifted from the expected state:
expected_config = "..." # known-good configuration
drifted = []
for host in hosts:
create session "audit-{host}" on server "{host}", tagged: audit
send to "audit-{host}": cat /etc/myapp/config.yaml\n
for host in hosts:
wait for idle on "audit-{host}"
read screen from "audit-{host}"
if screen does not match expected_config:
drifted.append(host)
# Optionally: read scrollback for the full config if it's long
kill session "audit-{host}"
if drifted:
report: "Configuration drift detected on: {drifted}"
  # Inspect each drifted host for details

Some services have their own interactive configuration tools -- mysql_secure_installation, dpkg-reconfigure, certbot, database migration wizards. These can't be driven by declarative tools, but an agent can handle them:
create session "certbot-{host}" on server "{host}"
send to "certbot-{host}": sudo certbot --nginx\n
wait for idle, read screen
# Certbot asks a series of questions
if "Enter email address" in screen:
send: admin@example.com\n
wait for idle, read screen
if "Terms of Service" in screen:
send: A\n # Agree
wait for idle, read screen
if "Which names would you like to activate HTTPS for" in screen:
send: 1\n # Select the first domain
wait for idle, read screen
# Verify the certificate was issued
send: sudo certbot certificates\n
wait for idle, read screen
verify certificate is present and valid
kill session "certbot-{host}"Package upgrades are one of the most common infrastructure tasks, and one of the most prone to interactive surprises.
hosts = ["web-1", "web-2", "web-3", "db-1", "db-2"]
for host in hosts:
create session "upgrade-{host}" on server "{host}"
# Update package lists
send to "upgrade-{host}": sudo apt update\n
wait for idle on "upgrade-{host}" (timeout 30000ms)
read screen
if password prompt:
send: {sudo_password}\n
wait for idle, read screen
# Check what would be upgraded (dry run)
send: apt list --upgradable\n
wait for idle, read screen
record pending_upgrades[host]
# Review: show the operator what each host needs
report pending upgrades per host
# At this point, decide whether to proceed
# Execute upgrades
for host in hosts:
send to "upgrade-{host}": sudo apt upgrade -y\n
wait for idle (timeout 60000ms)
read screen
# Handle the prompts that -y doesn't suppress
loop until shell prompt returns:
if "Configuration file" in screen:
# Package wants to overwrite a config file
# Decide based on whether we maintain custom configs
send: N\n # keep current version
wait for idle, read screen
if "Daemons using outdated libraries" in screen:
send: \n # accept defaults
wait for idle, read screen
if "restart services" in screen:
send: y\n
wait for idle, read screen
if shell prompt detected:
break
wait for idle (timeout 5000ms)
read screen
# Verify
send: apt list --upgradable\n
wait for idle, read screen
if packages still pending:
results[host] = "partial upgrade"
else:
results[host] = "fully upgraded"
kill session "upgrade-{host}"Different hosts may use different package managers. Detect and adapt:
send to "detect-{host}": which apt yum dnf apk 2>/dev/null\n
wait for idle, read screen
if "apt" in screen:
update_cmd = "sudo apt update && sudo apt upgrade -y"
elif "dnf" in screen:
update_cmd = "sudo dnf upgrade -y"
elif "yum" in screen:
update_cmd = "sudo yum update -y"
elif "apk" in screen:
update_cmd = "sudo apk update && sudo apk upgrade"
send to "upgrade-{host}": {update_cmd}\nWhen you need to install a specific package with version pinning:
for host in hosts:
create session "install-{host}" on server "{host}", tagged: install
send to "install-{host}": sudo apt install nginx=1.24.0-1~jammy\n
wait for idle, read screen
if "Do you want to continue?" in screen:
send: Y\n
wait for idle, read screen
if "unable to locate package" in screen or "has no installation candidate" in screen:
results[host] = "package not available"
elif "is already the newest version" in screen:
results[host] = "already installed"
else:
# Verify installation
send: nginx -v\n
wait for idle, read screen
if "1.24.0" in screen:
results[host] = "installed"
else:
results[host] = "unexpected version"
kill session "install-{host}"Post-operation checks are what separate a careful deployment from a reckless one. Always verify after making changes.
create session "health-{host}" on server "{host}"
# Check the service is running
send to "health-{host}": systemctl is-active myapp\n
wait for idle, read screen
service_running = "active" in screen
# Check the endpoint responds
send: curl -sf -o /dev/null -w '%{http_code}' http://localhost:8080/health\n
wait for idle, read screen
endpoint_healthy = "200" in screen
# Check recent logs for errors
send: journalctl -u myapp --since '5 minutes ago' --no-pager | grep -i error | tail -5\n
wait for idle, read screen
recent_errors = screen contains error lines
# Check resource usage
send: systemctl show myapp --property=MemoryCurrent,CPUUsageNSec\n
wait for idle, read screen
record resource metrics
kill session "health-{host}"
health[host] = {
running: service_running,
endpoint: endpoint_healthy,
errors: recent_errors,
resources: metrics
}

For critical deployments, go beyond surface-level checks:
# Check the application version matches what we deployed
send: curl -sf http://localhost:8080/version\n
wait for idle, read screen
verify version matches expected
# Check database connectivity
send: curl -sf http://localhost:8080/health/db\n
wait for idle, read screen
# Check dependent services
send: curl -sf http://localhost:8080/health/redis\n
wait for idle, read screen
# Run a smoke test
send: curl -sf -X POST http://localhost:8080/api/test -d '{"ping":"pong"}'\n
wait for idle, read screen
verify expected response

Collect health from all hosts and present a summary:
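The summary formatting is mechanical. A minimal Python helper producing the report shape used below:

```python
from collections import Counter

def summarize(health_report: dict[str, str]) -> str:
    """Render '<healthy>/<total> hosts healthy' plus the list of failures."""
    counts = Counter(health_report.values())
    lines = [f"{counts.get('OK', 0)}/{len(health_report)} hosts healthy"]
    failed = sorted(h for h, status in health_report.items() if status != "OK")
    if failed:
        lines.append("Unhealthy hosts: " + ", ".join(failed))
    return "\n".join(lines)
```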
health_report = {}
for host in hosts:
create session "health-{host}" on server "{host}", tagged: health
send to "health-{host}": curl -sf http://localhost:8080/health && echo "OK" || echo "FAIL"\n
for host in hosts:
wait for idle on "health-{host}"
read screen from "health-{host}"
health_report[host] = "OK" if "OK" in screen else "FAIL"
kill session "health-{host}"
healthy_count = count of "OK" in health_report
report: "{healthy_count}/{total} hosts healthy"
if any "FAIL":
report: "Unhealthy hosts: {list of failed hosts}"When something fails mid-fleet, you need a plan. The rollback strategy depends on what went wrong and how far you got.
Before deploying, ensure you can roll back:
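The prep pattern below records `previous_version[host]` by reading the `current` symlink. Extracting the version is one regex; a Python sketch that assumes a `/opt/myapp/releases/<version>` layout:

```python
import re

def extract_version(readlink_output: str) -> str | None:
    """Pull the release id from `readlink /opt/myapp/current` output,
    e.g. '/opt/myapp/releases/v2.0.3' -> 'v2.0.3' (path layout assumed)."""
    match = re.search(r"/releases/([^\s/]+)", readlink_output)
    return match.group(1) if match else None
```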
for host in deploy_hosts:
create session "prep-{host}" on server "{host}"
# Record current version
send to "prep-{host}": readlink /opt/myapp/current\n
wait for idle, read screen
previous_version[host] = extract version from screen
# Ensure previous release artifacts exist
send: ls /opt/myapp/releases/\n
wait for idle, read screen
verify previous version directory exists
kill session "prep-{host}"When a rolling deploy fails partway through:
# failed_hosts: hosts where deployment failed
# deployed_hosts: hosts where deployment succeeded
# remaining_hosts: hosts not yet touched
# Step 1: Fix the failed hosts
for host in failed_hosts:
create session "rollback-{host}" on server "{host}"
send to "rollback-{host}": cd /opt/myapp && ./rollback.sh {previous_version[host]}\n
wait for idle, read screen, handle prompts
verify service is healthy
kill session "rollback-{host}"
# Step 2: Decide about successfully deployed hosts
# Option A: Roll them back too (full rollback)
for host in deployed_hosts:
create session "rollback-{host}" on server "{host}"
send to "rollback-{host}": cd /opt/myapp && ./rollback.sh {previous_version[host]}\n
wait for idle, read screen, handle prompts
verify service is healthy
kill session "rollback-{host}"
# Option B: Leave them on the new version (partial deploy)
# Only do this if the new version is working correctly on
# those hosts and the old/new versions are compatible
# Step 3: Remaining hosts were never touched -- no action needed
# Step 4: Verify fleet health
run fleet-wide health check (see Health Verification)

Not every failure requires a full rollback. Consider whether the failed hosts can simply be fixed forward and retried, and whether the old and new versions can safely coexist (see Option B above).
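The decision maps cleanly onto the three host groups above. A Python sketch encoding Options A and B -- the `versions_compatible` flag is a judgment input, not something the code can determine:

```python
def rollback_plan(failed: list[str], deployed: list[str], remaining: list[str],
                  versions_compatible: bool) -> dict[str, str]:
    """Assign an action per host after a partial deployment failure."""
    plan = {host: "rollback" for host in failed}   # Step 1: fix the failed hosts
    for host in deployed:                          # Step 2: Option A or Option B
        plan[host] = "keep" if versions_compatible else "rollback"
    for host in remaining:                         # Step 3: never touched
        plan[host] = "untouched"
    return plan
```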
Running apt upgrade on 50 hosts simultaneously is tempting but dangerous. If the upgrade has an unexpected prompt or failure mode, you'll have 50 hosts in an unknown state. Start sequential, move to small batches once you've seen the operation succeed on the first few hosts.
The same command produces different prompts on different hosts depending on installed packages, OS version, configuration, and state. Don't assume that because host 1 had no prompts, host 2 won't either. Always read the screen and handle what's actually there.
Infrastructure operations can create dozens of sessions -- deploy, health check, rollback, audit sessions for each host. Kill sessions as soon as they've served their purpose. A session is a running process on the backend machine. Leaving 50 orphaned sessions across your fleet wastes resources and clutters the session list.
Many infrastructure operations require elevated privileges. Handle sudo password prompts consistently. If you're working across many hosts, consider whether NOPASSWD rules for specific commands are appropriate to avoid interactive password entry on every host.
Infrastructure changes are auditable events. For each operation, record which hosts were touched, what commands ran, which prompts appeared and how they were answered, and the final outcome. Read scrollback from each session before killing it. This is your audit trail.
wsh + AI is powerful for interactive, judgment-heavy operations. But Ansible is better for the predictable, repeatable majority: declarative configs, idempotent operations, converging whole fleets toward a desired state.
Use wsh for the 20% that requires interaction and judgment. Use Ansible for the 80% that doesn't. They complement each other.
The worst time to discover your rollback procedure doesn't work is during a failed deployment at 2am. Test rollback on a staging fleet first. Verify that the previous version's artifacts exist, that the rollback script works, and that the health checks pass after rolling back.
If you're deploying across 100 hosts, don't set a batch size of 50. A failed batch of 50 means half your fleet is down. Start small. Increase batch size only after the first few batches succeed. A common progression: 1, 2, 5, 10, 25, 50.
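That progression is easy to encode so the ramp-up is deliberate rather than improvised. A Python sketch; the ramp values are the ones suggested above, with the final size repeating until the fleet is covered:

```python
def ramped_batches(hosts: list[str], ramp=(1, 2, 5, 10, 25, 50)):
    """Yield batches that grow along the ramp: 1 host, then 2, then 5, ...

    The last ramp size repeats until every host has been yielded.
    """
    index, step = 0, 0
    while index < len(hosts):
        size = ramp[min(step, len(ramp) - 1)]
        yield hosts[index:index + size]
        index += size
        step += 1
```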