Use when you need to manage sessions across multiple wsh servers in a federated cluster. Examples: "distribute builds across several machines", "create sessions on a specific backend", "monitor health across a cluster of servers", "coordinate work across server boundaries".
IMPORTANT: EXECUTION CONTEXT

This skill describes what to do — domain patterns and decision-making. It does NOT describe how to call the API.
- If you have `wsh_*` tools (check your toolkit for `wsh_send_input`, `wsh_get_screen`, etc.): use them directly. Operation names in this skill generally map to tool names (e.g., "send input" → `wsh_send_input`). When in doubt, list your available `wsh_*` tools.
- If you do NOT have `wsh_*` tools: you are in HTTP/curl fallback mode. DO NOT GUESS endpoints or CLI subcommands. Load the full API reference first: search your workspace for `skills/core/` and read `SKILL.md`. It contains every endpoint with working curl examples and a bootstrap sequence.
- Quick bootstrap: `curl -sf --unix-socket ${XDG_RUNTIME_DIR:-/tmp}/wsh/default.http.sock http://localhost/health` — if that fails: `wsh server -L agent-$$ --ephemeral &` and retry.
Sometimes one machine isn't enough. You need to spread work across multiple servers — running builds on beefy hardware, tests in isolated environments, deployments on production hosts. Cluster orchestration lets you manage sessions across a fleet of wsh servers from a single hub.
Use cluster orchestration when:
Don't use cluster orchestration when:
A cluster has one hub server and one or more backend servers. The hub is the server you talk to — it receives your requests and either handles them locally or forwards them to the right backend. Backends are regular wsh servers that the hub knows about.
You interact exclusively with the hub. The hub handles routing, health monitoring, and aggregation transparently. From your perspective, it looks like one server with sessions spread across multiple machines.
Every server in the cluster has a hostname — a unique identifier. Backends acquire their hostname automatically when they connect to the hub (the hub queries each backend's identity). The hub's own hostname is its system hostname or a configured override.
You use hostnames to target specific servers when creating sessions or querying state.
The hub continuously monitors each backend's health. A backend can be in one of three states:
Only healthy backends participate in session operations. The hub automatically reconnects to backends that become unavailable, so transient network issues resolve on their own.
Before creating sessions on remote servers, check what's available:
```
list servers
```

This returns every server in the cluster, including the hub itself. Each entry shows the hostname, health status, and role. The hub always appears as "local" with health "healthy".
Register a new backend server with the hub. Backend addresses require a scheme (`http://` or `https://`) and may include a path prefix for reverse-proxy deployments:

```
add server at address http://10.0.1.10:8080
add server at address https://proxy.example.com/wsh-node-1
```

The hub immediately begins connecting to the backend. It starts in "connecting" state and transitions to "healthy" once the connection is established and the backend's hostname is resolved.
If the backend requires authentication, provide a token:

```
add server at address http://10.0.1.10:8080 with token "secret"
```

Get detailed status for a single server by hostname:

```
get server "prod-1"
```

Deregister a backend when it's no longer needed:

```
remove server "prod-1"
```

This disconnects from the backend and removes it from the cluster. Sessions that were running on that backend become inaccessible through the hub (they continue running on the backend itself).
After adding a backend, wait for it to become healthy before creating sessions on it. Poll the server list until the backend's health transitions from "connecting" to "healthy":

```
loop:
  list servers
  if target server is healthy: break
  wait briefly
  retry
```

This typically takes a few seconds. Don't proceed with session creation until the backend is healthy — requests to unhealthy backends will fail.
Target a specific server by hostname when creating a session:
create session "build" on server "prod-1"The hub forwards the creation request to the named backend. The session runs on that backend's hardware, in its environment, with its resources. All subsequent operations on that session are automatically routed through the hub to the right backend.
Sessions created without a server target run on the hub itself:
create session "local-work"This is exactly the same as single-server operation. The hub handles it locally without involving any backend.
Consider these factors when deciding where to create sessions:
Tags work transparently across server boundaries. Sessions on different backends can share the same tags, and tag-based queries aggregate results from all healthy servers.
Spread parallel work across the cluster using tags to track it:
create session "build-api" on server "build-1", tagged: ci
create session "build-web" on server "build-2", tagged: ci
create session "test-e2e" on server "test-1", tagged: ci
send each session its respective command
wait for idle across sessions tagged "ci"The idle detection races across all tagged sessions regardless of which server they're on. The first to settle is returned.
```
list sessions
```

Without a server filter, the session list aggregates across all healthy backends plus the hub. Each session in the response includes a `server` field indicating which server it lives on.

```
list sessions tagged "ci"
```

Tag filtering also works across the full cluster. Only sessions matching the tag are returned, from any server.

```
list sessions on server "build-1"
```

To see sessions on a specific backend only, filter by server.
Once a session exists, all operations work the same regardless of where it lives. The hub routes requests automatically:
send input to "build-api": cargo build\n
wait for idle on "build-api"
read screen from "build-api"
kill session "build-api"You don't need to remember which server a session is on. The hub tracks this mapping and routes transparently.
The server-level idle detection races across all sessions in the cluster:
```
wait for idle across all sessions (timeout 2000ms)
```

Returns whichever session becomes idle first, including its name and the server it's running on. Use `last_session` and `last_generation` to avoid re-returning the same session.
```
wait for idle across sessions tagged "build" (timeout 2000ms)
```

This is the most common pattern for distributed fan-out. Tag all related work, then poll idle across the group.
If you need to check just one server's sessions:
```
list sessions on server "build-1"
for each session:
  wait for idle
  read screen
  check results
```

When one stage must complete before the next begins:
```
# Stage 1: Build on the build server
create session "build" on server "build-1", tagged: pipeline
send to "build": make release\n
wait for idle on "build" (timeout 5000ms)
read screen from "build"
# verify success

# Stage 2: Test on the test server
create session "test" on server "test-1", tagged: pipeline
send to "test": ./run-tests.sh\n
wait for idle on "test" (timeout 5000ms)
read screen from "test"
# verify success

# Stage 3: Deploy on the deploy server
create session "deploy" on server "deploy-1", tagged: pipeline
send to "deploy": ./deploy.sh\n
...
```

Each stage runs on a different server but follows the same send/wait/read/decide loop.
When a backend becomes unavailable:
Check health before critical operations:
```
list servers
if target server is unavailable:
  fall back to another server or report the failure
```

Design for partial failure:
When fanning out across multiple backends, some may fail while others succeed. Collect results from the successful ones and handle failures individually rather than treating any failure as a total failure.
```
results = {}
for each session in the fan-out:
  try:
    wait for idle
    read screen
    results[session] = success
  catch server unavailable:
    results[session] = failed
report partial results
```

Use tags for recovery:
If a backend fails mid-workflow, you can recreate the affected sessions on a different backend with the same tags:
```
# Original session on failed backend
# create session "build" on server "build-1", tagged: ci

# Recovery: recreate on another backend
create session "build-retry" on server "build-2", tagged: ci
send to "build-retry": (same command)
```

Sessions are owned by the backend they run on. If you remove a backend from the cluster, its sessions continue running on that machine — they just become unreachable through the hub. If the backend process exits, its sessions end.
The hub doesn't migrate sessions. If a backend goes down and its sessions are lost, you need to recreate them on another backend.
Distribution adds latency and complexity. Every request to a remote session goes through the hub to the backend and back. If all your work can run on one machine, use multi-session on a single server.
Don't assume backends are always available. Check health before starting critical workflows, and design for graceful degradation when backends fail.
Remote sessions consume resources on backend machines. Clean up after yourself — don't leave orphaned sessions running on backends. The hub won't automatically kill sessions when you remove a backend.
Backends may require authentication tokens. Ensure tokens are configured correctly when adding backends. Without proper authentication, the hub won't be able to connect.
When the hub has an `[ip_access]` section in its federation config, backend addresses are checked against the blocklist and allowlist at registration time. Backends whose resolved IPs fall outside the allowed ranges are rejected. There is no hardcoded blocklist; the operator owns the threat model.
Every server in the cluster must have a unique hostname. If two backends have the same hostname, the second registration will be rejected. Configure unique hostnames for each backend if the system hostnames collide.