Use when you need to manage sessions across multiple wsh servers in a federated cluster. Examples: "distribute builds across several machines", "create sessions on a specific backend", "monitor health across a cluster of servers", "coordinate work across server boundaries".
IMPORTANT: EXECUTION CONTEXT

This skill describes what to do — domain patterns and decision-making. It does NOT describe how to call the API.
- If you have `wsh_*` tools (check your toolkit for `wsh_send_input`, `wsh_get_screen`, etc.): use them directly. Operation names in this skill generally map to tool names (e.g., "send input" → `wsh_send_input`). When in doubt, list your available `wsh_*` tools.
- If you do NOT have `wsh_*` tools: you are in HTTP/curl fallback mode. DO NOT GUESS endpoints or CLI subcommands. Load the full API reference first: search your workspace for `skills/core/` and read `SKILL.md`. It contains every endpoint with working curl examples and a bootstrap sequence.
- Quick bootstrap: `curl -sf --unix-socket ${XDG_RUNTIME_DIR:-/tmp}/wsh/default.http.sock http://localhost/health` — if that fails: `wsh server -L agent-$$ --ephemeral &` and retry.
Sometimes one machine isn't enough. You need to spread work across multiple servers — running builds on beefy hardware, tests in isolated environments, deployments on production hosts. Cluster orchestration lets you manage sessions across a fleet of wsh servers from a single hub.
Use cluster orchestration when:
Don't use cluster orchestration when:
A cluster has one hub server and one or more backend servers. The hub is the server you talk to — it receives your requests and either handles them locally or forwards them to the right backend. Backends are regular wsh servers that the hub knows about.
You interact exclusively with the hub. The hub handles routing, health monitoring, and aggregation transparently. From your perspective, it looks like one server with sessions spread across multiple machines.
Every server in the cluster has a hostname — a unique identifier. Backends acquire their hostname automatically when they connect to the hub (the hub queries each backend's identity). The hub's own hostname is its system hostname or a configured override.
You use hostnames to target specific servers when creating sessions or querying state.
The hub continuously monitors each backend's health. A backend can be in one of three states:
Only healthy backends participate in session operations. The hub automatically reconnects to backends that become unavailable, so transient network issues resolve on their own.
Before creating sessions on remote servers, check what's available:
```
list servers
```

This returns every server in the cluster, including the hub itself. Each entry shows the hostname, health status, and role. The hub always appears as "local" with health "healthy".
Register a new backend server with the hub. Backend addresses require a scheme (`http://` or `https://`) and may include a path prefix for reverse-proxy deployments:

```
add server at address http://10.0.1.10:8080
add server at address https://proxy.example.com/wsh-node-1
```

The hub immediately begins connecting to the backend. It starts in "connecting" state and transitions to "healthy" once the connection is established and the backend's hostname is resolved.
If the backend requires authentication, provide a token:

```
add server at address http://10.0.1.10:8080 with token "secret"
```

Get detailed status for a single server by hostname:

```
get server "prod-1"
```

Deregister a backend when it's no longer needed:

```
remove server "prod-1"
```

This disconnects from the backend and removes it from the cluster. Sessions that were running on that backend become inaccessible through the hub (they continue running on the backend itself).
After adding a backend, wait for it to become healthy before creating sessions on it. Poll the server list until the backend's health transitions from "connecting" to "healthy":

```
loop:
  list servers
  if target server is healthy: break
  wait briefly
  retry
```

This typically takes a few seconds. Don't proceed with session creation until the backend is healthy — requests to unhealthy backends will fail.
Target a specific server by hostname when creating a session:
create session "build" on server "prod-1"The hub forwards the creation request to the named backend. The session runs on that backend's hardware, in its environment, with its resources. All subsequent operations on that session are automatically routed through the hub to the right backend.
Sessions created without a server target run on the hub itself:
create session "local-work"This is exactly the same as single-server operation. The hub handles it locally without involving any backend.
Consider these factors when deciding where to create sessions:
Tags work transparently across server boundaries. Sessions on different backends can share the same tags, and tag-based queries aggregate results from all healthy servers.
Spread parallel work across the cluster using tags to track it:
create session "build-api" on server "build-1", tagged: ci
create session "build-web" on server "build-2", tagged: ci
create session "test-e2e" on server "test-1", tagged: ci
send each session its respective command
wait for idle across sessions tagged "ci"The idle detection races across all tagged sessions regardless of which server they're on. The first to settle is returned.
```
list sessions
```

Without a server filter, the session list aggregates across all healthy backends plus the hub. Each session in the response includes a `server` field indicating which server it lives on.

```
list sessions tagged "ci"
```

Tag filtering also works across the full cluster. Only sessions matching the tag are returned, from any server.

```
list sessions on server "build-1"
```

To see sessions on a specific backend only, filter by server.
Once a session exists, all operations work the same regardless of where it lives. The hub routes requests automatically:
send input to "build-api": cargo build\n
wait for idle on "build-api"
read screen from "build-api"
kill session "build-api"You don't need to remember which server a session is on. The hub tracks this mapping and routes transparently.
The server-level idle detection races across all sessions in the cluster:
```
wait for idle across all sessions (timeout 2000ms)
```

Returns whichever session becomes idle first, including its name and the server it's running on. Use `last_session` and `last_generation` to avoid re-returning the same session.
```
wait for idle across sessions tagged "build" (timeout 2000ms)
```

This is the most common pattern for distributed fan-out. Tag all related work, then poll idle across the group.
If you need to check just one server's sessions:
```
list sessions on server "build-1"
for each session:
  wait for idle
  read screen
  check results
```

When one stage must complete before the next begins:
```
# Stage 1: Build on the build server
create session "build" on server "build-1", tagged: pipeline
send to "build": make release\n
wait for idle on "build" (timeout 5000ms)
read screen from "build"
# verify success

# Stage 2: Test on the test server
create session "test" on server "test-1", tagged: pipeline
send to "test": ./run-tests.sh\n
wait for idle on "test" (timeout 5000ms)
read screen from "test"
# verify success

# Stage 3: Deploy on the deploy server
create session "deploy" on server "deploy-1", tagged: pipeline
send to "deploy": ./deploy.sh\n
...
```

Each stage runs on a different server but follows the same send/wait/read/decide loop.
When a backend becomes unavailable:
Check health before critical operations:
```
list servers
if target server is unavailable:
  fall back to another server or report the failure
```

Design for partial failure:
When fanning out across multiple backends, some may fail while others succeed. Collect results from the successful ones and handle failures individually rather than treating any failure as a total failure.
```
results = {}
for each session in the fan-out:
  try:
    wait for idle
    read screen
    results[session] = success
  catch server unavailable:
    results[session] = failed
report partial results
```

Use tags for recovery:
If a backend fails mid-workflow, you can recreate the affected sessions on a different backend with the same tags:
```
# Original session on failed backend
# create session "build" on server "build-1", tagged: ci

# Recovery: recreate on another backend
create session "build-retry" on server "build-2", tagged: ci
send to "build-retry": (same command)
```

Sessions are owned by the backend they run on. If you remove a backend from the cluster, its sessions continue running on that machine — they just become unreachable through the hub. If the backend process exits, its sessions end.
The hub doesn't migrate sessions. If a backend goes down and its sessions are lost, you need to recreate them on another backend.
Distribution adds latency and complexity. Every request to a remote session goes through the hub to the backend and back. If all your work can run on one machine, use multi-session on a single server.
Don't assume backends are always available. Check health before starting critical workflows, and design for graceful degradation when backends fail.
Remote sessions consume resources on backend machines. Clean up after yourself — don't leave orphaned sessions running on backends. The hub won't automatically kill sessions when you remove a backend.
Backends may require authentication tokens. Ensure tokens are configured correctly when adding backends. Without proper authentication, the hub won't be able to connect.
When the hub has an `[ip_access]` section in its federation config, backend addresses are checked against the blocklist and allowlist at registration time. Backends whose resolved IPs fall outside the allowed ranges are rejected. There is no hardcoded blocklist; the operator owns the threat model.
Every server in the cluster must have a unique hostname. If two backends have the same hostname, the second registration will be rejected. Configure unique hostnames for each backend if the system hostnames collide.