Fix Temporal Cloud connection, auth, and config problems. Use when users hit login failures, can't connect to Cloud, get x509/TLS errors, have namespace or endpoint mismatches, paste broken SDK connection snippets, are confused about which endpoint to use, see "no pollers" or RESOURCE_EXHAUSTED, struggle with PrivateLink/PSC, or need help setting up a new namespace. Also use for HA namespace failover and DNS issues. Not for worker performance tuning or scaling.
93
92%
Does it follow best practices?
Impact
91%
1.33xAverage score across 3 eval scenarios
Passed
No known issues
Help users diagnose and resolve Temporal Cloud connectivity, authentication, and configuration issues using tcld and temporal CLI.
Cloud issues are frustrating because they sit at the intersection of configuration, networking, authentication, and Temporal-specific code. Most problems fall into predictable patterns. This skill provides systematic diagnosis to quickly identify root causes and prescribe fixes.
References:
references/cloud-troubleshooting-reference.md for full CLI command reference and error codesreferences/common-scenarios.md for step-by-step setup walkthroughsOut of scope: Worker performance tuning, scaling, metrics interpretation, SDK-specific config, deployment patterns. Those topics are covered by separate worker-focused skills.
| Category | Key Symptoms | First Check |
|---|---|---|
| tcld Login | login failed, token refresh failed, wrong account | tcld account get |
| Connection/Auth | can't connect, access denied, handshake failures | Endpoint format + DNS + port connectivity |
| Ambiguous Runtime Errors | context deadline exceeded, workflow is busy | Identify the operation and layer first |
| mTLS/Certs | x509 errors, unknown authority, expired | openssl x509 -enddate |
| Namespace | namespace not found, SNI mismatch | Namespace name format |
| HA / Failover | Failover not working, wrong region, DNS stale | DNS CNAME resolution |
| Worker | Tasks not picked up, stale connections | temporal task-queue describe |
| Private Connectivity | PrivateLink/PSC errors | VPC endpoint status |
| Rate Limiting | RESOURCE_EXHAUSTED | APS limits |
Ask the user:
For SDK/client snippet reviews:
HostPort / address are you using?temporal CLI, or tcld?For tcld issues:
tcld account get?For connection issues:
HostPort?For ambiguous runtime errors:
For certificate issues:
For worker issues:
temporal task-queue describe show?Use the appropriate decision tree based on category (see below).
Give specific commands to resolve the issue, with verification steps.
Always include a confidence score for the proposed diagnosis or fix:
Confidence: 9-10/10 when the symptom, operation, and confirming signals line up cleanlyConfidence: 6-8/10 when the evidence is good but one plausible alternative remainsConfidence: 1-5/10 when the issue is still ambiguous and the "fix" is really the next discriminating checkIf the problem is ambiguous, say so explicitly and keep the recommendation scoped to the next check rather than presenting a speculative root cause as settled.
Symptom: tcld login not working
│
├─ Can `tcld account get` run?
│ ├─ Yes → Login is valid; continue with account verification
│ └─ No → Run `tcld login`
│
├─ Token refresh failed?
│ └─ tcld logout && tcld login
│
├─ Wrong organization/account?
│ ├─ tcld account get
│ └─ Verify the expected namespace appears in `tcld namespace list`
│
└─ "unauthorized" or auth errors?
└─ tcld logout && tcld loginDocs: Environment configuration - SDK connection options
Endpoint check before network debugging:
| Use case | Recommended endpoint | Notes |
|---|---|---|
| Workers & clients (all auth) | <namespace>.<account>.tmprl.cloud:7233 | Namespace Endpoint - works for both mTLS and API key auth. Recommended for all namespaces. |
| Multi-region HA (advanced) | <region>.<cloud_provider>.api.temporal.io:7233 | Regional Endpoint - only needed for advanced HA routing. See namespace access docs. |
| tcld / Cloud Ops API | saas-api.tmprl.cloud | Control plane |
Exception: Namespaces using Flexible Auth (pre-release) cannot use Namespace Endpoints yet.
Symptom: Can't connect to Temporal Cloud
│
├─ Check: Using Namespace Endpoint?
│ ├─ Using regional endpoint (`*.api.temporal.io`) without HA need?
│ │ └─ Switch to Namespace Endpoint (`<ns>.<acct>.tmprl.cloud:7233`)
│ ├─ Using old/stale endpoint format?
│ │ └─ Switch to Namespace Endpoint
│ └─ Endpoint looks correct → Continue
│
├─ Check: DNS resolution
│ └─ nslookup <host-from-address>
│ ├─ Fails → DNS issue (check network, VPN)
│ └─ Succeeds → Continue
│
├─ Check: Port connectivity
│ └─ nc -zv <host-from-address> 7233
│ ├─ Fails → Firewall blocking port 7233
│ └─ Succeeds → Continue
│
├─ Check: TLS handshake
│ └─ openssl s_client -connect <address>
│ ├─ Fails → Certificate issue (see mTLS tree)
│ └─ Succeeds → Continue
│
└─ Check: Temporal CLI test
└─ temporal workflow list --limit 1 --address ...
├─ PERMISSION_DENIED → Check namespace name format
├─ UNAUTHENTICATED → Certificate not accepted
└─ Works → Connection OK, issue elsewhereDo not assume these are pure connectivity failures. Classify them by operation first.
| Error text | Common interpretations | First discriminator |
|---|---|---|
context deadline exceeded | wrong endpoint, network timeout, oversized payload, blocked execution path, client-side timeout | Where in the flow does it occur? |
workflow is busy / RESOURCE_EXHAUSTED: Workflow is busy | operation-level contention, workload pressure, confusing user-facing error semantics | Which operation returned it? |
no pollers | no connected workers, workers present but misconfigured, stale/misleading metrics | Does temporal task-queue describe show pollers? |
Use this decision sequence:
Symptom: ambiguous runtime error
│
├─ Check: Which operation returned the error?
│ ├─ start / signal / update / query request
│ ├─ poll loop / worker logs
│ └─ UI / metrics only
│
├─ Check: Is work reaching a task queue?
│ ├─ No pollers listed
│ │ └─ Treat as worker connectivity / config until proven otherwise
│ ├─ Pollers listed, backlog growing
│ │ └─ Worker capacity / tuning issue (out of scope for this skill)
│ └─ Pollers listed, no backlog issue
│ └─ Continue
│
├─ For `context deadline exceeded`
│ ├─ Happens before any work starts
│ │ └─ Check endpoint format, auth, proxy, DNS, firewall
│ ├─ Happens on workflow start with large payloads
│ │ └─ Consider payload size / client timeout path
│ └─ Happens during local execution / queries
│ └─ Consider blocked execution path, local activity, or client-side timeout
│
└─ For `workflow is busy`
├─ Identify exact API / operation
├─ Check whether user is conflating this with a generic workflow failure
└─ Explain that the error class is operation-specific before prescribing fixesIf the operation and surrounding signals still do not make the error interpretable, label it as ambiguous and gather more context before prescribing a fix.
When responding, attach a confidence score from 1-10 to the proposed diagnosis or next step. Ambiguous cases should carry a low-confidence score and a narrow next check rather than a broad claimed fix.
When the user pastes SDK config, validate the config itself before suggesting lower-level networking checks.
Review in this order:
HostPort: should be Namespace Endpoint (<ns>.<acct>.tmprl.cloud:7233) for most cases<namespace>.<account-id>)tls.Config{} is normal for API key auth; client cert/key required for mTLSTEMPORAL_ADDRESS, TEMPORAL_NAMESPACE, TEMPORAL_API_KEY, TEMPORAL_TLS_CLIENT_CERT_PATH, TEMPORAL_TLS_CLIENT_KEY_PATHCommon snippet diagnoses:
*.api.temporal.io) when Namespace Endpoint would work → simplify to <ns>.<acct>.tmprl.cloud:7233HostPort + Cloud namespace/auth → missing explicit Cloud endpointIf the snippet is wrong, fix that first. Do not lead with DNS/TLS debugging until the endpoint and namespace are plausible.
Symptom: x509 certificate errors
│
├─ "certificate signed by unknown authority"
│ ├─ Is CA uploaded to namespace?
│ │ └─ tcld namespace accepted-client-ca list --namespace <ns>
│ │ ├─ CA not listed → Add it:
│ │ │ tcld namespace accepted-client-ca add \
│ │ │ --namespace <ns> --ca-certificate-file ca.pem
│ │ └─ CA listed → Cert not signed by that CA
│ │ └─ Verify: openssl verify -CAfile ca.pem client.pem
│ │
│ └─ Self-signed cert without CA?
│ └─ Must use CA-signed certs for Cloud
│
├─ "certificate has expired"
│ ├─ Check expiry: openssl x509 -enddate -noout -in cert.pem
│ └─ Generate new cert:
│ tcld generate-certificates end-entity-certificate \
│ --organization <org> --validity-period 365d \
│ --ca-certificate-file ca.pem --ca-key-file ca.key \
│ --certificate-file client.pem --key-file client.key
│
├─ "private key does not match"
│ └─ Wrong key file - verify match:
│ openssl x509 -modulus -noout -in cert.pem | md5
│ openssl rsa -modulus -noout -in key.pem | md5
│
└─ "bad certificate" from server
└─ Server rejected cert - CA not in accepted list
└─ tcld namespace accepted-client-ca list --namespace <ns>Symptom: namespace not found or access denied
│
├─ Check: Namespace name format
│ ├─ Format: <namespace-name>.<account-id>
│ ├─ Example: my-namespace.a1b2c3
│ └─ Wrong format → Use full namespace name
│
├─ Check: Namespace exists
│ └─ tcld namespace list
│ └─ Not listed → Wrong account, or namespace not created
│
├─ Check: Address format
│ ├─ Namespace Endpoint (recommended): <namespace>.<account>.tmprl.cloud:7233
│ ├─ Regional Endpoint (HA only): <region>.<cloud_provider>.api.temporal.io:7233
│ └─ Using wrong or stale endpoint? → switch to Namespace Endpoint
│
└─ Check: User permissions
└─ tcld user list
└─ Verify user has access to namespaceThis skill diagnoses Cloud connectivity issues for workers. Worker performance tuning, scaling, and deployment patterns are out of scope.
Docs: Environment configuration - SDK connection setup
Symptom: Workers not picking up tasks
│
├─ Check: Are workers running?
│ └─ Verify worker process is up, check logs for errors
│
├─ Check: Task queue status
│ └─ temporal task-queue describe --task-queue <queue>
│ ├─ No pollers listed → Workers not connected (Cloud issue)
│ │ ├─ Check task queue name matches
│ │ ├─ Check namespace in worker config
│ │ ├─ Verify SDK connection options (see env config docs)
│ │ └─ Check for connection errors in logs
│ │
│ │ Note: if only metrics or dashboards say "no pollers", verify with CLI or another direct surface before concluding workers are absent.
│ │
│ ├─ Pollers listed but backlog growing
│ │ └─ NOT a Cloud issue (worker scaling/tuning problem)
│ │
│ └─ Pollers listed, no backlog
│ └─ Workers healthy, issue is elsewhere
│
├─ Check: DNS caching (common in K8s)
│ └─ Stale DNS can cause workers to connect to wrong endpoint
│ └─ Restart workers to refresh DNS
│
└─ Check: Rate limiting
└─ RESOURCE_EXHAUSTED in logs?
└─ See Rate Limiting sectionScope clarification:
| Issue Type | In Scope? |
|---|---|
| tcld login, certs, namespace, private connectivity | Yes |
| Worker scaling, metrics, tuning, deployment | No |
| "Workers not picking up tasks" | Yes - diagnose, hand off if not a Cloud issue |
Docs: HA namespace connectivity
HA (multi-region) namespaces use a hierarchical DNS structure:
<ns>.<acct>.tmprl.cloud (CNAME to active region)<region>.region.tmprl.cloudSymptom: HA namespace connectivity or failover issues
│
├─ Check: DNS resolution
│ └─ nslookup <namespace>.tmprl.cloud
│ ├─ Should return CNAME → <region>.region.tmprl.cloud
│ └─ Then resolve to IP address
│
├─ Symptom: Clients not failing over
│ ├─ Check: DNS caching
│ │ ├─ TTL is 15s - clients should converge within 30s
│ │ ├─ Some DNS resolvers cache longer
│ │ └─ Fix: Restart workers to refresh DNS
│ │
│ └─ Check: SDK connection caching
│ └─ Some SDKs cache connections - may need restart
│
├─ Symptom: PrivateLink not working after failover
│ ├─ Check: Regional DNS override configured?
│ │ └─ Need Route 53 private hosted zone for region.tmprl.cloud
│ │ mapping regional endpoints to VPC endpoint IPs
│ │
│ └─ Check: Inter-region connectivity
│ └─ Workers need Transit Gateway or VPC Peering to reach
│ VPC endpoints in both regions
│
├─ Symptom: GCP Private Service Connect not working
│ └─ NOT SUPPORTED: Private connectivity not yet offered for
│ GCP Multi-region Namespaces
│
└─ Symptom: sa-east-1 region issues
└─ NOT SUPPORTED: sa-east-1 not available for Multi-region namespaces (because there are no other regions on the continent)Worker placement for HA:
Symptom: PrivateLink or Private Service Connect not working
│
├─ AWS PrivateLink
│ ├─ Check: VPC endpoint status
│ │ └─ AWS Console → VPC → Endpoints → Status = available
│ │
│ ├─ Check: Security groups
│ │ └─ Allow outbound to endpoint on port 7233
│ │
│ ├─ Check: DNS resolution
│ │ └─ Should resolve to private IP (10.x or 172.x)
│ │
│ └─ Check: Connectivity rules
│ ├─ tcld connectivity-rule list --namespace <ns>
│ └─ If needed, attach rules with:
│ tcld namespace set-connectivity-rules --namespace <ns> --connectivity-rule-ids <id>
│
├─ GCP Private Service Connect
│ ├─ Check: PSC endpoint status
│ │ └─ GCP Console → Network Services → Private Service Connect
│ │
│ ├─ Check: Firewall rules
│ │ └─ Allow egress to PSC endpoint
│ │
│ └─ Check: DNS configuration
│ └─ Cloud DNS zone for tmprl.cloud pointing to PSC
│
└─ General
└─ Verify rule details:
tcld connectivity-rule get --connectivity-rule-id <id>Symptom: RESOURCE_EXHAUSTED errors
│
├─ Check: Which operation is limited?
│ └─ Error message indicates operation type
│
├─ "namespace write ops" exceeded
│ ├─ Too many workflow starts
│ ├─ Too many signals/updates
│ └─ Fix: Add backoff, batch operations
│
├─ "namespace read ops" exceeded
│ ├─ Too many list/query operations
│ └─ Fix: Add caching, reduce polling frequency
│
└─ Poll operations rate limited
├─ Too many pollers across workers
└─ Fix: Reduce MaxConcurrentPollers settingsSee references/common-scenarios.md for step-by-step walkthroughs:
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Wrong namespace format | Using short name instead of name.account-id | Use full namespace from tcld namespace list |
| Self-signed certs | Trying to use self-signed without CA | Generate CA first, sign certs with it, upload CA |
| Regional endpoint when unnecessary | Using *.api.temporal.io when Namespace Endpoint works | Switch to <namespace>.<account>.tmprl.cloud:7233 |
| Old endpoint docs | Following stale examples from before Namespace Endpoints were universal | Use Namespace Endpoint: <ns>.<acct>.tmprl.cloud:7233 |
| Expired certs | Not monitoring expiry | Set up alerts, rotate before expiry |
| tcld wrong account | Logged into different org | Use tcld account get, then verify the namespace in tcld namespace list |
| Stale tcld login | Cached auth state is no longer valid | tcld logout && tcld login |
| DNS caching | K8s pods caching old DNS | Restart pods after endpoint changes |
| Missing port | Firewall blocks 7233 | Ensure egress allowed on port 7233 |
<ns>.<acct>.tmprl.cloud:7233 works for both mTLS and API key auth2d867d1
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.