debug

Investigate stuck runs and execution failures by tracing Symphony and Codex logs with issue/session identifiers; use when runs stall, retry repeatedly, or fail unexpectedly.

Debug

Goals

  • Find why a run is stuck, retrying, or failing.
  • Correlate Linear issue identity to a Codex session quickly.
  • Read the right logs in the right order to isolate root cause.

Log Sources

  • Primary runtime log: log/symphony.log
    • Default comes from SymphonyElixir.LogFile (log/symphony.log).
    • Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
  • Rotated runtime logs: log/symphony.log*
    • Check these when the relevant run is older.

Correlation Keys

  • issue_identifier: human ticket key (example: MT-625)
  • issue_id: Linear UUID (stable internal ID)
  • session_id: Codex thread-turn pair (<thread_id>-<turn_id>)

elixir/docs/logging.md requires these fields for issue/session lifecycle logs. Use them as your join keys during debugging.
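
As a minimal sketch of using these fields as join keys, the helper below prints the distinct key=value pairs found on whatever log lines are piped into it. The function name correlation_keys is hypothetical, and it assumes each value ends at a space or ";", the same delimiter the session_id pattern in the commands of this skill relies on.

```shell
# Hypothetical helper: list distinct correlation keys from log lines on stdin.
# Assumes key=value fields whose values end at a space or ';'.
correlation_keys() {
  grep -oE '(issue_identifier|issue_id|session_id)=[^ ;]+' | sort -u
}

# Example: rg --no-filename "issue_identifier=MT-625" log/symphony.log* | correlation_keys
```

Seeing more than one issue_id for a single issue_identifier in the output is a hint that the slice mixes unrelated runs.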

Quick Triage (Stuck Run)

  1. Confirm scheduler/worker symptoms for the ticket.
  2. Find recent lines for the ticket (issue_identifier first).
  3. Extract session_id from matching lines.
  4. Trace that session_id across start, stream, completion/failure, and stall handling logs.
  5. Decide the class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.

Commands

# 1) Narrow by ticket key (fastest entry point)
rg -n "issue_identifier=MT-625" log/symphony.log*

# 2) If needed, narrow by Linear UUID
rg -n "issue_id=<linear-uuid>" log/symphony.log*

# 3) Pull session IDs seen for that ticket (filter by ticket first, then extract)
rg --no-filename "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u

# 4) Trace one session end-to-end
rg -n "session_id=<thread>-<turn>" log/symphony.log*

# 5) Focus on stuck/retry signals
rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*
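
Steps 1 and 3 can be combined into one reusable helper. The sketch below is illustrative only: sessions_for_ticket is a hypothetical name, and the match assumes the ticket key is followed by a space, a ";", or end of line, so that MT-625 does not also match a longer key such as MT-6250. It falls back to grep when rg is not installed.

```shell
# Hypothetical helper: distinct session IDs logged for one ticket key.
# Usage: sessions_for_ticket MT-625 log/symphony.log*
sessions_for_ticket() {
  key=$1
  shift
  # Anchor the key so a shorter key never matches a longer one (assumed delimiters).
  pattern="issue_identifier=$key([ ;]|\$)"
  if command -v rg >/dev/null 2>&1; then
    rg --no-filename -e "$pattern" "$@"
  else
    grep -hE -e "$pattern" "$@"
  fi | grep -oE 'session_id=[^ ;]+' | sort -u
}
```

Each line of output can be pasted directly into the step 4 search to trace that session.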

Investigation Flow

  1. Locate the ticket slice:
    • Search by issue_identifier=<KEY>.
    • If noise is high, add issue_id=<UUID>.
  2. Establish timeline:
    • Identify the first Codex session started ... session_id=... line.
    • Follow it through to Codex session completed, Codex session ended with error, or worker exit (Agent task exited ...) lines.
  3. Classify the problem:
    • Stall loop: Issue stalled ... restarting with backoff.
    • App-server startup: Codex session failed ....
    • Turn execution failure: turn_failed, turn_cancelled, turn_timeout, or ended with error.
    • Worker crash: Agent task exited ... reason=....
  4. Validate scope:
    • Check whether failures are isolated to one issue/session or repeating across multiple tickets.
  5. Capture evidence:
    • Save key log lines with timestamps, issue_identifier, issue_id, and session_id.
    • Record probable root cause and the exact failing stage.
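
The classification step above can be approximated mechanically. The following sketch tallies how many piped-in log lines match each failure class, using only the signal phrases listed in this skill. The function name classify_failures and the class labels it prints are hypothetical, and a line that carries two signals is counted under both classes.

```shell
# Hypothetical helper: count log lines per failure class, reading from stdin.
classify_failures() {
  awk '
    /Issue stalled/        { n["stall_loop"]++ }
    /Codex session failed/ { n["app_server_startup"]++ }
    /turn_failed|turn_cancelled|turn_timeout|ended with error/ { n["turn_failure"]++ }
    /Agent task exited/    { n["worker_crash"]++ }
    END { for (k in n) print k, n[k] }
  ' | sort
}

# Example: rg --no-filename "issue_identifier=MT-625" log/symphony.log* | classify_failures
```

Running it once on a single-ticket slice and once on the whole log gives a rough answer to the scope question: a class that dominates both is probably not specific to one issue.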

Reading Codex Session Logs

In Symphony, Codex session diagnostics are emitted into log/symphony.log and keyed by session_id. Read them as a lifecycle:

  1. Codex session started ... session_id=...
  2. Session stream/lifecycle events for the same session_id
  3. Terminal event:
    • Codex session completed ..., or
    • Codex session ended with error ..., or
    • Issue stalled ... restarting with backoff

For one specific session investigation, keep the trace narrow:

  1. Capture one session_id for the ticket.
  2. Build a timestamped slice for only that session:
    • rg -n "session_id=<thread>-<turn>" log/symphony.log*
  3. Mark the exact failing stage:
    • Startup failure before stream events (Codex session failed ...).
    • Turn/runtime failure after stream events (turn_* / ended with error).
    • Stall recovery (Issue stalled ... restarting with backoff).
  4. Pair findings with issue_identifier and issue_id from nearby lines to confirm you are not mixing concurrent retries.

Always pair session findings with issue_identifier/issue_id to avoid mixing concurrent runs.
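
To mark the failing stage for one session without reading every line, a small filter can report the last terminal signal in a single-session slice. This is a sketch under assumptions: session_outcome and its output labels are hypothetical, the matched phrases are the ones quoted in this skill, and the input must already be in chronological order (when reading rotated logs, pass the oldest file first).

```shell
# Hypothetical helper: last terminal signal seen in a single-session slice on stdin.
session_outcome() {
  awk '
    /Codex session failed/           { last = "startup_failed" }
    /Codex session completed/        { last = "completed" }
    /Codex session ended with error/ { last = "ended_with_error" }
    /Issue stalled/                  { last = "stalled" }
    END { print (last == "" ? "no_terminal_event" : last) }
  '
}

# Example, with sid holding one session_id value:
#   rg --no-filename "session_id=$sid" log/symphony.log | session_outcome
```

A result of no_terminal_event means the slice shows the session starting but never finishing, which is worth checking against the rotated logs before treating it as a hang.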

Notes

  • Prefer rg over grep for speed on large logs.
  • Check rotated logs (log/symphony.log*) before concluding data is missing.
  • If required context fields are missing in new log statements, align with elixir/docs/logging.md conventions.
Repository
openai/symphony